    Deep Learning Prediction of Severe Health Risks for Pediatric COVID-19 Patients with a Large Feature Set in 2021 BARDA Data Challenge. (arXiv:2206.01696v1 [cs.LG])
    Most children infected with COVID-19 have no or mild symptoms and recover on their own, but some pediatric COVID-19 patients need to be hospitalized or even receive intensive medical care (e.g., invasive mechanical ventilation or cardiovascular support) to recover from the illness. Therefore, it is critical to predict the severe health risk that COVID-19 infection poses to children in order to provide precise and timely medical care for vulnerable pediatric COVID-19 patients. However, predicting the severe health risk for COVID-19 patients, including children, remains a significant challenge because many underlying medical factors affecting the risk are still largely unknown. In this work, instead of searching for a small number of most useful features to make predictions, we design a novel large-scale bag-of-words-like method to represent the various medical conditions and measurements of COVID-19 patients. After simple feature filtering based on logistic regression, the large feature set is used with a deep learning method to predict both the hospitalization risk for COVID-19 infected children and the severe complication risk for hospitalized pediatric COVID-19 patients. The method was trained and tested on the datasets of the Biomedical Advanced Research and Development Authority (BARDA) Pediatric COVID-19 Data Challenge held from Sept. 15 to Dec. 17, 2021. The results show that the approach predicts the risk of hospitalization and severe complication for pediatric COVID-19 patients rather accurately, and that deep learning is more accurate than other machine learning methods.
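    As a minimal sketch of how such a pipeline might look (not the authors' code; the patient records, tokens, and model sizes below are hypothetical), one can encode each patient's conditions and measurements as bag-of-words counts, filter features by the magnitude of their logistic-regression weights, and feed the surviving features to a neural network:

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.neural_network import MLPClassifier

        # Hypothetical records: each patient is a set of condition/measurement tokens.
        patients = [["fever", "cough", "spo2_low"], ["asthma", "fever"], ["cough"]]
        y = np.array([1, 1, 0])  # 1 = hospitalized

        X = CountVectorizer().fit_transform(" ".join(p) for p in patients).toarray()

        # Simple filtering: keep features with above-median logistic-regression weight.
        w = np.abs(LogisticRegression().fit(X, y).coef_[0])
        X_kept = X[:, w > np.quantile(w, 0.5)]

        clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X_kept, y)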
    A Fair Empirical Risk Minimization with Generalized Entropy. (arXiv:2202.11966v2 [cs.LG] UPDATED)
    Recently, a parametric family of fairness metrics based on generalized entropy, which was originally used in economics and public welfare, has been proposed to quantify algorithmic fairness. Since these metrics have several advantages, such as quantifying unfairness at both the individual level and the group level, and unfolding the trade-off between individual fairness and group-level fairness, an algorithmic fairness requirement may be given in terms of generalized entropy for a fair classification problem. We consider fair empirical risk minimization with a fairness constraint specified by generalized entropy. We theoretically investigate whether this fair classification problem is learnable and how to find an approximately optimal classifier for it.
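    For reference, the generalized entropy index from the economics literature, a plausible form of the metric the abstract refers to (with $b_i$ the benefit assigned to individual $i$ and $\mu$ the mean benefit), is

    $$\mathcal{E}^{\alpha}(b_1,\dots,b_n) = \frac{1}{n\,\alpha(\alpha-1)} \sum_{i=1}^{n} \left[\left(\frac{b_i}{\mu}\right)^{\alpha} - 1\right], \qquad \alpha \notin \{0,1\}, \quad \mu = \frac{1}{n}\sum_{i=1}^{n} b_i,$$

    which is zero exactly when all benefits are equal and grows with inequality; applying it within groups versus across the whole population is what yields the group-level/individual-level decomposition.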
    Zero-Shot Bird Species Recognition by Learning from Field Guides. (arXiv:2206.01466v1 [cs.CV])
    We exploit field guides to learn bird species recognition, in particular zero-shot recognition of unseen species. The illustrations contained in field guides deliberately focus on discriminative properties of a species, and can serve as side information to transfer knowledge from seen to unseen classes. We study two approaches: (1) a contrastive encoding of illustrations that can be fed into zero-shot learning schemes; and (2) a novel method that leverages the fact that illustrations are also images and as such structurally more similar to photographs than other kinds of side information. Our results show that illustrations from field guides, which are readily available for a wide range of species, are indeed a competitive source of side information. On the iNaturalist2021 subset, with 749 seen and 739 unseen classes, we obtain a harmonic mean greater than $45\%$ (@top-10) and $15\%$ (@top-1), which shows that field guides are a valuable option for challenging real-world scenarios with many species.
    Revisiting the "Video" in Video-Language Understanding. (arXiv:2206.01720v1 [cs.CV])
    What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
    Rashomon Capacity: A Metric for Predictive Multiplicity in Probabilistic Classification. (arXiv:2206.01295v1 [cs.LG])
    Predictive multiplicity occurs when classification models with nearly indistinguishable average performances assign conflicting predictions to individual samples. When used for decision-making in applications of consequence (e.g., lending, education, criminal justice), models developed without regard for predictive multiplicity may result in unjustified and arbitrary decisions for specific individuals. We introduce a new measure of predictive multiplicity in probabilistic classification called Rashomon Capacity. Prior metrics for predictive multiplicity focus on classifiers that output thresholded (i.e., 0-1) predicted classes. In contrast, Rashomon Capacity applies to probabilistic classifiers, capturing more nuanced score variations for individual samples. We provide a rigorous derivation for Rashomon Capacity, argue its intuitive appeal, and demonstrate how to estimate it in practice. We show that Rashomon Capacity yields principled strategies for disclosing conflicting models to stakeholders. Our numerical experiments illustrate how Rashomon Capacity captures predictive multiplicity in various datasets and learning models, including neural networks. The tools introduced in this paper can help data scientists measure, report, and ultimately resolve predictive multiplicity prior to model deployment.
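    A hedged sketch of how such a capacity-style quantity can be computed (assuming, per the channel view the name suggests, that the output distributions assigned to one sample by the near-optimal models are stacked into a row-stochastic matrix whose channel capacity is then taken; the Blahut-Arimoto routine below is a standard capacity solver, and the paper's exact scaling may differ):

        import numpy as np

        def channel_capacity(P, iters=200):
            """Blahut-Arimoto capacity (in bits) of a row-stochastic channel P;
            rows index models in the Rashomon set, columns index classes."""
            w = np.full(P.shape[0], 1.0 / P.shape[0])   # weighting over models
            for _ in range(iters):
                q = w @ P                               # induced class marginal
                d = np.sum(P * np.log2(P / q), axis=1)  # per-model KL to marginal
                w = w * np.exp2(d)
                w /= w.sum()
            q = w @ P
            return float(np.sum(w * np.sum(P * np.log2(P / q), axis=1)))

        # three near-equally-accurate models that disagree on one sample
        P = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
        print(channel_capacity(P))   # > 0 flags predictive multiplicity

    If the metric is reported on an effective-number-of-classes scale, `2 ** channel_capacity(P)` would give that reading.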
    Which Explanation Should I Choose? A Function Approximation Perspective to Characterizing Post hoc Explanations. (arXiv:2206.01254v1 [cs.LG])
    Despite the plethora of post hoc model explanation methods, the basic properties and behavior of these methods and the conditions under which each one is effective are not well understood. In this work, we bridge these gaps and address a fundamental question: Which explanation method should one use in a given situation? To this end, we adopt a function approximation perspective and formalize the local function approximation (LFA) framework. We show that popular explanation methods are instances of this framework, performing function approximations of the underlying model in different neighborhoods using different loss functions. We introduce a no-free-lunch theorem for explanation methods which demonstrates that no single method can perform optimally across all neighborhoods and calls for choosing among methods. To choose among methods, we set forth a guiding principle based on the function approximation perspective, considering a method to be effective if it recovers the underlying model when the model is a member of the explanation function class. Then, we analyze the conditions under which popular explanation methods are effective and provide recommendations for choosing among explanation methods and creating new ones. Lastly, we empirically validate our theoretical results using various real-world datasets, model classes, and prediction tasks. By providing a principled mathematical framework which unifies diverse explanation methods, our work characterizes the behavior of these methods and their relation to one another, guides the choice of explanation methods, and paves the way for the creation of new ones.
    Reinforcement Learning with Fast Stabilization in Linear Dynamical Systems. (arXiv:2007.12291v2 [cs.LG] UPDATED)
    In this work, we study model-based reinforcement learning (RL) in unknown stabilizable linear dynamical systems. When learning a dynamical system, one needs to stabilize the unknown dynamics in order to avoid system blow-ups. We propose an algorithm that certifies fast stabilization of the underlying system by effectively exploring the environment with an improved exploration strategy. We show that the proposed algorithm attains $\tilde{\mathcal{O}}(\sqrt{T})$ regret after $T$ time steps of agent-environment interaction. We also show that the regret of the proposed algorithm has only a polynomial dependence on the problem dimensions, which gives an exponential improvement over the prior methods. Our improved exploration method is simple, yet efficient, and it combines a sophisticated exploration policy in RL with an isotropic exploration strategy to achieve fast stabilization and improved regret. We empirically demonstrate that the proposed algorithm outperforms other popular methods in several adaptive control tasks.
    Instance-dependent Label-noise Learning under a Structural Causal Model. (arXiv:2109.02986v3 [stat.ML] UPDATED)
    Label noise degrades the performance of deep learning algorithms because deep neural networks easily overfit label errors. Let X and Y denote the instance and clean label, respectively. When Y is a cause of X, according to which many datasets have been constructed, e.g., SVHN and CIFAR, the distributions of P(X) and P(Y|X) are entangled. This means that the unsupervised instances are helpful for learning the classifier and thus reduce the side effects of label noise. However, it remains elusive how to exploit the causal information to handle the label-noise problem. In this paper, by leveraging a structural causal model, we propose a novel generative approach for instance-dependent label-noise learning. In particular, we show that properly modeling the instances will contribute to the identifiability of the label noise transition matrix and thus lead to a better classifier. Empirically, our method outperforms all state-of-the-art methods on both synthetic and real-world label-noise datasets.
    Compositional Visual Generation with Composable Diffusion Models. (arXiv:2206.01714v1 [cs.CV])
    Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, such as confusing the attributes of different objects or relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation.
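    A hedged sketch of the composition operator (my reading of the energy-based interpretation described above, not the authors' code): for noise-prediction diffusion networks, a conjunction of concepts can be realized by adding weighted conditional score offsets to the unconditional estimate. `eps` is a hypothetical stand-in for such a network.

        import numpy as np

        def composed_eps(eps, x, t, concepts, weights):
            # sum of per-concept score offsets on top of the unconditional estimate
            e_uncond = eps(x, t, None)
            e = e_uncond.copy()
            for c, w in zip(concepts, weights):
                e += w * (eps(x, t, c) - e_uncond)
            return e  # use in the reverse diffusion step in place of a single eps

        # toy stand-in for a trained noise-prediction network
        def eps(x, t, c):
            shift = 0.0 if c is None else (hash(c) % 3) - 1
            return 0.1 * (x - shift)

        x = np.random.default_rng(0).normal(size=4)
        print(composed_eps(eps, x, 10, ["a red cube", "a blue sphere"], [1.0, 1.0]))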
    Three-dimensional microstructure generation using generative adversarial neural networks in the context of continuum micromechanics. (arXiv:2206.01693v1 [cond-mat.mtrl-sci])
    Multiscale simulations are demanding in terms of computational resources. In the context of continuum micromechanics, the multiscale problem arises from the need to infer macroscopic material parameters from the microscale. If the underlying microstructure is explicitly given by means of microCT-scans, convolutional neural networks can be used to learn the microstructure-property mapping, which is usually obtained from computational homogenization. The CNN approach provides a significant speedup, especially in the context of heterogeneous or functionally graded materials. Another application is uncertainty quantification, where many expensive evaluations are required. However, one bottleneck of this approach is the large number of training microstructures needed. This work closes this gap by proposing a generative adversarial network tailored towards three-dimensional microstructure generation. The lightweight algorithm is able to learn the underlying properties of the material from a single microCT-scan without the need of explicit descriptors. At prediction time, the network can produce unique three-dimensional microstructures with the same properties as the original data in a fraction of a second and at consistently high quality.
    Hydra: A System for Large Multi-Model Deep Learning. (arXiv:2110.08633v6 [cs.DC] UPDATED)
    Scaling up model depth and size is now a common approach to raise accuracy in many deep learning (DL) applications, as evidenced by the widespread success of multi-billion or even trillion parameter models in natural language processing (NLP) research. Despite success in DL research and at major technology companies, broader practical adoption of such large models among domain scientists and businesses is still bottlenecked by GPU memory limits, high training costs, and low GPU availability, even on public clouds. Model selection needs further compound these resource challenges: users often need to compare dozens of models with different hyper-parameters or neural architectures to suit their specific task and dataset. In this paper, we present Hydra, a system designed to tackle such challenges by enabling out-of-the-box scaling for multi-large-model DL workloads on even commodity GPUs in a resource-efficient manner. Hydra is the first approach to holistically optimize the execution of multi-model workloads for large DL models. We do this by adapting prior "model-parallel" execution schemes to work with scalable parameter offloading across the memory hierarchy and further hybridizing this approach with task-parallel job scheduling techniques. Hydra decouples scalability of model parameters from parallelism of execution, thus enabling DL users to train even a 6-billion parameter model on a single commodity GPU. It also fully exploits the speedup potential of task parallelism in multi-GPU setups, yielding near-linear strong scaling and making rigorous model selection perhaps more practical for such models. We evaluate end-to-end performance by fine-tuning GPT-2 for language modeling. We find that Hydra offers between 50% and 100% higher training throughput than even the best settings of state-of-the-art industrial frameworks such as DeepSpeed and GPipe for multi-large-model training.
    Learning Soft Constraints From Constrained Expert Demonstrations. (arXiv:2206.01311v1 [cs.LG])
    Inverse reinforcement learning (IRL) methods assume that the expert data is generated by an agent optimizing some reward function. However, in many settings, the agent may optimize a reward function subject to some constraints, where the constraints induce behaviors that may otherwise be difficult to express with just a reward function. We consider the setting where the reward function is given and the constraints are unknown, and propose a method that can recover these constraints satisfactorily from the expert data. While previous work has focused on recovering hard constraints, our method can recover cumulative soft constraints that the agent satisfies on average per episode. In IRL fashion, our method solves this problem by adjusting the constraint function iteratively through a constrained optimization procedure, until the agent behavior matches the expert behavior. Despite the simplicity of the formulation, our method is able to obtain good results. We demonstrate our approach on synthetic environments and real-world highway driving data.
    NanoBatch Privacy: Enabling fast Differentially Private learning on the IPU. (arXiv:2109.12191v2 [cs.LG] UPDATED)
    Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, the DPSGD algorithm imposes computational overheads that can undo the benefit of batching in GPUs. Micro-batching is a common method to alleviate this and is fully supported in the TensorFlow Privacy library (TFDP). However, it degrades accuracy. We propose NanoBatch Privacy, a lightweight add-on to TFDP to be used on Graphcore IPUs, which leverages a batch size of 1 (without micro-batching) and gradient accumulation. This allows us to achieve large total batch sizes with minimal impact on throughput. Second, using CIFAR-10, we illustrate that larger batch sizes are not necessarily optimal from a privacy-versus-utility perspective. On ImageNet, we achieve more than a 15x speedup over TFDP on 8x A100s, and significant speedups even across libraries such as Opacus. We also provide two extensions: 1) DPSGD for pipelined models and 2) per-layer clipping that is 15x faster than the Opacus implementation on 8x A100s. Finally, as an application case study, we apply NanoBatch training to private COVID-19 chest CT prediction.
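    A hedged numpy sketch of the core recipe (per-example clipping with a batch size of 1 per backward pass plus gradient accumulation, in the standard Gaussian-mechanism form of DP-SGD; `grad` is a hypothetical per-example gradient function):

        import numpy as np

        def dpsgd_step(params, grad, batch, clip=1.0, sigma=1.0, lr=0.1, seed=0):
            rng = np.random.default_rng(seed)
            acc = np.zeros_like(params)
            for x, y in batch:                 # one example per backward pass
                g = grad(params, x, y)
                g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))  # per-example clip
                acc += g                       # gradient accumulation
            noise = rng.normal(0.0, sigma * clip, size=params.shape)
            return params - lr * (acc + noise) / len(batch)

        # toy usage: one step of private linear regression
        grad = lambda w, x, y: 2 * (w @ x - y) * x
        w = dpsgd_step(np.zeros(3), grad,
                       [(np.array([1.0, 0.0, 2.0]), 1.0),
                        (np.array([0.0, 1.0, 1.0]), -1.0)])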
    Supernet Training for Federated Image Classification under System Heterogeneity. (arXiv:2206.01366v1 [cs.LG])
    Efficient deployment of deep neural networks across many devices and resource constraints, especially on edge devices, is one of the most challenging problems in the presence of data-privacy preservation issues. Conventional approaches have evolved either to improve a single global model while keeping each local training dataset decentralized (i.e., data-heterogeneity) or to train a once-for-all network that supports diverse architectural settings to address heterogeneous systems equipped with different computational capabilities (i.e., model-heterogeneity). However, little research has considered both directions simultaneously. In this work, we propose a novel framework that considers both scenarios, namely Federation of Supernet Training (FedSup), where clients send and receive a supernet that contains all possible architectures sampled from itself. It is inspired by the observation that averaging parameters in the model aggregation stage of Federated Learning (FL) is similar to weight-sharing in supernet training. Specifically, in the FedSup framework, a weight-sharing approach widely used for training single-shot models is combined with the averaging of Federated Learning (FedAvg). Under our framework, we present an efficient algorithm (E-FedSup) that sends sub-models to clients in the broadcast stage to reduce communication costs and training overhead. We demonstrate several strategies to enhance supernet training in the FL environment and conduct extensive empirical evaluations. The resulting framework is shown to pave the way for robustness to both data- and model-heterogeneity on several standard benchmarks.
    Compressive Fourier collocation methods for high-dimensional diffusion equations with periodic boundary conditions. (arXiv:2206.01255v1 [math.NA])
    High-dimensional Partial Differential Equations (PDEs) are a popular mathematical modelling tool, with applications ranging from finance to computational chemistry. However, standard numerical techniques for solving these PDEs are typically affected by the curse of dimensionality. In this work, we tackle this challenge while focusing on stationary diffusion equations defined over a high-dimensional domain with periodic boundary conditions. Inspired by recent progress in sparse function approximation in high dimensions, we propose a new method called compressive Fourier collocation. Combining ideas from compressive sensing and spectral collocation, our method replaces the use of structured collocation grids with Monte Carlo sampling and employs sparse recovery techniques, such as orthogonal matching pursuit and $\ell^1$ minimization, to approximate the Fourier coefficients of the PDE solution. We conduct a rigorous theoretical analysis showing that the approximation error of the proposed method is comparable with the best $s$-term approximation (with respect to the Fourier basis) to the solution. Using the recently introduced framework of random sampling in bounded Riesz systems, our analysis shows that the compressive Fourier collocation method mitigates the curse of dimensionality with respect to the number of collocation points under sufficient conditions on the regularity of the diffusion coefficient. We also present numerical experiments that illustrate the accuracy and stability of the method for the approximation of sparse and compressible solutions.
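    A hedged one-dimensional toy of the idea (the paper treats high-dimensional diffusion equations; the constant-coefficient operator, mode count, and sample size here are illustrative assumptions): collocate the PDE at random points and recover sparse Fourier coefficients with orthogonal matching pursuit.

        import numpy as np
        from sklearn.linear_model import OrthogonalMatchingPursuit

        rng = np.random.default_rng(0)
        K, m = 32, 60                            # candidate cosine modes, samples
        x = rng.uniform(0, 2 * np.pi, size=m)    # Monte Carlo collocation points

        # Manufactured sparse solution u(x) = 2 cos(3x) + 0.5 cos(7x) of
        # -u'' + u = f, so f(x) = (3^2+1)*2*cos(3x) + (7^2+1)*0.5*cos(7x).
        f = 20 * np.cos(3 * x) + 25 * np.cos(7 * x)

        # Column k-1: the operator (-d^2/dx^2 + 1) applied to cos(kx) at samples.
        A = np.stack([(k**2 + 1) * np.cos(k * x) for k in range(1, K)], axis=1)
        norms = np.linalg.norm(A, axis=0)
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2).fit(A / norms, f)
        coef = omp.coef_ / norms
        modes = np.nonzero(coef)[0] + 1
        print(modes, coef[modes - 1])            # expect modes [3 7], coeffs ~[2.0, 0.5]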
    ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero. (arXiv:1902.04522v5 [cs.AI] UPDATED)
    The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain to the understanding and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both model training and gameplay inference. Our code, models, selfplay datasets, and auxiliary data are publicly available at https://ai.facebook.com/tools/elf-opengo/.
    Learning programs by combining programs. (arXiv:2206.01614v1 [cs.LG])
    The goal of inductive logic programming is to induce a set of rules (a logic program) that generalises examples. Inducing programs with many rules and literals is a major challenge. To tackle this challenge, we decompose programs into \emph{non-separable} fragments, learn fragments separately, and then combine them. We implement our approach in a generate, test, combine, and constrain loop. Our anytime approach can learn optimal, recursive, and large programs and supports predicate invention. Our experiments on multiple domains (including program synthesis and inductive general game playing) show that our approach can increase predictive accuracies and reduce learning times compared to existing approaches.
    What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods. (arXiv:2112.04417v2 [cs.CV] UPDATED)
    A multitude of explainability methods and associated fidelity performance metrics have been proposed to help better understand how modern AI systems make decisions. However, much of the current work has remained theoretical -- without much consideration for the human end-user. In particular, it is not yet known (1) how useful current explainability methods are in practice for more real-world scenarios and (2) how well associated performance metrics accurately predict how much knowledge individual explanations contribute to a human end-user trying to understand the inner workings of the system. To fill this gap, we conducted psychophysics experiments at scale to evaluate the ability of human participants to leverage representative attribution methods for understanding the behavior of different image classifiers representing three real-world scenarios: identifying bias in an AI system, characterizing the visual strategy it uses for tasks that are too difficult for an untrained non-expert human observer, and understanding its failure cases. Our results demonstrate that the degree to which individual attribution methods help human participants better understand an AI system varied widely across these scenarios. This suggests a critical need for the field to move past quantitative improvements of current attribution methods towards the development of complementary approaches that provide qualitatively different sources of information to human end-users.
    A Fast and Convergent Proximal Algorithm for Regularized Nonconvex and Nonsmooth Bi-level Optimization. (arXiv:2203.16615v2 [cs.LG] UPDATED)
    Many important machine learning applications involve regularized nonconvex bi-level optimization. However, the existing gradient-based bi-level optimization algorithms cannot handle nonconvex or nonsmooth regularizers, and they suffer from a high computation complexity in nonconvex bi-level optimization. In this work, we study a proximal gradient-type algorithm that adopts the approximate implicit differentiation (AID) scheme for nonconvex bi-level optimization with possibly nonconvex and nonsmooth regularizers. In particular, the algorithm applies Nesterov's momentum to accelerate the computation of the implicit gradient involved in AID. We provide a comprehensive analysis of the global convergence properties of this algorithm through identifying its intrinsic potential function. In particular, we formally establish the convergence of the model parameters to a critical point of the bi-level problem, and obtain an improved computation complexity $\mathcal{O}(\kappa^{3.5}\epsilon^{-2})$ over the state-of-the-art result. Moreover, we analyze the asymptotic convergence rates of this algorithm under a class of local nonconvex geometries characterized by a Łojasiewicz-type gradient inequality. Experiments on hyper-parameter optimization demonstrate the effectiveness of our algorithm.
    Causal Transformer for Estimating Counterfactual Outcomes. (arXiv:2204.07258v2 [cs.LG] UPDATED)
    Estimating counterfactual outcomes over time from observational data is relevant for many applications (e.g., personalized medicine). Yet, state-of-the-art methods build upon simple long short-term memory (LSTM) networks, thus rendering inferences for complex, long-range dependencies challenging. In this paper, we develop a novel Causal Transformer for estimating counterfactual outcomes over time. Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders. For this, we combine three transformer subnetworks with separate inputs for time-varying covariates, previous treatments, and previous outcomes into a joint network with in-between cross-attentions. We further develop a custom, end-to-end training procedure for our Causal Transformer. Specifically, we propose a novel counterfactual domain confusion loss to address confounding bias: it aims to learn adversarially balanced representations, so that they are predictive of the next outcome but non-predictive of the current treatment assignment. We evaluate our Causal Transformer on synthetic and real-world datasets, where it achieves superior performance over current baselines. To the best of our knowledge, this is the first work proposing a transformer-based architecture for estimating counterfactual outcomes from longitudinal data.
    An alternative approach to train neural networks using monotone variational inequality. (arXiv:2202.08876v2 [stat.ML] UPDATED)
    Despite the vast empirical success of neural networks, theoretical understanding of the training procedures remains limited, especially in providing guarantees on testing performance, due to the non-convex nature of the optimization problem. The current paper investigates an alternative approach to neural network training by reducing it to another problem with convex structure -- solving a monotone variational inequality (MVI) -- inspired by recent work of Juditsky & Nemirovsky (2019). The solution to an MVI can be found by computationally efficient procedures, and, importantly, this leads to performance guarantees in the form of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery and prediction accuracy under the theoretical setting of training a single-layer linear neural network. In addition, we study the use of MVI for training multi-layer neural networks and propose a practical algorithm called stochastic variational inequality (SVI), and demonstrate its applicability in training fully-connected neural networks and graph neural networks (GNN) (SVI is completely general and can be used to train other types of neural networks). We demonstrate the competitive or better performance of SVI compared to widely-used stochastic gradient descent methods on both synthetic and real network data prediction tasks across various performance metrics, especially the improved efficiency in the early stage of training.
    Slot Order Matters for Compositional Scene Understanding. (arXiv:2206.01370v1 [cs.CV])
    Empowering agents with a compositional understanding of their environment is a promising next step toward solving long-horizon planning problems. On the one hand, we have seen encouraging progress on variational inference algorithms for obtaining sets of object-centric latent representations ("slots") from unstructured scene observations. On the other hand, generating scenes from slots has received less attention, in part because it is complicated by the lack of a canonical object order. A canonical object order is useful for learning the object correlations necessary to generate physically plausible scenes similar to how raster scan order facilitates learning pixel correlations for pixel-level autoregressive image generation. In this work, we address this lack by learning a fixed object order for a hierarchical variational autoencoder with a single level of autoregressive slots and a global scene prior. We cast autoregressive slot inference as a set-to-sequence modeling problem. We introduce an auxiliary loss to train the slot prior to generate objects in a fixed order. During inference, we align a set of inferred slots to the object order obtained from a slot prior rollout. To ensure the rolled out objects are meaningful for the given scene, we condition the prior on an inferred global summary of the input. Experiments on compositional environments and ablations demonstrate that our model with global prior, inference with aligned slot order, and auxiliary loss achieves state-of-the-art sample quality.
    It's DONE: Direct ONE-shot learning with Hebbian weight imprinting. (arXiv:2204.13361v2 [cs.LG] UPDATED)
    Learning a new concept from one example is a superior function of the human brain, and it is drawing attention in the field of machine learning as the one-shot learning task. In this paper, we propose the simplest method for this task with nonparametric weight imprinting, named Direct ONE-shot learning (DONE). DONE adds new classes to a pretrained deep neural network (DNN) classifier with neither training optimization nor modification of the pretrained DNN. DONE is inspired by Hebbian theory: it directly uses the neural activity input of the final dense layer, obtained from a single example belonging to the new class, as the connectivity weight (synaptic strength) of a newly provided output neuron for the new class, by transforming the statistical properties of the neural activity into those of synaptic strength. DONE requires just one inference for learning a new concept, and its procedure is simple, deterministic, and free of parameter tuning and hyperparameters. The performance of DONE depends entirely on the pretrained DNN model used as a backbone, and we confirmed that DONE with a well-trained backbone model achieves practical-level accuracy. DONE has several advantages, including the practical use of DNNs when training is too costly, the evaluation of existing DNN models, and the understanding of the brain. DONE might be telling us that one-shot learning is an easy task achievable by a simple principle, not only for humans but also for current well-trained DNN models.
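    A hedged sketch of the imprinting step (assuming the new class weight is the standardized final-layer input from the single example, rescaled to match the statistics of the existing weights; the paper's exact transformation may differ):

        import numpy as np

        def imprint_new_class(W, b, feature):
            """Append one output neuron whose weights come from the final
            dense layer's input activity for a single new-class example."""
            f = (feature - feature.mean()) / (feature.std() + 1e-12)
            w_new = f * W.std() + W.mean()          # match weight statistics
            return np.vstack([W, w_new]), np.append(b, b.mean())

        rng = np.random.default_rng(0)
        W = rng.normal(0.0, 0.05, size=(10, 512))   # pretrained 10-class head
        b = np.zeros(10)
        feat = rng.normal(size=512)                 # activity from one new example
        W, b = imprint_new_class(W, b, feat)        # now an 11-class classifier

    No gradient step is taken, matching the abstract's claim that a single inference suffices to add the class.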
    JARVix at SemEval-2022 Task 2: It Takes One to Know One? Idiomaticity Detection using Zero and One Shot Learning. (arXiv:2202.02394v4 [cs.CL] UPDATED)
    Large Language Models have been successful in a wide variety of Natural Language Processing tasks by capturing the compositionality of text representations. In spite of their great success, these vector representations fail to capture the meaning of idiomatic multi-word expressions (MWEs). In this paper, we focus on the detection of idiomatic expressions by using binary classification. We use a dataset consisting of the literal and idiomatic usage of MWEs in English and Portuguese. Thereafter, we perform the classification in two different settings: zero-shot and one-shot, to determine if a given sentence contains an idiom or not. N-shot classification for this task is defined by the number N of common idioms between the training and testing sets. In this paper, we train multiple Large Language Models in both settings and achieve an F1 score (macro) of 0.73 for the zero-shot setting and an F1 score (macro) of 0.85 for the one-shot setting. An implementation of our work can be found at https://github.com/ashwinpathak20/Idiomaticity_Detection_Using_Few_Shot_Learning.
    HierAttn: Effectively Learn Representations from Stage Attention and Branch Attention for Skin Lesions Diagnosis. (arXiv:2205.04326v5 [eess.IV] UPDATED)
    Accurate and unbiased examinations of skin lesions are critical for the early diagnosis and treatment of skin conditions and disorders. Visual features of skin lesions vary significantly because the images are collected from patients with different lesion colours and morphologies by using dissimilar imaging equipment. Recent studies have reported ensembled convolutional neural networks (CNNs) to classify the images for early diagnosis of skin disorders. However, the practical use of these ensembled CNNs is limited because they are heavyweight and inadequate for using contextual information. Although lightweight networks (e.g., MobileNetV3 and EfficientNet) were developed to achieve parameter reduction for implementing deep neural networks on mobile devices, insufficient depth of feature representation restricts their performance. To address these limitations, we introduce a new lightweight and effective neural network, HierAttn. HierAttn applies a novel strategy to learn local and global features by using multi-stage and multi-branch attention mechanisms. The efficacy of HierAttn was evaluated on the dermoscopy image dataset ISIC2019 and the smartphone photo dataset PAD-UFES-20 (PAD20). The experimental results show that HierAttn achieves the best accuracy and AUC among the state-of-the-art lightweight networks. The code is available at https://github.com/anthonyweidai/HierAttn.
    On the Generalization of Wasserstein Robust Federated Learning. (arXiv:2206.01432v1 [cs.LG])
    In federated learning, participating clients typically possess non-i.i.d. data, posing a significant challenge to generalization to unseen distributions. To address this, we propose a Wasserstein distributionally robust optimization scheme called WAFL. Leveraging its duality, we frame WAFL as an empirical surrogate risk minimization problem, and solve it using a local SGD-based algorithm with convergence guarantees. We show that the robustness of WAFL is more general than related approaches, and the generalization bound is robust to all adversarial distributions inside the Wasserstein ball (ambiguity set). Since the center location and radius of the Wasserstein ball can be suitably modified, WAFL shows its applicability not only in robustness but also in domain adaptation. Through empirical evaluation, we demonstrate that WAFL generalizes better than the vanilla FedAvg in non-i.i.d. settings, and is more robust than other related methods in distribution shift settings. Further, using benchmark datasets we show that WAFL is capable of generalizing to unseen target domains.
    On the Benefits of Large Learning Rates for Kernel Methods. (arXiv:2202.13733v2 [stat.ML] UPDATED)
    This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. First observed in the deep learning literature, we show that this phenomenon can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why the phenomenon already occurs in classification tasks without assuming any particular mismatch between the train and test data distributions.
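    To make the spectral mechanism concrete (a standard computation, not taken from the paper): for a quadratic objective $f(w) = \frac{1}{2}\langle w - w^\ast, H(w - w^\ast)\rangle$ with Hessian eigenpairs $(\lambda_i, v_i)$, gradient descent with step size $\eta$ satisfies, after $t$ steps,

    $$\langle w_t - w^\ast, v_i \rangle = (1 - \eta\lambda_i)^t \, \langle w_0 - w^\ast, v_i \rangle,$$

    so with early stopping a large $\eta$ suppresses the error components along large-eigenvalue directions much faster than along small ones, which is exactly how the learning rate shapes the spectral decomposition of the returned solution.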
    Non-Intrusive Reduced Models based on Operator Inference for Chaotic Systems. (arXiv:2206.01604v1 [cs.LG])
    This work explores the physics-driven machine learning technique Operator Inference (OpInf) for predicting the state of chaotic dynamical systems. OpInf provides a non-intrusive approach to infer approximations of polynomial operators in reduced space without having access to the full-order operators appearing in discretized models. Datasets for the physics systems are generated using conventional numerical solvers and then projected to a low-dimensional space via Principal Component Analysis (PCA). In latent space, a least-squares problem is set up to fit a quadratic polynomial operator, which is subsequently employed in a time-integration scheme to produce extrapolations in the same space. Once solved, the inverse PCA operation is applied to reconstruct the extrapolations in the original space. The quality of the OpInf predictions is assessed via the Normalized Root Mean Squared Error (NRMSE) metric, from which the Valid Prediction Time (VPT) is computed. Numerical experiments considering the chaotic systems Lorenz 96 and the Kuramoto-Sivashinsky equation show promising forecasting capabilities of the OpInf reduced-order models, with VPT ranges that outperform state-of-the-art machine learning methods such as backpropagation and reservoir computing recurrent neural networks [1]. The best results based on randomized initial conditions show that the Lorenz 96 system can be forecasted up to 6.66 or 3.19 Lyapunov time units, corresponding to the forcing terms F=8 and F=10, respectively, while the KS system achieved a remarkable 794 Lyapunov time units.
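    A hedged sketch of the OpInf pipeline on generic trajectory data (assuming a quadratic reduced model $\dot{z} = A z + H (z \otimes z)$ fit by plain least squares; regularization and the continuous-time details are simplified away):

        import numpy as np

        def operator_inference(X, dX, r):
            """X: (n, T) snapshots; dX: their time derivatives; r: latent dim."""
            U, _, _ = np.linalg.svd(X, full_matrices=False)
            V = U[:, :r]                                 # PCA/POD basis
            Z, dZ = V.T @ X, V.T @ dX                    # project to latent space
            Q = np.einsum("it,jt->ijt", Z, Z).reshape(r * r, -1)
            D = np.vstack([Z, Q])                        # data matrix [z; z kron z]
            O = np.linalg.lstsq(D.T, dZ.T, rcond=None)[0].T
            A, H = O[:, :r], O[:, r:]                    # inferred operators
            return V, lambda z: A @ z + H @ np.kron(z, z)

        # toy usage: fit on random snapshots, then integrate rhs (e.g., RK4)
        X = np.random.default_rng(0).normal(size=(40, 100))
        dX = np.gradient(X, axis=1)
        V, rhs = operator_inference(X, dX, r=5)          # lift back with V @ z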
    Alternating Synthetic and Real Gradients for Neural Language Modeling. (arXiv:1902.10630v2 [cs.LG] UPDATED)
    Training recurrent neural networks (RNNs) with backpropagation through time (BPTT) has known drawbacks, such as difficulty capturing long-term dependencies in sequences. Successful alternatives to BPTT have not yet been discovered. Recently, BP with synthetic gradients produced by a decoupled neural interface module has been proposed to replace BPTT for training RNNs. On the other hand, it has been shown that the representations learned with synthetic and real gradients are different, though they are functionally identical. In this project, we explore ways of combining synthetic and real gradients with application to neural language modeling tasks. Empirically, we demonstrate the effectiveness of alternating training with synthetic and real gradients after periodic warm restarts on language modeling tasks.
    Measuring Unintended Memorisation of Unique Private Features in Neural Networks. (arXiv:2202.08099v1 [cs.LG] CROSS LISTED)
    Neural networks pose a privacy risk to training data due to their propensity to memorise and leak information. Focusing on image classification, we show that neural networks also unintentionally memorise unique features even when they occur only once in training data. An example of a unique feature is a person's name that is accidentally present on a training image. Assuming access to the inputs and outputs of a trained model, the domain of the training data, and knowledge of unique features, we develop a score estimating the model's sensitivity to a unique feature by comparing the KL divergences of the model's output distributions given modified out-of-distribution images. Our results suggest that unique features are memorised by multi-layer perceptrons and convolutional neural networks trained on benchmark datasets, such as MNIST, Fashion-MNIST and CIFAR-10. We find that strategies to prevent overfitting (e.g., early stopping, regularisation, batch normalisation) do not prevent memorisation of unique features. These results imply that neural networks pose a privacy risk to rarely occurring private information. These risks can be more pronounced in healthcare applications if patient information is present in the training data.
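    A hedged sketch of the scoring idea (assuming the score compares the model's output distributions on out-of-distribution images with and without the unique feature inserted; `model`, the images, and the feature-insertion function are toy stand-ins):

        import numpy as np

        def kl(p, q, eps=1e-12):
            p, q = p + eps, q + eps
            return float(np.sum(p * np.log(p / q)))

        def sensitivity_score(model, images, add_feature):
            # average output-distribution shift caused by inserting the feature
            return np.mean([kl(model(add_feature(x)), model(x)) for x in images])

        def model(x):                      # toy classifier reacting to intensity
            logits = np.array([x.mean(), -x.mean()])
            e = np.exp(logits - logits.max())
            return e / e.sum()

        rng = np.random.default_rng(0)
        imgs = [rng.random((8, 8)) for _ in range(16)]
        print(sensitivity_score(model, imgs, lambda x: x + 0.5))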
    Safety Certification for Stochastic Systems via Neural Barrier Functions. (arXiv:2206.01463v1 [eess.SY])
    Providing non-trivial certificates of safety for non-linear stochastic systems is an important open problem that limits the wider adoption of autonomous systems in safety-critical applications. One promising solution to address this problem is barrier functions. The composition of a barrier function with a stochastic system forms a supermartingale, thus enabling the computation of the probability that the system stays in a safe set over a finite time horizon via martingale inequalities. However, existing approaches to find barrier functions for stochastic systems generally rely on convex optimization programs that restrict the search for a barrier to a small class of functions, such as low-degree SoS polynomials, and can be computationally expensive. In this paper, we parameterize a barrier function as a neural network and show that techniques for robust training of neural networks can be successfully employed to find neural barrier functions. Specifically, we leverage bound propagation techniques to certify that a neural network satisfies the conditions to be a barrier function via linear programming and then employ the resulting bounds at training time to enforce the satisfaction of these conditions. We also present a branch-and-bound scheme that makes the certification framework scalable. We show that our approach outperforms existing methods in several case studies and often returns certificates of safety that are orders of magnitude larger.
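    For context, the standard supermartingale barrier certificate underlying this computation (a known argument, independent of the neural parameterization): if $B \ge 0$, $B(x) \ge 1$ on the unsafe set, and $\mathbb{E}[B(x_{k+1}) \mid x_k] \le B(x_k)$ on the safe set, then Ville's maximal inequality gives

    $$\Pr\Big(\sup_{0 \le k \le T} B(x_k) \ge 1 \;\Big|\; x_0\Big) \le B(x_0),$$

    so $1 - B(x_0)$ lower-bounds the probability of remaining safe over the horizon; the training procedure described above enforces these conditions on the neural network via certified bounds.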
    Meta-Auto-Decoder for Solving Parametric Partial Differential Equations. (arXiv:2111.08823v2 [cs.LG] UPDATED)
    Partial Differential Equations (PDEs) are ubiquitous in many disciplines of science and engineering and notoriously difficult to solve. In general, closed-form solutions of PDEs are unavailable and numerical approximation methods are computationally expensive. The parameters of PDEs are variable in many applications, such as inverse problems, control and optimization, risk assessment, and uncertainty quantification. In these applications, our goal is to solve parametric PDEs rather than one instance of them. Our proposed approach, called Meta-Auto-Decoder (MAD), treats solving parametric PDEs as a meta-learning problem and utilizes the Auto-Decoder structure of DeepSDF (Park et al., 2019) to deal with different tasks/PDEs. Physics-informed losses induced from the PDE governing equations and boundary conditions are used as the training losses for different tasks. The goal of MAD is to learn a good model initialization that can generalize across different tasks, and eventually enables unseen tasks to be learned faster. The inspiration for MAD comes from the (conjectured) low-dimensional structure of parametric PDE solutions, and we explain our approach from the perspective of manifold learning. Finally, we demonstrate the power of MAD through extensive numerical studies, including Burgers' equation, Laplace's equation and time-domain Maxwell's equations. MAD exhibits faster convergence speed without losing accuracy compared with other deep learning methods.
    PAC Statistical Model Checking of Mean Payoff in Discrete- and Continuous-Time MDP. (arXiv:2206.01465v1 [eess.SY])
    Markov decision processes (MDP) and continuous-time MDP (CTMDP) are the fundamental models for non-deterministic systems with probabilistic uncertainty. Mean payoff (a.k.a. long-run average reward) is one of the most classic objectives considered in their context. We provide the first algorithm to compute mean payoff probably approximately correctly in unknown MDP; further, we extend it to unknown CTMDP. We do not require any knowledge of the state space, only a lower bound on the minimum transition probability, which has been advocated in literature. In addition to providing probably approximately correct (PAC) bounds for our algorithm, we also demonstrate its practical nature by running experiments on standard benchmarks.
    Learning with convolution and pooling operations in kernel methods. (arXiv:2111.08308v2 [stat.ML] UPDATED)
    Recent empirical work has shown that hierarchical convolutional kernels inspired by convolutional neural networks (CNNs) significantly improve the performance of kernel methods in image classification tasks. A widely accepted explanation for their success is that these architectures encode hypothesis classes that are suitable for natural images. However, understanding the precise interplay between approximation and generalization in convolutional architectures remains a challenge. In this paper, we consider the stylized setting of covariates (image pixels) uniformly distributed on the hypercube, and characterize exactly the RKHS of kernels composed of single layers of convolution, pooling, and downsampling operations. We use this characterization to compute sharp asymptotics of the generalization error for any given function in high-dimension. In particular, we quantify the gain in sample complexity brought by enforcing locality with the convolution operation and approximate translation invariance with average pooling. Notably, these results provide a precise description of how convolution and pooling operations trade off approximation with generalization power in one layer convolutional kernels.
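    In a common notation for one-layer convolutional kernels (the paper's exact definitions and normalizations may differ), with image patches $x_p \in \mathbb{R}^q$ and an inner kernel $\kappa$, convolution and average pooling give

    $$K_{\mathrm{conv}}(x, x') = \frac{1}{P}\sum_{p=1}^{P} \kappa\big(\langle x_p, x'_p\rangle\big), \qquad K_{\mathrm{pool}}(x, x') = \frac{1}{P^2}\sum_{p, p'=1}^{P} \kappa\big(\langle x_p, x'_{p'}\rangle\big),$$

    where the convolution kernel enforces locality by comparing aligned patches, and pooling averages over all patch pairs, building in the approximate translation invariance whose sample-complexity gains the paper quantifies.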
    BioADAPT-MRC: Adversarial Learning-based Domain Adaptation Improves Biomedical Machine Reading Comprehension Task. (arXiv:2202.13174v2 [cs.CL] UPDATED)
    Biomedical machine reading comprehension (biomedical-MRC) aims to comprehend complex biomedical narratives and assist healthcare professionals in retrieving information from them. The high performance of modern neural network-based MRC systems depends on high-quality, large-scale, human-annotated training datasets. In the biomedical domain, a crucial challenge in creating such datasets is the requirement for domain knowledge, inducing the scarcity of labeled data and the need for transfer learning from the labeled general-purpose (source) domain to the biomedical (target) domain. However, there is a discrepancy in marginal distributions between the general-purpose and biomedical domains due to the variances in topics. Therefore, direct-transferring of learned representations from a model trained on a general-purpose domain to the biomedical domain can hurt the model's performance. We present an adversarial learning-based domain adaptation framework for the biomedical machine reading comprehension task (BioADAPT-MRC), a neural network-based method to address the discrepancies in the marginal distributions between the general and biomedical domain datasets. BioADAPT-MRC relaxes the need for generating pseudo labels for training a well-performing biomedical-MRC model. We extensively evaluate the performance of BioADAPT-MRC by comparing it with the best existing methods on three widely used benchmark biomedical-MRC datasets -- BioASQ-7b, BioASQ-8b, and BioASQ-9b. Our results suggest that without using any synthetic or human-annotated data from the biomedical domain, BioADAPT-MRC can achieve state-of-the-art performance on these datasets. Availability: BioADAPT-MRC is freely available as an open-source project at https://github.com/mmahbub/BioADAPT-MRC.
    Improving Fairness in Large-Scale Object Recognition by CrowdSourced Demographic Information. (arXiv:2206.01326v1 [cs.CV])
    There has been increasing awareness of ethical issues in machine learning, and fairness has become an important research topic. Most fairness efforts in computer vision have been focused on human sensing applications and preventing discrimination based on people's physical attributes such as race, skin color or age by increasing visual representation for particular demographic groups. We argue that ML fairness efforts should extend to object recognition as well. Buildings, artwork, food and clothing are examples of the objects that define human culture. Representing these objects fairly in machine learning datasets will lead to models that are less biased towards a particular culture and more inclusive of different traditions and values. There exist many research datasets for object recognition, but they have not carefully considered which classes should be included, or how much training data should be collected per class. To address this, we propose a simple and general approach, based on crowdsourcing the demographic composition of the contributors: we define fair relevance scores, estimate them, and assign them to each class. We showcase its application to the landmark recognition domain, presenting a detailed analysis and the final fairer landmark rankings. Our analysis leads to a much fairer coverage of the world compared to existing datasets. The evaluation dataset was used for the 2021 Google Landmark Challenges, which was the first of its kind with an emphasis on fairness in generic object recognition.
    Instant Graph Neural Networks for Dynamic Graphs. (arXiv:2206.01379v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely used for modeling graph-structured data. With the development of numerous GNN variants, recent years have witnessed groundbreaking results in improving the scalability of GNNs to work on static graphs with millions of nodes. However, how to instantly represent continuous changes of large-scale dynamic graphs with GNNs is still an open problem. Existing dynamic GNNs focus on modeling the periodic evolution of graphs, often on a snapshot basis. Such methods suffer from two drawbacks: first, there is a substantial delay for the changes in the graph to be reflected in the graph representations, resulting in losses on the model's accuracy; second, repeatedly calculating the representation matrix on the entire graph in each snapshot is prohibitively time-consuming and severely limits the scalability. In this paper, we propose Instant Graph Neural Network (InstantGNN), an incremental computation approach for the graph representation matrix of dynamic graphs. Set to work with dynamic graphs with the edge-arrival model, our method avoids time-consuming, repetitive computations and allows instant updates on the representation and instant predictions. Graphs with dynamic structures and dynamic attributes are both supported. The upper bounds of time complexity of those updates are also provided. Furthermore, our method provides an adaptive training strategy, which guides the model to retrain at moments when it can make the greatest performance gains. We conduct extensive experiments on several real-world and synthetic datasets. Empirical results demonstrate that our model achieves state-of-the-art accuracy while having orders-of-magnitude higher efficiency than existing methods.
    Scalable Multirobot Planning for Informed Spatial Sampling. (arXiv:2105.10018v3 [cs.RO] UPDATED)
    This paper presents a distributed scalable multi-robot planning algorithm for informed sampling of quasistatic spatial fields. We address the problem of efficient data collection using multiple autonomous vehicles and consider the effects of communication between multiple robots, acting independently, on the overall sampling performance of the team. We focus on the distributed sampling problem where the robots operate independently of their teammates, but have the ability to communicate their current state to other neighbors within a fixed communication range. Our proposed approach is scalable and adaptive to various environmental scenarios, changing robot team configurations, and runs in real-time, which are important features for many real-world applications. We compare the performance of our proposed algorithm to baseline strategies through simulated experiments that utilize models derived from both synthetic and field deployment data. The results show that our sampling algorithm is efficient even when robots in the team are operating with a limited communication range, thus demonstrating the scalability of our method in sampling large-scale environments.
    Robust Multi-Objective Bayesian Optimization Under Input Noise. (arXiv:2202.07549v4 [cs.LG] UPDATED)
    Bayesian optimization (BO) is a sample-efficient approach for tuning design parameters to optimize expensive-to-evaluate, black-box performance metrics. In many manufacturing processes, the design parameters are subject to random input noise, resulting in a product that is often less performant than expected. Although BO methods have been proposed for optimizing a single objective under input noise, no existing method addresses the practical scenario where there are multiple objectives that are sensitive to input perturbations. In this work, we propose the first multi-objective BO method that is robust to input noise. We formalize our goal as optimizing the multivariate value-at-risk (MVaR), a risk measure of the uncertain objectives. Since directly optimizing MVaR is computationally infeasible in many settings, we propose a scalable, theoretically-grounded approach for optimizing MVaR using random scalarizations. Empirically, we find that our approach significantly outperforms alternative methods and efficiently identifies optimal robust designs that will satisfy specifications across multiple metrics with high probability.
    Adversarial Unlearning: Reducing Confidence Along Adversarial Directions. (arXiv:2206.01367v1 [cs.LG])
    Supervised learning methods trained with maximum likelihood objectives often overfit on training data. Most regularizers that prevent overfitting look to increase confidence on additional examples (e.g., data augmentation, adversarial training), or reduce it on training data (e.g., label smoothing). In this work we propose a complementary regularization strategy that reduces confidence on self-generated examples. The method, which we call RCAD (Reducing Confidence along Adversarial Directions), aims to reduce confidence on out-of-distribution examples lying along directions adversarially chosen to increase training loss. In contrast to adversarial training, RCAD does not try to robustify the model to output the original label, but rather regularizes it to have reduced confidence on points generated using much larger perturbations than in conventional adversarial training. RCAD can be easily integrated into training pipelines with a few lines of code. Despite its simplicity, we find on many classification benchmarks that RCAD can be added to existing techniques (e.g., label smoothing, MixUp training) to increase test accuracy by 1-3% in absolute value, with more significant gains in the low data regime. We also provide a theoretical analysis that helps to explain these benefits in simplified settings, showing that RCAD can provably help the model unlearn spurious features in the training data.
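    A hedged sketch of the regularizer (my reading of the abstract, as a torch-style loss; the step size `alpha` is deliberately much larger than typical adversarial-training budgets, and `lam` is a hypothetical weight):

        import torch
        import torch.nn.functional as F

        def rcad_loss(model, x, y, alpha=1.0, lam=0.1):
            ce = F.cross_entropy(model(x), y)          # standard training loss
            # adversarial direction: sign of the input gradient of the loss
            x_req = x.detach().requires_grad_(True)
            g, = torch.autograd.grad(F.cross_entropy(model(x_req), y), x_req)
            x_far = (x + alpha * g.sign()).detach()    # large-step perturbation
            # reduce confidence: push toward high predictive entropy out there
            p = F.softmax(model(x_far), dim=1)
            entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=1).mean()
            return ce - lam * entropy

    The returned value drops into an existing training loop in place of the plain cross-entropy, matching the abstract's claim that the method is a few-line addition.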
    Can Requirements Engineering Support Explainable Artificial Intelligence? Towards a User-Centric Approach for Explainability Requirements. (arXiv:2206.01507v1 [cs.SE])
    With the recent proliferation of artificial intelligence systems, there has been a surge in the demand for explainability of these systems. Explanations help to reduce system opacity, support transparency, and increase stakeholder trust. In this position paper, we discuss synergies between requirements engineering (RE) and Explainable AI (XAI). We highlight challenges in the field of XAI, and propose a framework and research directions on how RE practices can help to mitigate these challenges.
    Generalization for multiclass classification with overparameterized linear models. (arXiv:2206.01399v1 [cs.LG])
    Via an overparameterized linear model with Gaussian features, we provide conditions for good generalization for multiclass classification of minimum-norm interpolating solutions in an asymptotic setting where both the number of underlying features and the number of classes scale with the number of training points. The survival/contamination analysis framework for understanding the behavior of overparameterized learning problems is adapted to this setting, revealing that multiclass classification qualitatively behaves like binary classification in that, as long as there are not too many classes (made precise in the paper), it is possible to generalize well even in some settings where the corresponding regression tasks would not generalize. Besides various technical challenges, it turns out that the key difference from the binary classification setting is that there are relatively fewer positive training examples of each class in the multiclass setting as the number of classes increases, making the multiclass problem "harder" than the binary one.
    Evaluating Transfer-based Targeted Adversarial Perturbations against Real-World Computer Vision Systems based on Human Judgments. (arXiv:2206.01467v1 [cs.CV])
    Computer vision systems are remarkably vulnerable to adversarial perturbations. Transfer-based adversarial images are generated on one (source) system and used to attack another (target) system. In this paper, we take the first step to investigate transfer-based targeted adversarial images in a realistic scenario where the target system is trained on some private data with its inventory of semantic labels not publicly available. Our main contributions include an extensive human-judgment-based evaluation of attack success on the Google Cloud Vision API and additional analysis of the different behaviors of Google Cloud Vision when presented with original versus adversarial images. Resources are publicly available at \url{https://github.com/ZhengyuZhao/Targeted-Tansfer/blob/main/google_results.zip}.
    Game of Privacy: Towards Better Federated Platform Collaboration under Privacy Restriction. (arXiv:2202.05139v3 [cs.LG] UPDATED)
    Vertical federated learning (VFL) aims to train models from cross-silo data with different feature spaces stored on different platforms. Existing VFL methods usually assume all data on each platform can be used for model training. However, due to the intrinsic privacy risks of federated learning, the total amount of involved data may be constrained. In addition, existing VFL studies usually assume only one platform has task labels and can benefit from the collaboration, making it difficult to attract other platforms to join in the collaborative learning. In this paper, we study the platform collaboration problem in VFL under privacy constraints. We propose to incentivize different platforms through a reciprocal collaboration, where all platforms can exploit multi-platform information in the VFL framework to benefit their own tasks. With limited privacy budgets, each platform needs to wisely allocate its data quotas for collaboration with other platforms. Thereby, they naturally form a multi-party game. There are two core problems in this game, i.e., how to appraise other platforms' data value to compute game rewards and how to optimize policies to solve the game. To evaluate the contributions of other platforms' data, each platform offers a small amount of "deposit" data to participate in the VFL. We propose a performance estimation method to predict the expected model performance when involving different combinations of amounts of inter-platform data. To solve the game, we propose a platform negotiation method that simulates the bargaining among platforms and locally optimizes their policies via gradient descent. Extensive experiments on two real-world datasets show that our approach can effectively facilitate the collaborative exploitation of multi-platform data in VFL under privacy restrictions.
    Optimal Weak to Strong Learning. (arXiv:2206.01563v1 [cs.LG])
    The classic AdaBoost algorithm converts a weak learner, that is, an algorithm producing a hypothesis slightly better than chance, into a strong learner that achieves arbitrarily high accuracy given enough training data. We present a new algorithm that constructs a strong learner from a weak learner but uses less training data than AdaBoost and all other weak to strong learners to achieve the same generalization bounds. A sample complexity lower bound shows that our new algorithm uses the minimum possible amount of training data and is thus optimal. Hence, this work settles the sample complexity of the classic problem of constructing a strong learner from a weak learner.
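    For orientation, the classic weak-to-strong conversion the abstract builds on is AdaBoost's reweighting loop, sketched below; the paper's new, sample-optimal algorithm differs, so this only illustrates the problem setting.

        import numpy as np

        def adaboost(weak_fit, X, y, rounds=50):
            """y in {-1, +1}; weak_fit(X, y, w) returns a classifier h with h(X) in {-1, +1}."""
            n = len(y)
            w = np.full(n, 1.0 / n)             # uniform initial weights
            hs, alphas = [], []
            for _ in range(rounds):
                h = weak_fit(X, y, w)
                pred = h(X)
                err = np.sum(w * (pred != y))
                if err >= 0.5:                  # weak learner no better than chance: stop
                    break
                alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
                w *= np.exp(-alpha * y * pred)  # up-weight misclassified points
                w /= w.sum()
                hs.append(h); alphas.append(alpha)
            return lambda X_: np.sign(sum(a * h(X_) for a, h in zip(alphas, hs)))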
    Adaptive Learning for Discovery. (arXiv:2205.14829v2 [stat.ML] UPDATED)
    In this paper, we study a sequential decision-making problem, called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label the points with the goal of maximizing the sum of responses. This problem has wide applications to real-world discovery tasks, for example drug discovery aided by machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma: the algorithm needs to choose points that yield information to improve model estimates, but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.
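    The generic IDS selection rule is compact: pick the candidate minimizing squared expected regret per unit of information gain. The sketch below assumes the current model already supplies those two estimates; it is schematic rather than the paper's algorithm.

        import numpy as np

        def ids_select(expected_regret, info_gain):
            """expected_regret, info_gain: arrays over candidate points,
            estimated from the current model posterior."""
            ratio = expected_regret ** 2 / np.maximum(info_gain, 1e-12)
            return int(np.argmin(ratio))  # index of the next point to label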
    Constraining Gaussian processes for physics-informed acoustic emission mapping. (arXiv:2206.01495v1 [cs.LG])
    The automated localisation of damage in structures is a challenging but critical ingredient in the path towards predictive or condition-based maintenance of high value structures. The use of acoustic emission time of arrival mapping is a promising approach to this challenge, but is severely hindered by the need to collect a dense set of artificial acoustic emission measurements across the structure, resulting in a lengthy and often impractical data acquisition process. In this paper, we consider the use of physics-informed Gaussian processes for learning these maps to alleviate this problem. In the approach, the Gaussian process is constrained to the physical domain such that information relating to the geometry and boundary conditions of the structure is embedded directly into the learning process, returning a model that guarantees that any predictions made satisfy physically-consistent behaviour at the boundary. We consider a number of scenarios that arise when training measurement acquisition is limited, including cases where training data are sparse or cover only part of the structure of interest. Using a complex plate-like structure as an experimental case study, we show that our approach significantly reduces the burden of data collection, where it is seen that incorporating boundary condition knowledge significantly improves predictive accuracy as training observations are reduced, particularly when training measurements are not available across all parts of the structure.
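    One standard way to hard-constrain a GP at known boundaries is to scale the kernel by a function that vanishes there, so every sample path satisfies the condition by construction. The toy construction below illustrates the principle on the unit interval and is not necessarily the paper's exact formulation.

        import numpy as np

        def constrained_kernel(k, phi):
            """A GP with prior f(x) = phi(x) g(x) is exactly zero wherever phi vanishes."""
            return lambda x, x2: phi(x) * phi(x2) * k(x, x2)

        rbf = lambda x, x2: np.exp(-0.5 * (x - x2) ** 2 / 0.1)   # base kernel
        k_bc = constrained_kernel(rbf, lambda x: x * (1.0 - x))  # zero at x = 0 and x = 1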
    Accelerating hydrodynamic simulations of urban drainage systems with physics-guided machine learning. (arXiv:2206.01538v1 [cs.LG])
    We propose and demonstrate a new approach for fast and accurate surrogate modelling of urban drainage system hydraulics based on physics-guided machine learning. The surrogates are trained against a limited set of simulation results from a hydrodynamic (HiFi) model. Our approach reduces simulation times by one to two orders of magnitude compared to a HiFi model. It is thus slower than e.g. conceptual hydrological models, but it enables simulations of water levels, flows and surcharges in all nodes and links of a drainage network and thus largely preserves the level of detail provided by HiFi models. Comparing time series simulated by the surrogate and the HiFi model, R2 values on the order of 0.9 are achieved. Surrogate training times are currently on the order of one hour. However, they can likely be reduced through the application of transfer learning and graph neural networks. Our surrogate approach will be useful for interactive workshops in initial design phases of urban drainage systems, as well as for real time applications. In addition, our model formulation is generic and future research should investigate its application to simulating other water systems.
    Reinforcement Learning with Neural Radiance Fields. (arXiv:2206.01634v1 [cs.LG])
    It is a long-standing problem to find effective representations for training reinforcement learning (RL) agents. This paper demonstrates that learning state representations with supervision from Neural Radiance Fields (NeRFs) can improve the performance of RL compared to other learned representations or even low-dimensional, hand-engineered state information. Specifically, we propose to train an encoder that maps multiple image observations to a latent space describing the objects in the scene. The decoder built from a latent-conditioned NeRF serves as the supervision signal to learn the latent space. An RL algorithm then operates on the learned latent space as its state representation. We call this NeRF-RL. Our experiments indicate that NeRF as supervision leads to a latent space better suited for the downstream RL tasks involving robotic object manipulations like hanging mugs on hooks, pushing objects, or opening doors. Video: https://dannydriess.github.io/nerf-rl
    UncertaINR: Uncertainty Quantification of End-to-End Implicit Neural Representations for Computed Tomography. (arXiv:2202.10847v2 [eess.IV] UPDATED)
    Implicit neural representations (INRs) have achieved impressive results for scene reconstruction and computer graphics, where their performance has primarily been assessed on reconstruction accuracy. As INRs make their way into other domains, where model predictions inform high-stakes decision-making, uncertainty quantification of INR inference is becoming critical. To that end, we study a Bayesian reformulation of INRs, UncertaINR, in the context of computed tomography, and evaluate several Bayesian deep learning implementations in terms of accuracy and calibration. We find that they achieve well-calibrated uncertainty, while retaining accuracy competitive with other classical, INR-based, and CNN-based reconstruction techniques. In contrast to the best-performing prior approaches, UncertaINR does not require a large training dataset, but only a handful of validation images.
    Central-Smoothing Hypergraph Neural Networks for Predicting Drug-Drug Interactions. (arXiv:2112.07837v3 [cs.LG] UPDATED)
    Predicting drug-drug interactions (DDI) is the problem of predicting side effects (unwanted outcomes) of a pair of drugs using drug information and known side effects of many pairs. This problem can be formulated as predicting labels (i.e. side effects) for each pair of nodes in a DDI graph, whose nodes are drugs and whose edges connect interacting drug pairs with known labels. State-of-the-art methods for this problem are graph neural networks (GNNs), which leverage neighborhood information in the graph to learn node representations. For DDI, however, there are many labels with complicated relationships due to the nature of side effects. Usual GNNs often fix labels as one-hot vectors that do not reflect label relationships and potentially do not obtain the highest performance in the difficult cases of infrequent labels. In this paper, we formulate DDI as a hypergraph where each hyperedge is a triple: two nodes for drugs and one node for a label. We then present CentSmoothie, a hypergraph neural network that learns representations of nodes and labels altogether with a novel central-smoothing formulation. We empirically demonstrate the performance advantages of CentSmoothie in simulations as well as on real datasets.
    Transformer-Based Self-Supervised Learning for Emotion Recognition. (arXiv:2204.05103v2 [q-bio.NC] CROSS LISTED)
    In order to exploit representations of time-series signals, such as physiological signals, it is essential that these representations capture relevant information from the whole signal. In this work, we propose to use a Transformer-based model to process electrocardiograms (ECG) for emotion recognition. Attention mechanisms of the Transformer can be used to build contextualized representations for a signal, giving more importance to relevant parts. These representations may then be processed with a fully-connected network to predict emotions. To overcome the relatively small size of datasets with emotional labels, we employ self-supervised learning. We gathered several ECG datasets with no emotion labels to pre-train our model, which we then fine-tuned for emotion recognition on the AMIGOS dataset. We show that our approach reaches state-of-the-art performance for emotion recognition using ECG signals on AMIGOS. More generally, our experiments show that transformers and pre-training are promising strategies for emotion recognition with physiological signals.
    Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards. (arXiv:2206.01293v1 [cs.LG])
    Incrementality, which is used to measure the causal effect of showing an ad to a potential customer (e.g. a user in an internet platform) versus not, is a central object for advertisers in online advertising platforms. This paper investigates the problem of how an advertiser can learn to optimize the bidding sequence in an online manner \emph{without} knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most $\widetilde{O}(H^2\sqrt{T})$, which depends on the number of rounds $H$ and number of episodes $T$, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem and standard RL problems is that the realized reward feedback from conversion incrementality is \emph{mixed} and \emph{delayed}. To handle this difficulty we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent interest.
    Completion Time Minimization of Fog-RAN-Assisted Federated Learning With Rate-Splitting Transmission. (arXiv:2206.01373v1 [eess.SP])
    This work studies federated learning (FL) over a fog radio access network, in which multiple internet-of-things (IoT) devices cooperatively learn a shared machine learning model by communicating with a cloud server (CS) through distributed access points (APs). Under the assumption that the fronthaul links connecting APs to CS have finite capacity, a rate-splitting transmission at IoT devices (IDs) is proposed which enables hybrid edge and cloud decoding of split uplink messages. The problem of completion time minimization for FL is tackled by optimizing the rate-splitting transmission and fronthaul quantization strategies along with training hyperparameters such as precision and iteration numbers. Numerical results show that the proposed rate-splitting transmission achieves notable gains over benchmark schemes which rely solely on edge or cloud decoding.
    A Survey on Surrogate-assisted Efficient Neural Architecture Search. (arXiv:2206.01520v1 [cs.LG])
    Neural architecture search (NAS) has become increasingly popular in the deep learning community recently, mainly because it can provide an opportunity to allow interested users without rich expertise to benefit from the success of deep neural networks (DNNs). However, NAS is still laborious and time-consuming because a large number of performance estimations are required during the search process, and training DNNs is computationally intensive. Improving the efficiency of these performance estimations is therefore essential to overcoming this major limitation of NAS. This paper begins with a brief introduction to the general framework of NAS. Then, the methods for evaluating network candidates under proxy metrics are systematically discussed. This is followed by a description of surrogate-assisted NAS, which is divided into three different categories, namely Bayesian optimization for NAS, surrogate-assisted evolutionary algorithms for NAS, and MOP for NAS. Finally, remaining challenges and open research questions are discussed, and promising research topics are suggested in this emerging field.
    Decentralized Training of Foundation Models in Heterogeneous Environments. (arXiv:2206.01288v1 [cs.DC])
    Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model parallel foundation model training, such as Megatron, only consider the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8X faster than prior state-of-the-art training systems (Megatron).
    Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees. (arXiv:2206.01299v1 [cs.LG])
    Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AC-SGD, a novel activation compression algorithm for communication-efficient pipeline parallelism training over slow networks. Different from previous efforts in activation compression, instead of compressing activation values directly, AC-SGD compresses the changes of the activations. This allows us to show, to the best of our knowledge for the first time, that one can still achieve $O(1/\sqrt{T})$ convergence rate for non-convex objectives under activation compression, without making assumptions on gradient unbiasedness that do not hold for deep learning models with non-linear activation functions. We then show that AC-SGD can be optimized and implemented efficiently, without additional end-to-end runtime overhead. We evaluated AC-SGD to fine-tune language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AC-SGD provides up to 4.3X end-to-end speed-up in slower networks, without sacrificing model quality. Moreover, we also show that AC-SGD can be combined with state-of-the-art gradient compression algorithms to enable "end-to-end communication compression": all communications between machines, including model gradients, forward activations, and backward gradients, are compressed into lower precision. This provides up to 4.9X end-to-end speed-up, without sacrificing model quality.
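    The core trick, quantizing the change in activations against a running reference so that the quantization range shrinks as activations stabilize over training, can be sketched as follows. Bit width and scaling are illustrative, and this is not the authors' implementation.

        import torch

        class DeltaCompressor:
            """Compress changes of activations rather than the activations themselves
            (illustrative sketch of the idea behind AC-SGD)."""
            def __init__(self, bits=4):
                self.levels = 2 ** bits - 1
                self.ref = None  # last communicated (dequantized) activation

            def compress(self, act):
                if self.ref is None:
                    self.ref = torch.zeros_like(act)
                delta = act - self.ref
                scale = delta.abs().max().clamp_min(1e-12)
                q = torch.round((delta / scale + 1) / 2 * self.levels)  # uniform quantizer
                deq = (q / self.levels * 2 - 1) * scale
                self.ref = self.ref + deq  # the receiver reconstructs the same reference
                return q, scale            # integers + one scale are what gets sent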
    Rethinking Class-Prior Estimation for Positive-Unlabeled Learning. (arXiv:2002.03673v2 [cs.LG] UPDATED)
    Given only positive (P) and unlabeled (U) data, PU learning can train a binary classifier without any negative data. It has two building blocks: PU class-prior estimation (CPE) and PU classification; the latter has been well studied while the former has received less attention. Hitherto, the distributional-assumption-free CPE methods rely on a critical assumption that the support of the positive data distribution cannot be contained in the support of the negative data distribution. If this is violated, those CPE methods will systematically overestimate the class prior; worse still, we cannot verify the assumption from the data. In this paper, we rethink CPE for PU learning: can we remove the assumption to make CPE always valid? We show an affirmative answer by proposing Regrouping CPE (ReCPE) that builds an auxiliary probability distribution such that the support of the positive data distribution is never contained in the support of the negative data distribution. ReCPE can work with any CPE method by treating it as the base method. Theoretically, ReCPE does not affect its base if the assumption already holds for the original probability distribution; otherwise, it reduces the positive bias of its base. Empirically, ReCPE improves all state-of-the-art CPE methods on various datasets, implying that the assumption has indeed been violated here.
    Approximation of Images via Generalized Higher Order Singular Value Decomposition over Finite-dimensional Commutative Semisimple Algebra. (arXiv:2202.00450v7 [cs.LG] UPDATED)
    Low-rank approximation of images via singular value decomposition is well-received in the era of big data. However, singular value decomposition (SVD) is only for order-two data, i.e., matrices. It is necessary to flatten a higher order input into a matrix or break it into a series of order-two slices to tackle higher order data such as multispectral images and videos with the SVD. Higher order singular value decomposition (HOSVD) extends the SVD and can approximate higher order data using sums of a few rank-one components. We consider the problem of generalizing HOSVD over a finite dimensional commutative algebra. This algebra, referred to as a t-algebra, generalizes the field of complex numbers. The elements of the algebra, called t-scalars, are fixed-size arrays of complex numbers. One can generalize matrices and tensors over t-scalars and then extend many canonical matrix and tensor algorithms, including HOSVD, to obtain higher-performance versions. The generalization of HOSVD is called THOSVD. Its performance in approximating multi-way data can be further improved by an alternating algorithm. THOSVD also unifies a wide range of principal component analysis algorithms. To exploit the potential of generalized algorithms using t-scalars for approximating images, we use a pixel neighborhood strategy to convert each pixel to a "deeper-order" t-scalar. Experiments on publicly available images show that the generalized algorithm over t-scalars, namely THOSVD, compares favorably with its canonical counterparts.
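    For reference, the classical (complex- or real-valued) truncated HOSVD that THOSVD generalizes to t-scalars fits in a few lines of NumPy:

        import numpy as np

        def hosvd(T, ranks):
            """Truncated higher order SVD: factor matrices from mode-n unfoldings,
            then the core tensor via mode-n products with their transposes."""
            Us = []
            for n, r in enumerate(ranks):
                unfold = np.moveaxis(T, n, 0).reshape(T.shape[n], -1)  # mode-n unfolding
                U, _, _ = np.linalg.svd(unfold, full_matrices=False)
                Us.append(U[:, :r])
            core = T
            for n, U in enumerate(Us):
                core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, n, 0), axes=1), 0, n)
            return core, Us  # reconstruction: core multiplied back by each U in each mode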
    Disentangling Epistemic and Aleatoric Uncertainty in Reinforcement Learning. (arXiv:2206.01558v1 [cs.LG])
    Characterizing aleatoric and epistemic uncertainty on the predicted rewards can help in building reliable reinforcement learning (RL) systems. Aleatoric uncertainty results from the irreducible environment stochasticity leading to inherently risky states and actions. Epistemic uncertainty results from the limited information accumulated during learning to make informed decisions. Characterizing aleatoric and epistemic uncertainty can be used to speed up learning in a training environment, improve generalization to similar testing environments, and flag unfamiliar behavior in anomalous testing environments. In this work, we introduce a framework for disentangling aleatoric and epistemic uncertainty in RL. (1) We first define four desiderata that capture the desired behavior for aleatoric and epistemic uncertainty estimation in RL at both training and testing time. (2) We then present four RL models inspired by supervised learning (i.e. Monte Carlo dropout, ensemble, deep kernel learning models, and evidential networks) to instantiate aleatoric and epistemic uncertainty. Finally, (3) we propose a practical evaluation method to evaluate uncertainty estimation in model-free RL based on detection of out-of-distribution environments and generalization to perturbed environments. We present theoretical and experimental evidence to validate that carefully equipping model-free RL agents with supervised learning uncertainty methods can fulfill our desiderata.
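    For the ensemble instantiation, the standard decomposition via the law of total variance is a useful reference point (a generic sketch, not the paper's exact estimator):

        import numpy as np

        def decompose_uncertainty(means, variances):
            """means, variances: (n_members, ...) per-ensemble-member predictive mean/variance.
            Law of total variance: total = E[var] (aleatoric) + var[E] (epistemic)."""
            aleatoric = variances.mean(axis=0)  # average irreducible noise estimate
            epistemic = means.var(axis=0)       # disagreement across ensemble members
            return aleatoric, epistemic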
    Compositional Scene Representation Learning via Reconstruction: A Survey. (arXiv:2202.07135v2 [cs.LG] UPDATED)
    Visual scene representation learning is an important research problem in the field of computer vision. The performance of artificial intelligence systems on vision tasks could be improved if more suitable representations are learned for visual scenes. Complex visual scenes are composed of relatively simple visual concepts, and have the property of combinatorial explosion. Compared with directly representing the entire visual scene, extracting compositional scene representations can better cope with the diverse combinations of background and objects. Because compositional scene representations abstract the concept of objects, performing visual scene analysis and understanding based on these representations could be easier and more interpretable. Moreover, learning via reconstruction can greatly reduce the need for training data annotations. Therefore, reconstruction-based compositional scene representation learning has important research significance. In this survey, we first outline the current progress on this research topic, including development history and categorizations of existing methods from the perspectives of modeling of visual scenes and inference of scene representations; then provide benchmarks, including an open source toolbox to reproduce the benchmark experiments, of representative methods that consider the most extensively studied problem setting and form the foundation for other methods; and finally discuss the future directions of this research topic.
    Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. (arXiv:2201.12023v2 [cs.LG] UPDATED)
    Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa
    Understanding the Role of Nonlinearity in Training Dynamics of Contrastive Learning. (arXiv:2206.01342v1 [cs.LG])
    While the empirical success of self-supervised learning (SSL) heavily relies on the usage of deep nonlinear models, many theoretical works proposed to understand SSL still focus on linear ones. In this paper, we study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one and two-layer nonlinear networks with homogeneous activation $h(x) = h'(x)x$. We theoretically demonstrate that (1) the presence of nonlinearity leads to many local optima even in the one-layer setting, each corresponding to certain patterns from the data distribution, while with linear activation, only one major pattern can be learned; and (2) nonlinearity leads to specialized weights into diverse patterns, a behavior that linear activation is proven not capable of. These findings suggest that models with lots of parameters can be regarded as a \emph{brute-force} way to find these local optima induced by nonlinearity, a possible underlying reason why empirical observations such as the lottery ticket hypothesis hold. In addition, in the two-layer setting, we also discover \emph{global modulation}: those local patterns discriminative from the perspective of global-level patterns are prioritized to learn, further characterizing the learning process. Simulation verifies our theoretical findings.
    OntoProtein: Protein Pretraining With Gene Ontology Embedding. (arXiv:2201.11147v6 [q-bio.BM] UPDATED)
    Self-supervised protein language models have proved their effectiveness in learning protein representations. With increasing computational power, current protein language models pre-trained on millions of diverse sequences can scale from millions to billions of parameters and achieve remarkable improvements. However, those prevailing approaches rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better protein representations. We argue that informative biological knowledge in KGs can enhance protein representation with external knowledge. In this work, we propose OntoProtein, the first general framework that incorporates the structure of GO (Gene Ontology) into protein pre-training models. We construct a novel large-scale knowledge graph that consists of GO and its related proteins, in which gene annotation texts or protein sequences describe all nodes in the graph. We propose novel contrastive learning with knowledge-aware negative sampling to jointly optimize the knowledge graph and protein embedding during pre-training. Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models on the TAPE benchmark and yield better performance compared with baselines in protein-protein interaction and protein function prediction. Code and datasets are available at https://github.com/zjunlp/OntoProtein.
    XPASC: Measuring Generalization in Weak Supervision. (arXiv:2206.01444v1 [cs.LG])
    Weak supervision is leveraged in a wide range of domains and tasks due to its ability to create massive amounts of labeled data, requiring only little manual effort. Standard approaches use labeling functions to specify signals that are relevant for the labeling. It has been conjectured that weakly supervised models over-rely on those signals and as a result suffer from overfitting. To verify this assumption, we introduce a novel method, XPASC (eXPlainability-Association SCore), for measuring the generalization of a model trained with a weakly supervised dataset. Considering the occurrences of features, classes and labeling functions in a dataset, XPASC takes into account the relevance of each feature for the predictions of the model as well as the associations of the feature with the class and the labeling function, respectively. The association in XPASC can be measured in two variants: XPASC-CHI SQUARE measures associations relative to their statistical significance, while XPASC-PPMI measures association strength more generally. We use XPASC to analyze KnowMAN, an adversarial architecture intended to control the degree of generalization from the labeling functions and thus to mitigate the problem of overfitting. On the one hand, we show that KnowMAN is able to control the degree of generalization through a hyperparameter. On the other hand, results and qualitative analysis show that generalization and performance do not relate one-to-one, and that the highest degree of generalization does not necessarily imply the best performance. Therefore, methods that allow for controlling the amount of generalization can achieve the right degree of benign overfitting. Our contributions in this study are i) the XPASC score to measure generalization in weakly-supervised models, ii) evaluation of XPASC across datasets and models and iii) the release of the XPASC implementation.
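    The PPMI variant of the association score is a standard quantity and easy to state. The sketch below computes it from a feature-class co-occurrence table; how XPASC aggregates these values with feature relevance into its final score is described in the paper, not here.

        import numpy as np

        def ppmi(counts):
            """counts[i, j]: co-occurrences of feature i with class (or labeling function) j."""
            p_xy = counts / counts.sum()
            p_x = p_xy.sum(axis=1, keepdims=True)
            p_y = p_xy.sum(axis=0, keepdims=True)
            with np.errstate(divide="ignore", invalid="ignore"):
                pmi = np.where(p_xy > 0, np.log(p_xy / (p_x * p_y)), 0.0)
            return np.maximum(pmi, 0.0)  # keep only positive associations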
    Equipping Black-Box Policies with Model-Based Advice for Stable Nonlinear Control. (arXiv:2206.01341v1 [cs.LG])
    Machine-learned black-box policies are ubiquitous for nonlinear control problems. Meanwhile, crude model information is often available for these problems from, e.g., linear approximations of nonlinear dynamics. We study the problem of equipping a black-box control policy with model-based advice for nonlinear control on a single trajectory. We first show a general negative result that a naive convex combination of a black-box policy and a linear model-based policy can lead to instability, even if the two policies are both stabilizing. We then propose an adaptive $\lambda$-confident policy, with a coefficient $\lambda$ indicating the confidence in a black-box policy, and prove its stability. With bounded nonlinearity, in addition, we show that the adaptive $\lambda$-confident policy achieves a bounded competitive ratio when a black-box policy is near-optimal. Finally, we propose an online learning approach to implement the adaptive $\lambda$-confident policy and verify its efficacy in case studies about the CartPole problem and a real-world electric vehicle (EV) charging problem with data bias due to COVID-19.
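    The combination rule itself is a one-liner. The sketch below wraps it in an online loop with a purely hypothetical confidence update; the paper derives its own adaptation rule and the accompanying stability and competitive-ratio guarantees.

        import numpy as np

        def lambda_confident_controller(u_model, u_blackbox, states, lam0=0.5, eta=0.05):
            """Run u = lam * u_bb + (1 - lam) * u_mb along a trajectory, adapting lam online.
            The update heuristic here is a stand-in, not the paper's method."""
            lam = lam0
            actions = []
            for x in states:
                u_bb, u_mb = u_blackbox(x), u_model(x)
                actions.append(lam * u_bb + (1.0 - lam) * u_mb)
                # hypothetical heuristic: trust the black-box more when it agrees with the model
                agree = np.exp(-np.linalg.norm(u_bb - u_mb))
                lam = np.clip(lam + eta * (agree - 0.5), 0.0, 1.0)
            return actions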
    Modeling electronic health record data using a knowledge-graph-embedded topic model. (arXiv:2206.01436v1 [cs.LG])
    The rapid growth of electronic health record (EHR) datasets opens up promising opportunities to understand human diseases in a systematic way. However, effective extraction of clinical knowledge from the EHR data has been hindered by its sparsity and noisy information. We present KG-ETM, an end-to-end knowledge graph-based multimodal embedded topic model. KG-ETM distills latent disease topics from EHR data by learning the embedding from the medical knowledge graphs. We applied KG-ETM to a large-scale EHR dataset consisting of over 1 million patients. We evaluated its performance based on EHR reconstruction and drug imputation. KG-ETM demonstrated superior performance over the alternative methods on both tasks. Moreover, our model learned clinically meaningful graph-informed embedding of the EHR codes. In additional, our model is also able to discover interpretable and accurate patient representations for patient stratification and drug recommendations.
    Regularization-wise double descent: Why it occurs and how to eliminate it. (arXiv:2206.01378v1 [cs.LG])
    The risk of overparameterized models, in particular deep neural networks, is often double-descent shaped as a function of the model size. Recently, it was shown that the risk as a function of the early-stopping time can also be double-descent shaped, and this behavior can be explained as a superposition of bias-variance tradeoffs. In this paper, we show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength, both in theory and practice. We find that for linear regression, a double descent shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately. Motivated by this result, we study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer. Lastly, we study a 5-layer CNN and ResNet-18 trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and demonstrate that all exhibit double descent behavior as a function of the regularization strength.
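    The linear-regression version of the experiment is easy to reproduce: sweep the L2 strength and record the test risk, looking for a curve with two descents. A minimal sketch:

        import numpy as np

        def ridge_risk_curve(X, y, X_test, y_test, lambdas):
            """Test risk of ridge regression as a function of regularization strength;
            a double-descent-shaped curve shows two local minima across lambdas."""
            d = X.shape[1]
            risks = []
            for lam in lambdas:
                w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
                risks.append(np.mean((X_test @ w - y_test) ** 2))
            return np.array(risks)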
    A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features. (arXiv:2206.01717v1 [cs.LG])
    An important characteristic of neural networks is their ability to learn representations of the input data with effective features for prediction, which is believed to be a key factor in their superior empirical performance. To better understand the source and benefit of feature learning in neural networks, we consider learning problems motivated by practical data, where the labels are determined by a set of class relevant patterns and the inputs are generated from these along with some background patterns. We prove that neural networks trained by gradient descent can succeed on these problems. The success relies on the emergence and improvement of effective features, which are learned among exponentially many candidates efficiently by exploiting the data (in particular, the structure of the input distribution). In contrast, no linear model on data-independent features of polynomial size can achieve comparably low errors. Furthermore, if the specific input structure is removed, then no polynomial algorithm in the Statistical Query model can learn even weakly. These results provide theoretical evidence showing that feature learning in neural networks depends strongly on the input structure and leads to superior performance. Our preliminary experimental results on synthetic and real data also provide positive support.
    MetaLR: Layer-wise Learning Rate based on Meta-Learning for Adaptively Fine-tuning Medical Pre-trained Models. (arXiv:2206.01408v1 [cs.CV])
    When applying transfer learning for medical image analysis, downstream tasks often have significant gaps with the pre-training tasks. Previous methods mainly focus on improving the transferability of the pre-trained models to bridge the gaps. In fact, model fine-tuning can also play a very important role in tackling this problem. Conventional fine-tuning updates all layers of a deep neural network (DNN) with a single learning rate (LR), which ignores the distinct transferability of each layer. In this work, we explore the behaviors of different layers in the fine-tuning stage. More precisely, we first hypothesize that lower-level layers are more domain-specific while higher-level layers are more task-specific, which is verified by a simple bi-directional fine-tuning scheme. It is harder for the more specific pre-trained layers to transfer to new tasks than for the general ones. On this basis, to make different layers better co-adapt to the downstream tasks according to their transferability, a meta-learning-based LR learner, namely MetaLR, is proposed to assign LRs to each layer automatically. Extensive experiments on various medical applications (i.e., POCUS, BUSI, Chest X-ray, and LiTS) confirm our hypothesis and show the superior performance of the proposed method over previous state-of-the-art fine-tuning methods.
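    Mechanically, per-layer learning rates are just optimizer parameter groups. The sketch below shows the plumbing in PyTorch with rates supplied as a plain dict; MetaLR's contribution is learning those rates by meta-learning rather than fixing them by hand.

        import torch

        def layerwise_optimizer(model, lrs, base_lr=1e-3):
            """lrs: dict mapping top-level layer names to learning rates (hypothetical input)."""
            groups = []
            for name, layer in model.named_children():
                params = list(layer.parameters())
                if params:  # skip parameter-free modules such as activations
                    groups.append({"params": params, "lr": lrs.get(name, base_lr)})
            return torch.optim.SGD(groups, lr=base_lr, momentum=0.9)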
    Detecting Pulmonary Embolism from Computed Tomography Using Convolutional Neural Network. (arXiv:2206.01344v1 [eess.IV])
    The clinical symptoms of pulmonary embolism (PE) are very diverse and non-specific, which makes it difficult to diagnose. In addition, pulmonary embolism has multiple triggers and is one of the major causes of vascular death. Therefore, if it can be detected and treated quickly, the risk of death in hospitalized patients can be significantly reduced. In the detection process, computed tomography pulmonary angiography (CTPA) is costly, and angiography requires the injection of contrast agents, which increases the risk of harm to the patient. Therefore, this study uses a deep learning approach, specifically a convolutional neural network, to detect pulmonary embolism in all patients who undergo a chest CT scan. With the proposed pulmonary embolism detection system, the possibility of pulmonary embolism can be flagged as soon as the patient's first CT image is acquired and the CTPA test scheduled immediately, saving more than a week of CT image screening time and providing timely diagnosis and treatment to the patient.
    CodedPaddedFL and CodedSecAgg: Straggler Mitigation and Secure Aggregation in Federated Learning. (arXiv:2112.08909v2 [cs.LG] UPDATED)
    We present two novel federated learning (FL) schemes that mitigate the effect of straggling devices by introducing redundancy on the devices' data across the network. Compared to other schemes in the literature, which deal with stragglers or device dropouts by ignoring their contribution, the proposed schemes do not suffer from the client drift problem. The first scheme, CodedPaddedFL, mitigates the effect of stragglers while retaining the privacy level of conventional FL. It combines one-time padding for user data privacy with gradient codes to yield straggler resiliency. The second scheme, CodedSecAgg, provides straggler resiliency and robustness against model inversion attacks and is based on Shamir's secret sharing. We apply CodedPaddedFL and CodedSecAgg to a classification problem. For a scenario with 120 devices, CodedPaddedFL achieves a speed-up factor of 18 for an accuracy of 95% on the MNIST dataset compared to conventional FL. Furthermore, it yields similar performance in terms of latency compared to a recently proposed scheme by Prakash et al. without the shortcoming of additional leakage of private data. CodedSecAgg outperforms the state-of-the-art secure aggregation scheme LightSecAgg by a speed-up factor of 6.6-18.7 for the MNIST dataset for an accuracy of 95%.
    A New Security Boundary of Component Differentially Challenged XOR PUFs Against Machine Learning Modeling Attacks. (arXiv:2206.01314v1 [cs.CR])
    Physical Unclonable Functions (PUFs) are promising security primitives for resource-constrained network nodes. The XOR Arbiter PUF (XOR PUF or XPUF) is an intensively studied PUF invented to improve the security of the Arbiter PUF, probably the most lightweight delay-based PUF. Recently, highly powerful machine learning attack methods were discovered and were able to easily break large-sized XPUFs, which were highly secure against earlier machine learning attack methods. Component-differentially-challenged XPUFs (CDC-XPUFs) are XPUFs whose component PUFs receive different challenges. Studies showed they were much more secure against machine learning attacks than conventional XPUFs, whose component PUFs receive the same challenge. But these studies were all based on earlier machine learning attack methods, and hence it is not clear if CDC-XPUFs can remain secure under the recently discovered powerful attack methods. In this paper, the two most powerful current machine learning methods for attacking XPUFs are adapted to CDC-XPUFs by fine-tuning their parameters. Attack experiments using both simulated PUF data and silicon data generated from PUFs implemented on field-programmable gate arrays (FPGAs) were carried out, and the experimental results showed that some previously secure CDC-XPUFs of certain circuit parameter values are no longer secure under the adapted new attack methods, while many more CDC-XPUFs of other circuit parameter values remain secure. Thus, our experimental attack study has re-defined the boundary between the secure region and the insecure region of the PUF circuit parameter space, providing PUF manufacturers and IoT security application developers with valuable information in choosing PUFs with secure parameter values.
    MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data. (arXiv:2206.01347v1 [cs.AI])
    Numerical reasoning over hybrid data containing both textual and tabular content (e.g., financial reports) has recently attracted much attention in the NLP community. However, existing question answering (QA) benchmarks over hybrid data only include a single flat table in each document and thus lack examples of multi-step numerical reasoning across multiple hierarchical tables. To facilitate data analytical progress, we construct a new large-scale benchmark, MultiHiertt, with QA pairs over Multi Hierarchical Tabular and Textual data. MultiHiertt is built from a wealth of financial reports and has the following unique characteristics: 1) each document contains multiple tables and longer unstructured texts; 2) most of the tables are hierarchical; 3) the reasoning process required for each question is more complex and challenging than in existing benchmarks; and 4) fine-grained annotations of reasoning processes and supporting facts are provided to reveal complex numerical reasoning. We further introduce a novel QA model termed MT2Net, which first applies fact retrieval to extract relevant supporting facts from both tables and text and then uses a reasoning module to perform symbolic reasoning over the retrieved facts. We conduct comprehensive experiments on various baselines. The experimental results show that MultiHiertt presents a strong challenge for existing baselines, whose results lag far behind the performance of human experts. The dataset and code are publicly available at https://github.com/psunlpgroup/MultiHiertt.
    Fair Classification via Transformer Neural Networks: Case Study of an Educational Domain. (arXiv:2206.01410v1 [cs.LG])
    Educational technologies nowadays increasingly use data and Machine Learning (ML) models. These give students, instructors, and administrators support and insights for choosing optimal policies. However, it is well acknowledged that ML models are subject to bias, which raises concerns about fairness, bias, and discrimination when these automated ML algorithms are used in education, along with their unintended and unforeseen negative consequences. Bias enters the decision-making through the datasets used for training ML models and through the model architecture. This paper presents a preliminary investigation of fairness constraints in transformer neural networks on the Law School and Student-Mathematics datasets. The transformer models map these raw datasets into a richer natural language processing (NLP) representation space while solving the fair classification task. We employ fairness metrics for evaluation and examine the trade-off between fairness and accuracy. We report F1, SPD, EOD, and accuracy for different architectures from the transformer model class.
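    For readers unfamiliar with the acronyms, the two fairness metrics can be sketched as below, assuming SPD denotes statistical parity difference and EOD equal opportunity difference (common readings; the paper may define them differently).

        import numpy as np

        def spd(y_pred, group):
            """Statistical parity difference: gap in positive prediction rates between groups."""
            return y_pred[group == 1].mean() - y_pred[group == 0].mean()

        def eod(y_pred, y_true, group):
            """Equal opportunity difference: gap in true positive rates between groups."""
            tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
            return tpr(1) - tpr(0)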
    Global Self-Attention as a Replacement for Graph Convolution. (arXiv:2108.03348v3 [cs.LG] UPDATED)
    We propose an extension to the transformer neural network architecture for general-purpose graph learning by adding a dedicated pathway for pairwise structural information, called edge channels. The resultant framework - which we call Edge-augmented Graph Transformer (EGT) - can directly accept, process and output structural information of arbitrary form, which is important for effective learning on graph-structured data. Our model exclusively uses global self-attention as an aggregation mechanism rather than static localized convolutional aggregation. This allows for unconstrained long-range dynamic interactions between nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. We verify the performance of EGT in a wide range of graph-learning experiments on benchmark datasets, in which it outperforms Convolutional/Message-Passing Graph Neural Networks. EGT sets a new state-of-the-art for the quantum-chemical regression task on the OGB-LSC PCQM4Mv2 dataset containing 3.8 million molecular graphs. Our findings indicate that global self-attention based aggregation can serve as a flexible, adaptive and effective replacement of graph convolution for general-purpose graph learning. Therefore, convolutional local neighborhood aggregation is not an essential inductive bias.
    Learning a Restricted Boltzmann Machine using biased Monte Carlo sampling. (arXiv:2206.01310v1 [cs.LG])
    Restricted Boltzmann Machines are simple and powerful generative models capable of encoding any complex dataset. Despite all their advantages, in practice, training is often unstable, and it is hard to assess its quality because the dynamics are hampered by extremely slow time-dependencies. This situation becomes critical when dealing with low-dimensional clustered datasets, where the time needed to ergodically sample the trained models becomes computationally prohibitive. In this work, we show that this divergence of Monte Carlo mixing times is related to a phase coexistence phenomenon, similar to that encountered in physics in the vicinity of a first order phase transition. We show that sampling the equilibrium distribution via Markov Chain Monte Carlo can be dramatically accelerated using biased sampling techniques, in particular the Tethered Monte Carlo method (TMC). This sampling technique efficiently solves the problem of evaluating the quality of a given trained model and of generating new samples in reasonable times. In addition, we show that this sampling technique can also be exploited to improve the computation of the log-likelihood gradient during training, which produces dramatic improvements when training RBMs with artificial clustered datasets. When dealing with real low-dimensional datasets, this new training procedure fits RBM models with significantly faster relaxational dynamics than those obtained with standard PCD recipes. We also show that TMC sampling can be used to recover the free-energy profile of the RBM, which turns out to be extremely useful for computing the probability distribution of a given model and for improving the generation of new decorrelated samples on slow PCD trained models.
    Sample-Efficient Reinforcement Learning of Partially Observable Markov Games. (arXiv:2206.01315v1 [cs.LG])
    This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions that reveal incomplete information about the underlying state of the system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when compared against the optimal maximin policies. To the best of our knowledge, this work provides the first line of sample-efficient results for learning POMGs.
    The geometry of integration in text classification RNNs. (arXiv:2010.15114v2 [cs.LG] UPDATED)
    Despite the widespread application of recurrent neural networks (RNNs) across a variety of tasks, a unified understanding of how RNNs solve these tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of a specific natural language processing task: text classification. Using tools from dynamical systems analysis, we study recurrent networks trained on a battery of both natural and synthetic text classification tasks. We find the dynamics of these trained RNNs to be both interpretable and low-dimensional. Specifically, across architectures and datasets, RNNs accumulate evidence for each class as they process the text, using a low-dimensional attractor manifold as the underlying mechanism. Moreover, the dimensionality and geometry of the attractor manifold are determined by the structure of the training dataset; in particular, we describe how simple word-count statistics computed on the training dataset can be used to predict these properties. Our observations span multiple architectures and datasets, reflecting a common mechanism RNNs employ to perform text classification. To the degree that integration of evidence towards a decision is a common computational primitive, this work lays the foundation for using dynamical systems techniques to study the inner workings of RNNs.
    Distributional Reinforcement Learning with Unconstrained Monotonic Neural Networks. (arXiv:2106.03228v2 [cs.LG] UPDATED)
    The distributional reinforcement learning (RL) approach advocates for representing the complete probability distribution of the random return instead of only modelling its expectation. A distributional RL algorithm may be characterised by two main components, namely the representation of the distribution together with its parameterisation and the probability metric defining the loss. The present research work considers the unconstrained monotonic neural network (UMNN) architecture, a universal approximator of continuous monotonic functions which is particularly well suited for modelling different representations of a distribution (PDF, CDF, QF). This property enables the efficient decoupling of the effect of the function approximator class from that of the probability metric. The research paper firstly introduces a methodology for learning different representations of the random return distribution. Secondly, a novel distributional RL algorithm named unconstrained monotonic deep Q-network (UMDQN) is presented. Lastly, in light of this new algorithm, an empirical comparison is performed between three probability quasimetrics, namely the Kullback-Leibler divergence, Cramer distance, and Wasserstein distance. The results highlight the main strengths and weaknesses associated with each probability metric together with an important limitation of the Wasserstein distance. This research concludes by calling for a reconsideration of all probability metrics in distributional RL, contrasting with the clear dominance of the Wasserstein distance in recent publications.
    SPD domain-specific batch normalization to crack interpretable unsupervised domain adaptation in EEG. (arXiv:2206.01323v1 [cs.LG])
    Electroencephalography (EEG) provides access to neuronal dynamics non-invasively with millisecond resolution, rendering it a viable method in neuroscience and healthcare. However, its utility is limited as current EEG technology does not generalize well across domains (i.e., sessions and subjects) without expensive supervised re-calibration. Contemporary methods cast this transfer learning (TL) problem as a multi-source/-target unsupervised domain adaptation (UDA) problem and address it with deep learning or shallow, Riemannian geometry aware alignment methods. Both directions have, so far, failed to consistently close the performance gap to state-of-the-art domain-specific methods based on tangent space mapping (TSM) on the symmetric positive definite (SPD) manifold. Here, we propose a theory-based machine learning framework that enables, for the first time, learning domain-invariant TSM models in an end-to-end fashion. To achieve this, we propose a new building block for geometric deep learning, which we denote SPD domain-specific momentum batch normalization (SPDDSMBN). An SPDDSMBN layer can transform domain-specific SPD inputs into domain-invariant SPD outputs, and can be readily applied to multi-source/-target and online UDA scenarios. In extensive experiments with 6 diverse EEG brain-computer interface (BCI) datasets, we obtain state-of-the-art performance in inter-session and -subject TL with a simple, intrinsically interpretable network architecture, which we denote TSMNet.
    Hybrid Models for Mixed Variables in Bayesian Optimization. (arXiv:2206.01409v1 [cs.LG])
    We systematically describe the problem of simultaneous surrogate modeling of mixed variables (i.e., continuous, integer and categorical variables) in the Bayesian optimization (BO) context. We provide a unified hybrid model using both Monte-Carlo tree search (MCTS) and Gaussian processes (GP) that encompasses and generalizes multiple state-of-the-art mixed BO surrogates. Based on this architecture, we propose a new dynamic model selection criterion among novel candidate families of covariance kernels, including non-stationary kernels and associated families. We study a range of benchmark problems; the results highlight the effectiveness of our method compared to most state-of-the-art mixed-variable methods in BO.
    Sequential Permutation Testing of Random Forest Variable Importance Measures. (arXiv:2206.01284v1 [stat.ME])
    Hypothesis testing of random forest (RF) variable importance measures (VIMP) remains the subject of ongoing research. Among recent developments, heuristic approaches to parametric testing have been proposed whose distributional assumptions are based on empirical evidence. Other formal tests under regularity conditions were derived analytically. However, these approaches can be computationally expensive or even practically infeasible. This problem also occurs with non-parametric permutation tests, which are, however, distribution-free and can generically be applied to any type of RF and VIMP. Embracing this advantage, it is proposed here to use sequential permutation tests and sequential p-value estimation to reduce the high computational costs associated with conventional permutation tests. The widely used permutation VIMP serves as a practical and relevant application example. The results of simulation studies confirm that the theoretical properties of the sequential tests apply, that is, the type-I error probability is controlled at a nominal level and high power is maintained with considerably fewer permutations than conventional permutation testing requires. The numerical stability of the methods is investigated in two additional application studies. In summary, theoretically sound sequential permutation testing of VIMP is possible at greatly reduced computational costs. Recommendations for application are given. A respective implementation is provided through the accompanying R package rfvimptest. The approach can also be easily applied to any kind of prediction model.
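    The core computational trick generalizes beyond VIMP: stop permuting as soon as the p-value estimate is decided. Below is a minimal, hedged sketch of a curtailed Monte Carlo permutation test in the spirit of Besag and Clifford (1991); it is not the rfvimptest implementation, and the difference-in-means statistic is only a stand-in for a VIMP statistic.

        import numpy as np

        def sequential_perm_pvalue(stat_fn, x, y, h=10, max_perm=5000, seed=None):
            """Stop as soon as h permuted statistics reach the observed one;
            the p-value estimate is then h / (number of permutations used)."""
            rng = np.random.default_rng(seed)
            t_obs = stat_fn(x, y)
            exceed = 0
            for m in range(1, max_perm + 1):
                if stat_fn(x, rng.permutation(y)) >= t_obs:
                    exceed += 1
                if exceed == h:
                    return h / m                      # early stop
            return (exceed + 1) / (max_perm + 1)      # conventional estimate

        rng = np.random.default_rng(0)
        x = np.r_[rng.normal(0.5, 1, 50), rng.normal(0.0, 1, 50)]
        y = np.r_[np.ones(50), np.zeros(50)]
        diff_means = lambda x, y: x[y == 1].mean() - x[y == 0].mean()
        print(sequential_perm_pvalue(diff_means, x, y, seed=1))

    Under the null hypothesis the loop tends to terminate after few permutations, which is exactly where the computational savings come from.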
    Incremental Learning Meets Transfer Learning: Application to Multi-site Prostate MRI Segmentation. (arXiv:2206.01369v1 [cs.CV])
    Many medical datasets have recently been created for medical image segmentation tasks, and it is natural to ask whether we can use them to sequentially train a single model that (1) performs better on all these datasets and (2) generalizes well and transfers better to an unknown target site domain. Prior works have pursued this goal by jointly training one model on multi-site datasets, which achieves competitive performance on average; however, such methods rely on the assumption that all training data are available, which limits their effectiveness in practical deployment. In this paper, we propose a novel multi-site segmentation framework called incremental-transfer learning (ITL), which learns a model from multi-site datasets in an end-to-end sequential fashion. Specifically, "incremental" refers to training on sequentially constructed datasets, and "transfer" is achieved by leveraging useful information from the linear combination of embedding features on each dataset. Our framework trains a network consisting of a site-agnostic encoder with pre-trained weights and at most two segmentation decoder heads, together with a novel site-level incremental loss designed to generalize well to the target domain. We also show, for the first time, that our ITL training scheme is able to alleviate the challenging catastrophic forgetting problem in incremental learning. We conduct experiments using five challenging benchmark datasets to validate the effectiveness of our incremental-transfer learning approach. Our approach makes minimal assumptions on computation resources and domain-specific expertise, and hence constitutes a strong starting point in multi-site medical image segmentation.
    GASP, a generalized framework for agglomerative clustering of signed graphs and its application to Instance Segmentation. (arXiv:1906.11713v2 [cs.CV] UPDATED)
    We propose a theoretical framework that generalizes simple and fast algorithms for hierarchical agglomerative clustering to weighted graphs with both attractive and repulsive interactions between the nodes. This framework defines GASP, a Generalized Algorithm for Signed graph Partitioning, and allows us to explore many combinations of different linkage criteria and cannot-link constraints. We prove the equivalence of existing clustering methods to some of those combinations and introduce new algorithms for combinations that have not been studied before. We study both theoretical and empirical properties of these combinations and prove that some of these define an ultrametric on the graph. We conduct a systematic comparison of various instantiations of GASP on a large variety of both synthetic and existing signed clustering problems, in terms of accuracy but also efficiency and robustness to noise. Lastly, we show that some of the algorithms included in our framework, when combined with the predictions from a CNN model, result in a simple bottom-up instance segmentation pipeline. Going all the way from pixels to final segments with a simple procedure, we achieve state-of-the-art accuracy on the CREMI 2016 EM segmentation benchmark without requiring domain-specific superpixels.
    Equivariant Reinforcement Learning for Quadrotor UAV. (arXiv:2206.01233v1 [cs.LG])
    This paper presents an equivariant reinforcement learning framework for quadrotor unmanned aerial vehicles. Successful training of reinforcement learning agents often requires numerous interactions with the environment, which hinders applicability, especially when the available computational resources are limited or when there is no reliable simulation model. We identify an equivariance property of the quadrotor dynamics such that the dimension of the state required in training is reduced by one, thereby substantially improving the sampling efficiency of reinforcement learning. This is illustrated by numerical examples with the popular reinforcement learning techniques TD3 and SAC.
    Continuous Control with Action Quantization from Demonstrations. (arXiv:2110.10149v2 [cs.LG] UPDATED)
    In this paper, we propose a novel Reinforcement Learning (RL) framework for problems with continuous action spaces: Action Quantization from Demonstrations (AQuaDem). The proposed approach consists of learning a discretization of continuous action spaces from human demonstrations. This discretization returns a set of plausible actions (in light of the demonstrations) for each input state, thus capturing the priors of the demonstrator and their multimodal behavior. By discretizing the action space, any discrete-action deep RL technique can be readily applied to the continuous control problem. Experiments show that the proposed approach outperforms state-of-the-art methods such as SAC in the RL setup and GAIL in the Imitation Learning setup. We provide a website with interactive videos: https://google-research.github.io/aquadem/ and make the code available: https://github.com/google-research/google-research/tree/master/aquadem.
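    To make the discretization idea concrete, here is a hedged, simplified sketch: instead of the paper's state-conditional candidate actions learned with a neural network, it clusters all demonstrated actions globally with K-means, so that a discrete-action agent can pick a centroid index. The three action "modes" are synthetic stand-ins.

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        demo_actions = np.vstack([
            rng.normal([0.8, 0.0], 0.05, size=(500, 2)),   # e.g. "steer right"
            rng.normal([-0.8, 0.0], 0.05, size=(500, 2)),  # e.g. "steer left"
            rng.normal([0.0, 1.0], 0.05, size=(500, 2)),   # e.g. "accelerate"
        ])

        K = 3  # number of discrete actions exposed to the discrete-action agent
        codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(demo_actions)

        def decode(k):
            """The discrete policy picks index k; the env receives a centroid."""
            return codebook.cluster_centers_[k]

        print([np.round(decode(k), 2) for k in range(K)])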
    Multiband VAE: Latent Space Alignment for Knowledge Consolidation in Continual Learning. (arXiv:2106.12196v2 [cs.LG] UPDATED)
    We propose a new method for unsupervised generative continual learning through realignment of a Variational Autoencoder's latent space. Deep generative models suffer from catastrophic forgetting in the same way as other neural structures. Recent generative continual learning works approach this problem and try to learn from new data without forgetting previous knowledge. However, those methods usually focus on artificial scenarios where examples share almost no similarity between subsequent portions of data, an assumption that is not realistic in real-life applications of continual learning. In this work, we identify this limitation and posit the goal of generative continual learning as a knowledge accumulation task. We solve it by continuously aligning latent representations of new data, which we call bands, in an additional latent space where examples are encoded independently of their source task. In addition, we introduce a method for controlled forgetting of past data that simplifies this process. On top of the standard continual learning benchmarks, we propose a novel, challenging knowledge consolidation scenario and show that the proposed approach outperforms state-of-the-art methods by up to twofold across all experiments and the additional real-life evaluation. To our knowledge, Multiband VAE is the first method to show forward and backward knowledge transfer in generative continual learning.
    Towards Evading the Limits of Randomized Smoothing: A Theoretical Analysis. (arXiv:2206.01715v1 [cs.LG])
    Randomized smoothing is the dominant standard for provable defenses against adversarial examples. Nevertheless, this method has recently been proven to suffer from important information theoretic limitations. In this paper, we argue that these limitations are not intrinsic, but merely a byproduct of current certification methods. We first show that these certificates use too little information about the classifier, and are in particular blind to the local curvature of the decision boundary. This leads to severely sub-optimal robustness guarantees as the dimension of the problem increases. We then show that it is theoretically possible to bypass this issue by collecting more information about the classifier. More precisely, we show that it is possible to approximate the optimal certificate with arbitrary precision, by probing the decision boundary with several noise distributions. Since this process is executed at certification time rather than at test time, it entails no loss in natural accuracy while enhancing the quality of the certificates. This result fosters further research on classifier-specific certification and demonstrates that randomized smoothing is still worth investigating. Although classifier-specific certification may induce more computational cost, we also provide some theoretical insight on how to mitigate it.
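    For context, the baseline certification procedure being criticized here is the standard Monte Carlo certificate of Cohen et al. (2019), which uses only an estimate of the top-class probability under Gaussian noise. A minimal sketch, with a toy base classifier standing in for a real one:

        import numpy as np
        from scipy.stats import beta, norm

        def certify(f, x, sigma=0.25, n=1000, alpha=0.001, seed=0):
            """Certify a smoothed classifier at x: returns (class, L2 radius)."""
            rng = np.random.default_rng(seed)
            labels = f(x[None] + rng.normal(0.0, sigma, size=(n,) + x.shape))
            top = np.bincount(labels).argmax()
            k = int((labels == top).sum())
            # Clopper-Pearson lower confidence bound on P(f(x + noise) = top).
            p_lo = beta.ppf(alpha, k, n - k + 1) if k < n else alpha ** (1.0 / n)
            if p_lo <= 0.5:
                return None, 0.0                    # abstain
            return int(top), sigma * norm.ppf(p_lo)

        f = lambda batch: (batch[:, 0] > 0).astype(int)   # toy base classifier
        print(certify(f, np.array([0.3, -0.1])))

    The paper's point is that this certificate ignores everything about the decision boundary except one probability; probing with several noise distributions recovers more of that lost information.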
    Scalar is Not Enough: Vectorization-based Unbiased Learning to Rank. (arXiv:2206.01702v1 [cs.IR])
    Unbiased learning to rank (ULTR) aims to train an unbiased ranking model from biased user click logs. Most of the current ULTR methods are based on the examination hypothesis (EH), which assumes that the click probability can be factorized into two scalar functions, one related to ranking features and the other related to bias factors. Unfortunately, the interactions among features, bias factors and clicks are complicated in practice, and usually cannot be factorized in this independent way. Fitting click data with EH could lead to model misspecification and bring the approximation error. In this paper, we propose a vector-based EH and formulate the click probability as a dot product of two vector functions. This solution is complete due to its universality in fitting arbitrary click functions. Based on it, we propose a novel model named Vectorization to adaptively learn the relevance embeddings and sort documents by projecting embeddings onto a base vector. Extensive experiments show that our method significantly outperforms the state-of-the-art ULTR methods on complex real clicks as well as simple simulated clicks.
    HEX: Human-in-the-loop Explainability via Deep Reinforcement Learning. (arXiv:2206.01343v1 [cs.LG])
    The use of machine learning (ML) models in decision-making contexts, particularly in high-stakes decision-making, is fraught with issues and peril since a person - not a machine - must ultimately be held accountable for the consequences of the decisions made using such systems. Machine learning explainability (MLX) promises to provide decision-makers with prediction-specific rationale, assuring them that the model-elicited predictions are made for the right reasons and are thus reliable. Few works explicitly consider this key human-in-the-loop (HITL) component, however. In this work we propose HEX, a human-in-the-loop deep reinforcement learning approach to MLX. HEX incorporates 0-distrust projection to synthesize decider-specific explanation-providing policies from any arbitrary classification model. HEX is also constructed to operate in limited or reduced training data scenarios, such as those employing federated learning. Our formulation explicitly considers the decision boundary of the ML model in question, rather than the underlying training data, which is a shortcoming of many model-agnostic MLX methods. Our proposed methods thus synthesize HITL MLX policies that explicitly capture the decision boundary of the model in question for use in limited data scenarios.
    Nonstationary Bandit Learning via Predictive Sampling. (arXiv:2205.01970v2 [cs.LG] UPDATED)
    Although Thompson sampling is widely used in stationary environments, it does not effectively account for nonstationarities. To address this limitation, we propose predictive sampling, a policy that balances between exploration and exploitation in nonstationary bandit environments. It is equivalent to Thompson sampling when specialized to stationary environments, but much more effective across a range of nonstationary environments because it deprioritizes investment in acquiring information that will quickly lose relevance. To offer insight into the efficacy of predictive sampling, we establish a regret bound. This bound highlights dependence on the rate at which new information arrives to alter the environment. In addition, we conduct experiments on bandit environments with varying rates of information arrival and observe that predictive sampling outperforms Thompson sampling.
    PNODE: A memory-efficient neural ODE framework based on high-level adjoint differentiation. (arXiv:2206.01298v1 [cs.LG])
    Neural ordinary differential equations (neural ODEs) have emerged as a novel network architecture that bridges dynamical systems and deep learning. However, the gradient obtained with the continuous adjoint method in the vanilla neural ODE is not reverse-accurate. Other approaches suffer either from excessive memory requirement due to deep computational graphs or from limited choices for the time integration scheme, hampering their application to large-scale complex dynamical systems. To achieve accurate gradients without compromising memory efficiency and flexibility, we present a new neural ODE framework, PNODE, based on high-level discrete adjoint algorithmic differentiation. By leveraging discrete adjoint time integrators and advanced checkpointing strategies tailored for these integrators, PNODE can provide a balance between memory and computational costs, while computing the gradients consistently and accurately. We provide an open-source implementation based on PyTorch and PETSc, one of the most commonly used portable, scalable scientific computing libraries. We demonstrate the performance through extensive numerical experiments on image classification and continuous normalizing flow problems. We show that PNODE achieves the highest memory efficiency when compared with other reverse-accurate methods. On the image classification problems, PNODE is up to two times faster than the vanilla neural ODE and up to 2.3 times faster than the best existing reverse-accurate method. We also show that PNODE enables the use of the implicit time integration methods that are needed for stiff dynamical systems.
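    As a hedged aside on what "reverse-accurate" means (this is not PNODE's PETSc-based implementation): backpropagating through an unrolled explicit time integrator computes exactly the discrete adjoint of that scheme, at the cost of storing every intermediate state, which is the memory problem that checkpointing strategies then address.

        import torch

        def euler_unroll(f, z0, t0, t1, steps):
            """Explicit Euler; autograd through this unroll yields the discrete
            adjoint (reverse-accurate gradients), storing all states."""
            h = (t1 - t0) / steps
            z = z0
            for _ in range(steps):
                z = z + h * f(z)
            return z

        theta = torch.tensor([-0.7], requires_grad=True)
        f = lambda z: theta * z                 # dz/dt = theta * z
        zT = euler_unroll(f, torch.tensor([1.0]), 0.0, 1.0, steps=100)
        loss = (zT - 0.3).pow(2).sum()
        loss.backward()
        print(zT.item(), theta.grad.item())     # z(1) ~ exp(theta) analytically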
    One-Bit Matrix Completion with Differential Privacy. (arXiv:2110.00719v3 [cs.CR] UPDATED)
    As a prevailing collaborative filtering method for recommendation systems, one-bit matrix completion requires data collected from users to provide personalized service. Due to insidious attacks and unexpected inference, the release of users' data often raises serious privacy concerns. To address this issue, differential privacy (DP) has been widely used in standard matrix completion models. To date, however, little is known about how to apply DP to achieve privacy protection in one-bit matrix completion. In this paper, we propose a unified framework for ensuring a strong privacy guarantee of one-bit matrix completion with DP. In our framework, we develop four different private perturbation mechanisms corresponding to different stages of one-bit matrix completion. For each mechanism, we design a privacy-preserving algorithm and provide a theoretical recovery error bound under proper conditions. Numerical experiments on synthetic and real-world datasets demonstrate the effectiveness of our proposal. Compared to one-bit matrix completion without privacy protection, our proposed mechanisms can maintain high-level privacy protection with a marginal loss of completion accuracy.
    Understanding Deep Contrastive Learning via Coordinate-wise Optimization. (arXiv:2201.12680v4 [cs.LG] UPDATED)
    We show that Contrastive Learning (CL) under a broad family of loss functions (including InfoNCE) has a unified formulation of coordinate-wise optimization on the network parameter $\boldsymbol{\theta}$ and pairwise importance $\alpha$, where the \emph{max player} $\boldsymbol{\theta}$ learns representation for contrastiveness, and the \emph{min player} $\alpha$ puts more weights on pairs of distinct samples that share similar representations. The resulting formulation, called $\alpha$-CL, not only unifies various existing contrastive losses, which differ in how sample-pair importance $\alpha$ is constructed, but also extrapolates to give novel contrastive losses beyond popular ones, opening a new avenue of contrastive loss design. These novel losses yield comparable (or better) performance on CIFAR10 and STL-10 than classic InfoNCE. Furthermore, we also analyze the max player in detail: we prove that with fixed $\alpha$, the max player is equivalent to Principal Component Analysis (PCA) for a deep linear network, and almost all local minima are global and rank-1, recovering optimal PCA solutions. Finally, we extend our analysis of the max player to 2-layer ReLU networks, showing that its fixed points can have higher ranks.
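    For reference, the InfoNCE loss that this analysis generalizes can be written in a few lines; in the $\alpha$-CL view, the softmax weighting over negative pairs inside the cross-entropy plays the role of the min player $\alpha$. A minimal sketch with random features as stand-ins for encoder outputs:

        import torch
        import torch.nn.functional as F

        def info_nce(z1, z2, tau=0.1):
            """z1[i], z2[i] are two views of sample i; other batch entries
            serve as negatives. Returns the InfoNCE loss."""
            z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
            logits = z1 @ z2.T / tau            # cosine similarity / temperature
            targets = torch.arange(z1.size(0))
            return F.cross_entropy(logits, targets)

        z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
        print(info_nce(z1, z2).item())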
    Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code. (arXiv:2206.01335v1 [cs.SE])
    Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises the question of whether they could serve as a basis for building a wide range of code generation tools. Traditionally, such tools are built manually and separately for each task. Instead, few-shot learning may make it possible to obtain different tools from a single pre-trained language model by simply providing a few examples or a natural language description of the expected tool behavior. This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose. We consider three code manipulation and code generation tasks targeted by a range of traditional tools: (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation. For each task, we compare few-shot learning to a manually built tool. Our results show that the model-based tools complement (code mutation), are on par with (test oracle generation), or even outperform (test case generation) their respective traditionally built tools, while imposing far less effort to develop them. By comparing the effectiveness of different variants of the model-based tools, we provide insights on how to design an appropriate input ("prompt") to the model and what influence the size of the model has. For example, we find that providing a small natural language description of the code generation task is an easy way to improve predictions. Overall, we conclude that few-shot language models are surprisingly effective, yet there is still more work to be done, such as exploring more diverse ways of prompting and tackling even more involved tasks.
    KCRL: Krasovskii-Constrained Reinforcement Learning with Guaranteed Stability in Nonlinear Dynamical Systems. (arXiv:2206.01704v1 [cs.LG])
    Learning a dynamical system requires stabilizing the unknown dynamics to avoid state blow-ups. However, current reinforcement learning (RL) methods lack stabilization guarantees, which limits their applicability for the control of safety-critical systems. We propose a model-based RL framework with formal stability guarantees, Krasovskii Constrained RL (KCRL), that adopts Krasovskii's family of Lyapunov functions as a stability constraint. The proposed method learns the system dynamics up to a confidence interval using feature representation, e.g. Random Fourier Features. It then solves a constrained policy optimization problem with a stability constraint based on Krasovskii's method using a primal-dual approach to recover a stabilizing policy. We show that KCRL is guaranteed to learn a stabilizing policy in a finite number of interactions with the underlying unknown system. We also derive the sample complexity upper bound for stabilization of unknown nonlinear dynamical systems via the KCRL framework.
    Differentially Private Multivariate Time Series Forecasting of Aggregated Human Mobility With Deep Learning: Input or Gradient Perturbation?. (arXiv:2205.00436v2 [cs.LG] UPDATED)
    This paper investigates the problem of forecasting multivariate aggregated human mobility while preserving the privacy of the individuals concerned. Differential privacy, a state-of-the-art formal notion, has been used as the privacy guarantee in two different and independent steps when training deep learning models. On one hand, we considered \textit{gradient perturbation}, which uses the differentially private stochastic gradient descent algorithm to guarantee the privacy of each time series sample in the learning stage. On the other hand, we considered \textit{input perturbation}, which adds differential privacy guarantees in each sample of the series before applying any learning. We compared four state-of-the-art recurrent neural networks: Long Short-Term Memory, Gated Recurrent Unit, and their bidirectional architectures, i.e., Bidirectional-LSTM and Bidirectional-GRU. Extensive experiments were conducted with a real-world multivariate mobility dataset, which we published openly along with this paper. As shown in the results, differentially private deep learning models trained under gradient or input perturbation achieve nearly the same performance as non-private deep learning models, with the loss in performance varying between $0.57\%$ and $2.8\%$. The contribution of this paper is significant for those involved in urban planning and decision-making, providing a solution to the human mobility multivariate forecast problem through differentially private deep learning models.
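    A minimal sketch of the input-perturbation side, under assumptions the abstract does not spell out (each person contributes at most one unit per count, and counts are clipped to bound sensitivity): clip each aggregated mobility count, then add Laplace noise calibrated to the clip bound.

        import numpy as np

        def perturb_series(series, epsilon=1.0, clip=100.0, seed=None):
            """Clip counts to [0, clip], then add Laplace(clip / epsilon) noise,
            a standard epsilon-DP input-perturbation recipe for bounded counts."""
            rng = np.random.default_rng(seed)
            clipped = np.clip(series, 0.0, clip)
            return clipped + rng.laplace(0.0, clip / epsilon, size=series.shape)

        mobility = np.random.default_rng(0).poisson(60, (7, 24)).astype(float)
        private = perturb_series(mobility, epsilon=1.0, seed=1)
        print(private[0, :4].round(1))

    Any model trained on the perturbed series then inherits the privacy guarantee by post-processing, which is why no further noise is needed in the learning stage.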
    Impact of the composition of feature extraction and class sampling in medicare fraud detection. (arXiv:2206.01413v1 [cs.LG])
    With healthcare being a critical aspect of society, health insurance has become an important scheme for minimizing medical expenses. Following this, the healthcare industry has seen a significant increase in fraudulent activities owing to increased insurance coverage, and fraud has become a significant contributor to rising medical care expenses, although its impact can be mitigated using fraud detection techniques. To detect fraud, machine learning techniques are used. The "Medicare Part D" insurance claims dataset released by the Centers for Medicare and Medicaid Services (CMS) of the United States federal government is used in this study to develop a fraud detection system. Employing machine learning algorithms on a class-imbalanced and high-dimensional Medicare dataset is a challenging task. To address these challenges, the present work performs feature extraction followed by data sampling and then applies various classification algorithms to obtain better performance. Feature extraction is a dimensionality reduction approach that converts attributes into linear or non-linear combinations of the actual attributes, generating a smaller and more diversified set of attributes and thus reducing the dimensionality. Data sampling is commonly used to address class imbalance, either by increasing the frequency of the minority class or reducing the frequency of the majority class to obtain approximately equal numbers of occurrences for both classes. The proposed approach is evaluated through standard performance metrics. To detect fraud efficiently, this study applies an autoencoder as the feature extraction technique, the synthetic minority oversampling technique (SMOTE) as the data sampling technique, and various gradient-boosted decision tree-based classifiers as the classification algorithm. The experimental results show that the combination of an autoencoder followed by SMOTE with the LightGBM classifier achieves the best results.
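    A hedged sketch of this pipeline on synthetic imbalanced data (the dataset, layer sizes, and hyper-parameters are placeholders, not those of the study): an undercomplete autoencoder for feature extraction, SMOTE on the extracted training features, and LightGBM for classification.

        import numpy as np
        import torch
        import torch.nn as nn
        from imblearn.over_sampling import SMOTE
        from lightgbm import LGBMClassifier
        from sklearn.datasets import make_classification
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=5000, n_features=100,
                                   n_informative=20, weights=[0.97],
                                   random_state=0)        # ~3% "fraud" class
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        # 1) Feature extraction: keep the 16-D autoencoder bottleneck.
        X_t = torch.tensor(X_tr, dtype=torch.float32)
        enc = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 16))
        dec = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 100))
        opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)
        for epoch in range(50):
            loss = nn.functional.mse_loss(dec(enc(X_t)), X_t)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            Z_tr = enc(X_t).numpy()
            Z_te = enc(torch.tensor(X_te, dtype=torch.float32)).numpy()

        # 2) Class sampling: SMOTE on the training embedding only.
        Z_bal, y_bal = SMOTE(random_state=0).fit_resample(Z_tr, y_tr)

        # 3) Classification: gradient-boosted trees on the balanced embedding.
        clf = LGBMClassifier(n_estimators=300).fit(Z_bal, y_bal)
        print("AUC:", roc_auc_score(y_te, clf.predict_proba(Z_te)[:, 1]))

    Note that sampling is applied after feature extraction and only to the training split, so the test distribution remains untouched.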
    BaCaDI: Bayesian Causal Discovery with Unknown Interventions. (arXiv:2206.01665v1 [cs.LG])
    Learning causal structures from observation and experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, a key challenge is that often the targets of the interventions are uncertain or unknown. Thus, standard causal discovery methods can no longer be used. To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering the causal structure that underlies data generated under various unknown experimental/interventional conditions. BaCaDI is fully differentiable and operates in the continuous space of latent probabilistic representations of both causal structures and interventions. This enables us to approximate complex posteriors via gradient-based variational inference and to reason about the epistemic uncertainty in the predicted structure. In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets. Finally, we demonstrate that, thanks to its rigorous Bayesian approach, our method provides well-calibrated uncertainty estimates.
    Offline Reinforcement Learning with Causal Structured World Models. (arXiv:2206.01474v1 [cs.LG])
    Model-based methods have recently shown promise for offline reinforcement learning (RL), which aims to learn good policies from historical data without interacting with the environment. Previous model-based offline RL methods learn fully connected networks as world-models that map the states and actions to the next-step states. However, it is sensible that a world-model should adhere to the underlying causal structure so that it supports learning an effective policy that generalizes well to unseen states. In this paper, we first provide theoretical results showing that causal world-models can outperform plain world-models for offline RL, by incorporating the causal structure into the generalization error bound. We then propose a practical algorithm, oFfline mOdel-based reinforcement learning with CaUsal Structure (FOCUS), to illustrate the feasibility of learning and leveraging causal structure in offline RL. Experimental results on two benchmarks show that FOCUS reconstructs the underlying causal structure accurately and robustly. Consequently, it performs better than plain model-based offline RL algorithms and other causal model-based RL algorithms.
    Towards Accelerating Training of Batch Normalization: A Manifold Perspective. (arXiv:2101.02916v2 [cs.LG] UPDATED)
    Batch normalization (BN) has become a critical component of diverse deep neural networks. A network with BN is invariant to positive linear re-scaling transformations, which means there exist infinitely many functionally equivalent networks with different scales of weights. However, optimizing these equivalent networks with a first-order method such as stochastic gradient descent will produce series of iterates converging to different local optima, owing to their different gradients during training. To obviate this, we propose a quotient manifold, the \emph{PSI manifold}, in which all the equivalent weights of a network with BN are regarded as the same element. We then construct gradient descent and stochastic gradient descent on the proposed PSI manifold to train networks with BN. The two algorithms guarantee that every group of equivalent weights (related by positive re-scaling) converges to the equivalent optima. Besides that, we give the convergence rates of the proposed algorithms on the PSI manifold. The results show that our methods accelerate training compared with the algorithms on the Euclidean weight space. Finally, empirical results verify that our algorithms consistently improve over existing methods in both convergence rate and generalization ability under various experimental settings.
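    The invariance that motivates the quotient construction is easy to verify numerically: in training mode, batch normalization cancels any positive rescaling of the preceding weights. A small sanity check (not the paper's code):

        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        x = torch.randn(32, 10)
        w = nn.Linear(10, 5, bias=False)
        bn = nn.BatchNorm1d(5)
        bn.train()                      # normalize with batch statistics

        y1 = bn(w(x))
        with torch.no_grad():
            w.weight.mul_(7.3)          # positive rescaling of the weights
        y2 = bn(w(x))
        print(torch.allclose(y1, y2, atol=1e-5))   # True: output unchanged

    Since the loss is identical along each rescaling orbit while Euclidean gradients are not, optimizing on a quotient manifold that identifies the orbit is the natural remedy.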
    Indirect Active Learning. (arXiv:2206.01454v1 [math.ST])
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.
    On the Privacy Properties of GAN-generated Samples. (arXiv:2206.01349v1 [cs.LG])
    The privacy implications of generative adversarial networks (GANs) are a topic of great interest, leading to several recent algorithms for training GANs with privacy guarantees. By drawing connections to the generalization properties of GANs, we prove that under some assumptions, GAN-generated samples inherently satisfy some (weak) privacy guarantees. First, we show that if a GAN is trained on m samples and used to generate n samples, the generated samples are (epsilon, delta)-differentially-private for (epsilon, delta) pairs where delta scales as O(n/m). We show that under some special conditions, this upper bound is tight. Next, we study the robustness of GAN-generated samples to membership inference attacks. We model membership inference as a hypothesis test in which the adversary must determine whether a given sample was drawn from the training dataset or from the underlying data distribution. We show that this adversary can achieve an area under the ROC curve that scales no better than O(m^{-1/4}).
    Expressiveness and Learnability: A Unifying View for Evaluating Self-Supervised Learning. (arXiv:2206.01251v1 [cs.LG])
    We propose a unifying view to analyze the representation quality of self-supervised learning (SSL) models without access to supervised labels, while being agnostic to the architecture, learning algorithm or data manipulation used during training. We argue that representations can be evaluated through the lens of expressiveness and learnability. We propose to use the Intrinsic Dimension (ID) to assess expressiveness and introduce Cluster Learnability (CL) to assess learnability. CL is measured as the learning speed of a KNN classifier trained to predict labels obtained by clustering the representations with K-means. We thus combine CL and ID into a single predictor: CLID. Through a large-scale empirical study with a diverse family of SSL algorithms, we find that CLID better correlates with in-distribution model performance than other competing recent evaluation schemes. We also benchmark CLID on out-of-domain generalization, where CLID serves as a predictor of the transfer performance of SSL models on several classification tasks, yielding improvements with respect to the competing baselines.
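    A hedged sketch of the learnability half of the recipe (the exact CLID construction may differ): pseudo-label the representations with K-means, then track how quickly a KNN classifier trained on growing subsets predicts those pseudo-labels. Random features stand in for real SSL embeddings.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        def cluster_learnability(Z, n_clusters=10, fracs=(0.01, 0.05, 0.1, 0.25)):
            """KNN accuracy on K-means pseudo-labels vs training-set size;
            faster-rising curves indicate more learnable representations."""
            labels = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=0).fit_predict(Z)
            Z_tr, Z_te, y_tr, y_te = train_test_split(
                Z, labels, test_size=0.5, random_state=0)
            curve = []
            for f in fracs:
                n = max(int(f * len(Z_tr)), n_clusters)
                knn = KNeighborsClassifier(n_neighbors=5).fit(Z_tr[:n], y_tr[:n])
                curve.append(knn.score(Z_te, y_te))
            return curve

        Z = np.random.default_rng(0).normal(size=(4000, 32))
        print(np.round(cluster_learnability(Z), 3))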
    Optimal Activation Functions for the Random Features Regression Model. (arXiv:2206.01332v1 [stat.ML])
    The asymptotic mean squared test error and sensitivity of the Random Features Regression model (RFR) have been recently studied. We build on this work and identify in closed-form the family of Activation Functions (AFs) that minimize a combination of the test error and sensitivity of the RFR under different notions of functional parsimony. We find scenarios under which the optimal AFs are linear, saturated linear functions, or expressible in terms of Hermite polynomials. Finally, we show how using optimal AFs impacts well-established properties of the RFR model, such as its double descent curve, and the dependency of its optimal regularization parameter on the observation noise level.
    A Learning-Based Method for Automatic Operator Selection in the Fanoos XAI System. (arXiv:2206.01722v1 [cs.LG])
    We describe an extension of the Fanoos XAI system [Bayani et al 2022] which enables the system to learn the appropriate action to take in order to satisfy a user's request for a description to be made more or less abstract. Specifically, descriptions of systems under analysis are stored in states, and in order to make a description more or less abstract, Fanoos selects an operator from a large library to apply to the state and generate a new description. Prior work on Fanoos predominantly used hand-written methods for operator selection; the current work allows Fanoos to leverage experience to learn the best operator to apply in a particular situation, balancing exploration and exploitation, leveraging expert insights when available, and utilizing similarity between the current state and past states. Additionally, in order to bootstrap the learning process (i.e., as in curriculum learning), we describe a simulated user which we implemented; this simulation allows Fanoos to gain general insights that enable reasonable courses of action, insights which can later be refined by experience with real users, as opposed to interacting with humans completely from scratch. Code implementing the methods described in the paper can be found at https://github.com/DBay-ani/Operator_Selection_Learning_Extensions_For_Fanoos.
    Pay attention to your loss: understanding misconceptions about 1-Lipschitz neural networks. (arXiv:2104.05097v5 [cs.LG] UPDATED)
    Lipschitz-constrained networks have gathered considerable attention in the deep learning community, with usages ranging from Wasserstein distance estimation to the training of certifiably robust classifiers. However, they are still commonly considered less accurate, and their properties in learning are not yet fully understood. In this paper we clarify the matter: when it comes to classification, 1-Lipschitz neural networks enjoy several advantages over their unconstrained counterparts. First, we show that these networks are as accurate as classical ones, and can fit arbitrarily difficult boundaries. Then, relying on a robustness metric that reflects operational needs, we characterize the most robust classifier: the WGAN discriminator. Next, we show that 1-Lipschitz neural networks generalize well under milder assumptions. Finally, we show that hyper-parameters of the loss are crucial for controlling the accuracy-robustness trade-off. We conclude that they exhibit appealing properties that pave the way toward provably accurate and provably robust neural networks.
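    One common way to obtain (approximately) 1-Lipschitz networks is spectral normalization of each linear map; the paper's networks may use other parameterizations, so the sketch below is only illustrative. Since each normalized layer has Lipschitz constant at most 1 and ReLU is 1-Lipschitz, the composition is 1-Lipschitz.

        import torch
        import torch.nn as nn
        from torch.nn.utils import spectral_norm

        lip_net = nn.Sequential(
            spectral_norm(nn.Linear(2, 64)), nn.ReLU(),
            spectral_norm(nn.Linear(64, 64)), nn.ReLU(),
            spectral_norm(nn.Linear(64, 1)),
        )

        # Empirical check: output differences never exceed input differences
        # (up to the approximation error of the power iteration).
        x1, x2 = torch.randn(100, 2), torch.randn(100, 2)
        ratio = (lip_net(x1) - lip_net(x2)).norm(dim=1) / (x1 - x2).norm(dim=1)
        print(float(ratio.max()))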
    Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares. (arXiv:2206.01274v1 [stat.ML])
    Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
    Dynamic Kernel Selection for Improved Generalization and Memory Efficiency in Meta-learning. (arXiv:2206.01690v1 [cs.LG])
    Gradient-based meta-learning methods are prone to overfitting on the meta-training set, and this behaviour is more prominent with large and complex networks. Moreover, large networks restrict the application of meta-learning models on low-power edge devices. While choosing smaller networks avoids these issues to a certain extent, it affects the overall generalization, leading to reduced performance. Clearly, there is an approximately optimal choice of network architecture that is best suited for every meta-learning problem; however, identifying it beforehand is not straightforward. In this paper, we present MetaDOCK, a task-specific dynamic kernel selection strategy for designing compressed CNN models that generalize well on unseen tasks in meta-learning. Our method is based on the hypothesis that, for a given set of similar tasks, not all kernels of the network are needed by each individual task. Rather, each task uses only a fraction of the kernels, and the selection of the kernels per task can be learnt dynamically as part of the inner update steps. MetaDOCK compresses the meta-model as well as the task-specific inner models, thus providing a significant reduction in model size for each task, and by constraining the number of active kernels for every task, it implicitly mitigates the issue of meta-overfitting. We show that, for the same inference budget, pruned versions of large CNN models obtained using our approach consistently outperform conventional choices of CNN models. MetaDOCK couples well with popular meta-learning approaches such as iMAML. The efficacy of our method is validated on the CIFAR-fs and mini-ImageNet datasets, and we observe that our approach can improve model accuracy by up to 2% on standard meta-learning benchmarks while reducing model size by more than 75%.
    On Calibration of Graph Neural Networks for Node Classification. (arXiv:2206.01570v1 [cs.LG])
    Graphs can model real-world, complex systems by representing entities and their interactions in terms of nodes and edges. To better exploit the graph structure, graph neural networks have been developed, which learn entity and edge embeddings for tasks such as node classification and link prediction. These models achieve good performance with respect to accuracy, but the confidence scores associated with the predictions might not be calibrated. That means that the scores might not reflect the ground-truth probabilities of the predicted events, which would be especially important for safety-critical applications. Even though graph neural networks are used for a wide range of tasks, the calibration thereof has not been sufficiently explored yet. We investigate the calibration of graph neural networks for node classification, study the effect of existing post-processing calibration methods, and analyze the influence of model capacity, graph density, and a new loss function on calibration. Further, we propose a topology-aware calibration method that takes the neighboring nodes into account and yields improved calibration compared to baseline methods.
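    Among the existing post-processing methods referenced above, temperature scaling is the simplest: fit a single scalar $T$ on validation logits to minimize the negative log-likelihood, leaving predictions unchanged while rescaling confidences. A hedged sketch with toy logits standing in for GNN node-classification outputs:

        import torch

        def fit_temperature(logits, labels, max_iter=50):
            """Fit T > 0 minimizing NLL of softmax(logits / T) on validation data."""
            log_t = torch.zeros(1, requires_grad=True)   # optimize log T > 0
            opt = torch.optim.LBFGS([log_t], max_iter=max_iter)
            def closure():
                opt.zero_grad()
                loss = torch.nn.functional.cross_entropy(
                    logits / log_t.exp(), labels)
                loss.backward()
                return loss
            opt.step(closure)
            return log_t.exp().item()

        torch.manual_seed(0)
        labels = torch.randint(0, 7, (1000,))
        logits = 5.0 * torch.nn.functional.one_hot(labels, 7).float() \
                 + torch.randn(1000, 7)                  # toy validation logits
        print("fitted T:", round(fit_temperature(logits, labels), 2))

    Topology-aware calibration, as proposed here, goes beyond such global scaling by letting the correction depend on a node's neighborhood.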
    Excess risk analysis for epistemic uncertainty with application to variational inference. (arXiv:2206.01606v1 [stat.ML])
    We analyze the epistemic uncertainty (EU) of supervised learning in Bayesian inference by focusing on the excess risk. Existing analysis is limited to the Bayesian setting, which assumes a correct model and exact Bayesian posterior distribution. Thus we cannot apply the existing theory to modern Bayesian algorithms, such as variational inference. To address this, we present a novel EU analysis in the frequentist setting, where data is generated from an unknown distribution. We show a relation between the generalization ability and the widely used EU measurements, such as the variance and entropy of the predictive distribution. Then we show their convergence behaviors theoretically. Finally, we propose new variational inference that directly controls the prediction and EU evaluation performances based on the PAC-Bayesian theory. Numerical experiments show that our algorithm significantly improves the EU evaluation over the existing methods.
    PROMISSING: Pruning Missing Values in Neural Networks. (arXiv:2206.01640v1 [cs.LG])
    While data are the primary fuel for machine learning models, they often suffer from missing values, especially when collected in real-world scenarios. However, many off-the-shelf machine learning models, including artificial neural network models, are unable to handle these missing values directly. Therefore, extra data preprocessing and curation steps, such as data imputation, are inevitable before learning and prediction processes. In this study, we propose a simple and intuitive yet effective method for pruning missing values (PROMISSING) during the learning and inference steps of neural networks. In this method, there is no need to remove or impute the missing values; instead, the missing values are treated as a new source of information (representing what we do not know). Our experiments on simulated data, several classification and regression benchmarks, and a multi-modal clinical dataset show that PROMISSING results in prediction performance similar to various imputation techniques. In addition, our experiments show that models trained using PROMISSING become less decisive in their predictions when facing incomplete samples with many unknowns. This finding hopefully advances machine learning models from being pure predicting machines to more realistic thinkers that can also say "I do not know" when facing incomplete sources of information.
    Rate-Optimal Online Convex Optimization in Adaptive Linear Control. (arXiv:2206.01426v1 [cs.LG])
    We consider the problem of controlling an unknown linear dynamical system under adversarially changing convex costs and full feedback of both the state and cost function. We present the first computationally-efficient algorithm that attains an optimal $\smash{\sqrt{T}}$-regret rate compared to the best stabilizing linear controller in hindsight, while avoiding stringent assumptions on the costs such as strong convexity. Our approach is based on a careful design of non-convex lower confidence bounds for the online costs, and uses a novel technique for computationally-efficient regret minimization of these bounds that leverages their particular non-convex structure.
    Lossy Gradient Compression: How Much Accuracy Can One Bit Buy?. (arXiv:2202.02812v2 [cs.LG] UPDATED)
    In federated learning (FL), a global model is trained at a Parameter Server (PS) by aggregating model updates obtained from multiple remote learners. Generally, the communication between the remote users and the PS is rate-limited, while the transmission from the PS to the remote users is unconstrained. The FL setting gives rise to a distributed learning scenario in which the updates from the remote learners have to be compressed so as to meet communication rate constraints in the uplink transmission toward the PS. For this problem, one wishes to compress the model updates so as to minimize the loss in accuracy resulting from the compression error. In this paper, we take a rate-distortion approach to address the compressor design problem for the distributed training of deep neural networks (DNNs). In particular, we define a measure of the compression performance under communication-rate constraints -- the \emph{per-bit accuracy} -- which addresses the ultimate improvement in accuracy that a bit of communication brings to the centralized model. In order to maximize the per-bit accuracy, we model the DNN gradient updates at remote learners as a generalized normal distribution. Under this assumption on the DNN gradient distribution, we propose a class of distortion measures to aid the design of quantizers for the compression of the model updates. We argue that this family of distortion measures, which we refer to as the "$M$-magnitude weighted $L_2$" norm, captures the practitioner's intuition in the choice of gradient compressor. Numerical simulations are provided to validate the proposed approach on the CIFAR-10 dataset.
    Functional Connectivity Methods for EEG-based Biometrics on a Large, Heterogeneous Dataset. (arXiv:2206.01475v1 [eess.SP])
    This study examines the utility of functional connectivity (FC) and graph-based (GB) measures with a support vector machine classifier for use in electroencephalogram (EEG) based biometrics. Although FC-based features have been used in biometric applications, studies assessing the identification algorithms on heterogeneous and large datasets are scarce. This work investigates the performance of FC and GB metrics on a dataset of 184 subjects formed by pooling three datasets recorded under different protocols and acquisition systems. The results demonstrate the higher discriminatory power of FC than GB metrics. The identification accuracy increases with higher-frequency EEG bands, indicating the enhanced uniqueness of the neural signatures in the beta and gamma bands. Using all 56 EEG channels common to the three databases, the best identification accuracy of 97.4% is obtained using phase-locking value (PLV) based measures extracted from the gamma frequency band. Further, we investigate the effect of the length of the analysis epoch to determine the data acquisition time required to obtain satisfactory identification accuracy. When the number of channels is reduced from 56 to 21, there is a marginal reduction of only 2.4% in the identification accuracy using PLV features in the gamma band. Additional experiments have been conducted to study the effect of the cognitive state of the subject and mismatched train/test conditions on the performance of the system.
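    For readers unfamiliar with PLV: given two band-passed channels, it is the magnitude of the mean unit phasor of their instantaneous phase difference (0 = no locking, 1 = perfect locking). A minimal sketch with synthetic 40 Hz signals; the band-pass filtering that real pipelines apply first is omitted for brevity.

        import numpy as np
        from scipy.signal import hilbert

        def plv(x1, x2):
            """Phase-locking value between two (band-passed) signals."""
            phi1 = np.angle(hilbert(x1))
            phi2 = np.angle(hilbert(x2))
            return np.abs(np.exp(1j * (phi1 - phi2)).mean())

        t = np.arange(0, 4, 1 / 256)                    # 4 s at 256 Hz
        rng = np.random.default_rng(0)
        ch1 = np.sin(2 * np.pi * 40 * t) + 0.5 * rng.standard_normal(t.size)
        ch2 = np.sin(2 * np.pi * 40 * t + 0.8) + 0.5 * rng.standard_normal(t.size)
        print(round(plv(ch1, ch2), 3))   # high despite the constant phase offset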
    Prescriptive maintenance with causal machine learning. (arXiv:2206.01562v1 [econ.GN])
    Machine maintenance is a challenging operational problem, where the goal is to plan sufficient preventive maintenance to avoid machine failures and overhauls. Maintenance is often imperfect in reality and does not make the asset as good as new. Although a variety of imperfect maintenance policies have been proposed in the literature, these rely on strong assumptions regarding the effect of maintenance on the machine's condition, assuming the effect is (1) deterministic or governed by a known probability distribution, and (2) machine-independent. This work proposes to relax both assumptions by learning the effect of maintenance conditional on a machine's characteristics from observational data on similar machines using existing methodologies for causal inference. By predicting the maintenance effect, we can estimate the number of overhauls and failures for different levels of maintenance and, consequently, optimize the preventive maintenance frequency to minimize the total estimated cost. We validate our proposed approach using real-life data on more than 4,000 maintenance contracts from an industrial partner. Empirical results show that our novel, causal approach accurately predicts the maintenance effect and results in individualized maintenance schedules that are more accurate and cost-effective than supervised or non-individualized approaches.  ( 2 min )
    Algorithm for Constrained Markov Decision Process with Linear Convergence. (arXiv:2206.01666v1 [math.OC])
    The problem of constrained Markov decision process is considered. An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs (the number of constraints is relatively small). A new dual approach is proposed with the integration of two ingredients: entropy regularized policy optimizer and Vaidya's dual optimizer, both of which are critical to achieve faster convergence. The finite-time error bound of the proposed approach is provided. Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge (with linear rate) to the global optimum. The complexity expressed in terms of the optimality gap and the constraint violation significantly improves upon the existing primal-dual approaches.  ( 2 min )
    Measuring Gender Bias in Word Embeddings of Gendered Languages Requires Disentangling Grammatical Gender Signals. (arXiv:2206.01691v1 [cs.CY])
    Does the grammatical gender of a language interfere when measuring the semantic gender information captured by its word embeddings? A number of anomalous gender bias measurements in the embeddings of gendered languages suggest this possibility. We demonstrate that word embeddings learn the association between a noun and its grammatical gender in grammatically gendered languages, which can skew social gender bias measurements. Consequently, word embedding post-processing methods are introduced to quantify, disentangle, and evaluate grammatical gender signals. The evaluation is performed on five gendered languages from the Germanic, Romance, and Slavic branches of the Indo-European language family. Our method reduces the strength of grammatical gender signals, which is measured in terms of effect size (Cohen's d), by a significant average of d = 1.3 for French, German, and Italian, and d = 0.56 for Polish and Spanish. Once grammatical gender is disentangled, the association between over 90% of 10,000 inanimate nouns and their assigned grammatical gender weakens, and cross-lingual bias results from the Word Embedding Association Test (WEAT) become more congruent with country-level implicit bias measurements. The results further suggest that disentangling grammatical gender signals from word embeddings may lead to improvement in semantic machine learning tasks.  ( 2 min )
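    The WEAT effect size referred to here is Cohen's d computed over cosine-similarity associations (Caliskan et al., 2017). A minimal sketch with random vectors standing in for real word embeddings:

        import numpy as np

        def cosine(u, v):
            return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

        def weat_effect_size(X, Y, A, B):
            """Effect size d for target sets X, Y and attribute sets A, B."""
            s = lambda w: (np.mean([cosine(w, a) for a in A])
                           - np.mean([cosine(w, b) for b in B]))
            sx = np.array([s(x) for x in X])
            sy = np.array([s(y) for y in Y])
            return (sx.mean() - sy.mean()) / np.concatenate([sx, sy]).std(ddof=1)

        rng = np.random.default_rng(0)
        A, B = rng.normal(size=(8, 50)), rng.normal(size=(8, 50))  # attributes
        X = A + 0.1 * rng.normal(size=A.shape)   # targets associated with A
        Y = B + 0.1 * rng.normal(size=B.shape)   # targets associated with B
        print(round(weat_effect_size(X, Y, A, B), 2))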
    Joint Energy Dispatch and Unit Commitment in Microgrids Based on Deep Reinforcement Learning. (arXiv:2206.01663v1 [cs.LG])
    Nowadays, the application of microgrids (MG) with renewable energy is becoming more and more extensive, which creates a strong need for dynamic energy management. In this paper, deep reinforcement learning (DRL) is applied to learn an optimal policy for making joint energy dispatch (ED) and unit commitment (UC) decisions in an isolated MG, with the aim of reducing the total power generation cost while ensuring the supply-demand balance. In order to overcome the challenge of the discrete-continuous hybrid action space arising from joint ED and UC, we propose a DRL algorithm, the hybrid action finite-horizon DDPG (HAFH-DDPG), that seamlessly integrates two classical DRL algorithms, the deep Q-network (DQN) and deep deterministic policy gradient (DDPG), based on a finite-horizon dynamic programming (DP) framework. Moreover, a diesel generator (DG) selection strategy is presented to support a simplified action space and thereby reduce the computational complexity of the algorithm. Finally, the effectiveness of our proposed algorithm is verified through comparison with several baseline algorithms in experiments with a real-world data set.  ( 2 min )
    Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games. (arXiv:2206.01588v1 [cs.LG])
    We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions due to the varying opponent pose a significant challenge. In light of a recent hardness result \citep{liu2022learning}, we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, \underline{D}ecentralized \underline{O}ptimistic hype\underline{R}policy m\underline{I}rror de\underline{S}cent (DORIS), which achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a \textit{hyperpolicy} which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.  ( 2 min )
    OmniXAI: A Library for Explainable AI. (arXiv:2206.01612v1 [cs.LG])
    We introduce OmniXAI, an open-source Python library for eXplainable AI (XAI), which offers omni-way explainable AI capabilities and various interpretable machine learning techniques to address the pain points of understanding and interpreting the decisions made by machine learning (ML) models in practice. OmniXAI aims to be a one-stop comprehensive library that makes explainable AI easy for data scientists, ML researchers and practitioners who need explanations for various types of data, models and explanation methods at different stages of the ML process (data exploration, feature engineering, model development, evaluation, decision-making, etc.). In particular, our library includes a rich family of explanation methods integrated in a unified interface, which supports multiple data types (tabular data, images, texts, time-series), multiple types of ML models (traditional ML in Scikit-learn and deep learning models in PyTorch/TensorFlow), and a range of diverse explanation methods, including both "model-specific" and "model-agnostic" ones (such as feature-attribution explanation, counterfactual explanation, gradient-based explanation, etc.). For practitioners, the library provides an easy-to-use unified interface to generate explanations for their applications by writing only a few lines of code, as well as a GUI dashboard for visualizing different explanations for more insight into decisions. In this technical report, we present OmniXAI's design principles, system architectures, and major functionalities, and also demonstrate several example use cases across different types of data, tasks, and models.  ( 2 min )
    MCD: Marginal Contrastive Discrimination for conditional density estimation. (arXiv:2206.01592v1 [stat.ML])
    We consider the problem of conditional density estimation, a major topic of interest in statistics and machine learning. Our method, called Marginal Contrastive Discrimination (MCD), reformulates the conditional density function as a product of two factors: the marginal density function of the target variable and a ratio of density functions that can be estimated through binary classification. Like noise-contrastive methods, MCD can leverage state-of-the-art supervised learning techniques, including neural networks, to perform conditional density estimation. Our benchmark reveals that our method significantly outperforms existing methods in practice on most density models and regression datasets.  ( 2 min )
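    The classification trick behind such ratio estimators can be sketched in a few lines. In the toy version below (the data, the classifier, and the balanced-class design are illustrative choices, not the authors' exact MCD procedure), a binary classifier distinguishes true joint samples (x, y) from pairs where y is drawn from its marginal, and the density ratio is read off the predicted odds:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)

        # Toy data: y depends on x, so p(x, y) != p(x) p(y).
        n = 5000
        x = rng.normal(size=(n, 1))
        y = 2.0 * x + 0.5 * rng.normal(size=(n, 1))

        # Positive class: true joint samples (x_i, y_i).
        joint = np.hstack([x, y])
        # Negative class: x paired with a shuffled y, i.e. y ~ marginal p(y).
        marginal = np.hstack([x, rng.permutation(y)])

        features = np.vstack([joint, marginal])
        labels = np.concatenate([np.ones(n), np.zeros(n)])

        # A richer classifier would capture the ratio more faithfully; logistic
        # regression keeps the sketch short.
        clf = LogisticRegression().fit(features, labels)

        # With balanced classes, p/(1-p) estimates p(x, y) / (p(x) p(y)).
        probs = clf.predict_proba(joint)[:, 1]
        ratio = probs / (1.0 - probs)
        print("mean estimated ratio on joint samples:", ratio.mean())

    Multiplying this ratio by an estimate of the marginal density of y then yields the conditional density, which is the factorization the abstract describes.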
    A Comparative Study on Energy Consumption Models for Drones. (arXiv:2206.01609v1 [cs.RO])
    Creating an appropriate energy consumption prediction model is becoming an important topic for drone-related research in the literature. However, a general consensus on the energy consumption model has yet to be reached. As a result, many model variations exist, ranging in complexity and focusing on different aspects. In this paper, we benchmark the five most popular energy consumption models for drones, each derived from the drone's physical behaviour, and point to the difficulties of matching them to a realistic energy dataset collected from a delivery drone in flight under different testing conditions. Moreover, we propose a novel data-driven energy model using a Long Short-Term Memory (LSTM) based deep learning architecture and compare its accuracy on the same dataset. Our experimental results show that the LSTM-based approach easily outperforms the other mathematical models on the dataset under study. Finally, a sensitivity analysis is carried out to interpret the model.  ( 2 min )
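    As a rough illustration of the data-driven alternative, the following PyTorch sketch maps a window of flight telemetry to a power prediction. Layer sizes, window length, and the four telemetry channels are assumptions for illustration, not the paper's configuration:

        import torch
        import torch.nn as nn

        class EnergyLSTM(nn.Module):
            """Map a window of flight telemetry to instantaneous power draw."""
            def __init__(self, n_features=4, hidden=64):
                super().__init__()
                self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)

            def forward(self, x):             # x: (batch, time, n_features)
                out, _ = self.lstm(x)
                return self.head(out[:, -1])  # predict power at the last step

        model = EnergyLSTM()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()

        # Dummy batch: 32 windows of 50 time steps with 4 telemetry channels.
        x = torch.randn(32, 50, 4)
        target = torch.randn(32, 1)
        opt.zero_grad()
        loss = loss_fn(model(x), target)
        loss.backward()
        opt.step()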
    Pruning for Interpretable, Feature-Preserving Circuits in CNNs. (arXiv:2206.01627v1 [cs.CV])
    Deep convolutional neural networks are a powerful model class for a range of computer vision problems, but it is difficult to interpret the image filtering process they implement, given their sheer size. In this work, we introduce a method for extracting 'feature-preserving circuits' from deep CNNs, leveraging methods from saliency-based neural network pruning. These circuits are modular sub-functions, embedded within the network, containing only the subset of convolutional kernels relevant to a target feature. We compare the efficacy of three saliency criteria for extracting these sparse circuits. Further, we show how 'sub-feature' circuits can be extracted that preserve a feature's responses to particular images, dividing the feature into even sparser filtering processes. We also develop a tool for visualizing 'circuit diagrams', which renders the entire image filtering process implemented by a circuit in a parsable format.  ( 2 min )
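    The simplest saliency criterion compatible with this recipe is kernel weight magnitude. A hedged sketch follows; this toy version scores and keeps the top kernels of a single layer, rather than tracing a full feature-preserving circuit through the network:

        import torch
        import torchvision.models as models

        model = models.vgg16(weights=None)  # or a pretrained checkpoint
        conv = model.features[0]            # first conv, weight: (out_c, in_c, k, k)

        # L1-magnitude saliency per output kernel.
        saliency = conv.weight.detach().abs().sum(dim=(1, 2, 3))
        keep = torch.topk(saliency, k=16).indices  # retain the 16 most salient kernels

        # Zero out the rest, leaving a sparse sub-function in place.
        mask = torch.zeros(conv.out_channels, dtype=torch.bool)
        mask[keep] = True
        with torch.no_grad():
            conv.weight[~mask] = 0.0
            if conv.bias is not None:
                conv.bias[~mask] = 0.0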
    Truly Mesh-free Physics-Informed Neural Networks. (arXiv:2206.01545v1 [cs.LG])
    Physics-informed Neural Networks (PINNs) have recently emerged as a principled way to include prior physical knowledge, in the form of partial differential equations (PDEs), in neural networks. Although generally viewed as mesh-free, current approaches still rely on collocation points obtained within a bounded region, even in settings with spatially sparse signals. Furthermore, if the boundaries are not known, the selection of such a region may be arbitrary, resulting in a large proportion of collocation points being selected in areas of low relevance. To resolve this, we present a mesh-free and adaptive approach termed particle-density PINN (pdPINN), which is inspired by the microscopic viewpoint of fluid dynamics. Instead of sampling from a bounded region, we propose to sample directly from the distribution over the (fluid) particle positions, eliminating the need to introduce boundaries while adaptively focusing on the most relevant regions. This is achieved by reformulating the modeled fluid density as an unnormalized probability distribution, from which we sample with dynamic Monte Carlo methods. We further generalize pdPINNs to settings in which any positive scalar quantity can be interpreted as a particle density, such as the evolution of temperature in the heat equation. The utility of our approach is demonstrated in experiments modeling (non-steady) compressible fluids in up to three dimensions and a two-dimensional diffusion problem, illustrating the high flexibility and sample efficiency compared to existing refinement methods for PINNs.  ( 2 min )
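    The sampling idea can be illustrated in isolation: treat a positive scalar field as an unnormalized density and draw collocation points with a random-walk Metropolis sampler. In this toy sketch the density is a fixed stand-in for the learned model output, and the step size and dimension are arbitrary choices:

        import numpy as np

        def density(x):
            """Stand-in for the model's positive output, e.g. predicted fluid density."""
            return (np.exp(-0.5 * np.sum((x - 1.0) ** 2))
                    + np.exp(-0.5 * np.sum((x + 1.0) ** 2)))

        def metropolis(n_samples, dim=2, step=0.5, seed=0):
            rng = np.random.default_rng(seed)
            x = np.zeros(dim)
            samples = []
            for _ in range(n_samples):
                proposal = x + step * rng.normal(size=dim)
                # Accept with prob. min(1, p(proposal)/p(x)); no normalizer needed.
                if rng.random() < density(proposal) / density(x):
                    x = proposal
                samples.append(x.copy())
            return np.array(samples)

        collocation_points = metropolis(2000)
        print(collocation_points.std(axis=0))  # mass concentrates near the two modes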
    Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules. (arXiv:2206.01649v1 [cs.LG])
    Neural ordinary differential equations (ODEs) have attracted much attention as continuous-time counterparts of deep residual neural networks (NNs), and numerous extensions for recurrent NNs have been proposed. Since the 1980s, ODEs have also been used to derive theoretical results for NN learning rules, e.g., the famous connection between Oja's rule and principal component analysis. Such rules are typically expressed as additive iterative update processes which have straightforward ODE counterparts. Here we introduce a novel combination of learning rules and Neural ODEs to build continuous-time sequence processing nets that learn to manipulate short-term memory in rapidly changing synaptic connections of other nets. This yields continuous-time counterparts of Fast Weight Programmers and linear Transformers. Our novel models outperform the best existing Neural Controlled Differential Equation based models on various time series classification tasks, while also addressing their scalability limitations. Our code is public.  ( 2 min )
    Beyond Tabula Rasa: Reincarnating Reinforcement Learning. (arXiv:2206.01626v1 [cs.LG])
    Learning tabula rasa, that is without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems, when applied to large-scale settings, rarely operate tabula rasa. Such large-scale systems undergo multiple design or algorithmic changes during their development cycle and use ad hoc approaches for incorporating these changes without re-training from scratch, which would have been prohibitively expensive. Additionally, the inefficiency of deep RL typically excludes researchers without access to industrial-scale resources from tackling computationally-demanding problems. To address these issues, we present reincarnating RL as an alternative workflow, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another. As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Overall, this work argues for an alternative approach to RL research, which we believe could significantly improve real-world RL adoption and help democratize it further.  ( 2 min )
    Beyond Opinion Mining: Summarizing Opinions of Customer Reviews. (arXiv:2206.01543v1 [cs.CL])
    Customer reviews are vital for making purchasing decisions in the Information Age. Such reviews can be automatically summarized to provide the user with an overview of opinions. In this tutorial, we present various aspects of opinion summarization that are useful for researchers and practitioners. First, we will introduce the task and its major challenges. Then, we will present existing opinion summarization solutions, both pre-neural and neural. We will discuss how summarizers can be trained in the unsupervised, few-shot, and supervised regimes. Each regime has roots in different machine learning methods, such as auto-encoding, controllable text generation, and variational inference. Finally, we will discuss resources and evaluation methods and conclude with future directions. This three-hour tutorial will provide a comprehensive overview of major advances in opinion summarization. Listeners will be well-equipped with knowledge that is useful for both research and practical applications.  ( 2 min )
    RACA: Relation-Aware Credit Assignment for Ad-Hoc Cooperation in Multi-Agent Deep Reinforcement Learning. (arXiv:2206.01207v1 [cs.LG])
    In recent years, reinforcement learning has faced several challenges in the multi-agent domain, such as the credit assignment issue. Value function factorization has emerged as a promising way to handle credit assignment under the centralized training with decentralized execution (CTDE) paradigm. However, existing value function factorization methods cannot deal with ad-hoc cooperation, that is, adapting to new configurations of teammates at test time. Specifically, these methods do not explicitly utilize the relationship between agents and cannot adapt to different sizes of inputs. To address these limitations, we propose a novel method, called Relation-Aware Credit Assignment (RACA), which achieves zero-shot generalization in ad-hoc cooperation scenarios. RACA takes advantage of a graph-based relation encoder to encode the topological structure between agents. Furthermore, RACA utilizes an attention-based observation abstraction mechanism that can generalize to an arbitrary number of teammates with a fixed number of parameters. Experiments demonstrate that our method outperforms baseline methods on the StarCraft II micromanagement benchmark and in ad-hoc cooperation scenarios.  ( 2 min )
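    The attention-based abstraction is the part that is easy to sketch: a learned query attends over however many teammates are present, producing a fixed-size summary with a fixed number of parameters. The dimensions and the single-query design below are illustrative assumptions, not RACA's exact architecture:

        import torch
        import torch.nn as nn

        class ObsAbstraction(nn.Module):
            """Pool a variable number of teammate observations into a fixed vector."""
            def __init__(self, obs_dim=16, embed_dim=32):
                super().__init__()
                self.encode = nn.Linear(obs_dim, embed_dim)
                self.query = nn.Parameter(torch.randn(1, 1, embed_dim))
                self.attn = nn.MultiheadAttention(embed_dim, num_heads=4,
                                                  batch_first=True)

            def forward(self, teammates):       # (batch, n_teammates, obs_dim)
                kv = self.encode(teammates)     # parameter count independent of n
                q = self.query.expand(teammates.shape[0], -1, -1)
                pooled, _ = self.attn(q, kv, kv)
                return pooled.squeeze(1)        # (batch, embed_dim)

        pool = ObsAbstraction()
        print(pool(torch.randn(8, 3, 16)).shape)   # works for 3 teammates...
        print(pool(torch.randn(8, 7, 16)).shape)   # ...and unchanged for 7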
    Accelerated first-order methods for convex optimization with locally Lipschitz continuous gradient. (arXiv:2206.01209v1 [math.OC])
    In this paper we develop accelerated first-order methods for convex optimization with locally Lipschitz continuous gradient (LLCG), which is beyond the well-studied class of convex optimization with Lipschitz continuous gradient. In particular, we first consider unconstrained convex optimization with LLCG and propose accelerated proximal gradient (APG) methods for solving it. The proposed APG methods are equipped with a verifiable termination criterion and enjoy an operation complexity of ${\cal O}(\varepsilon^{-1/2}\log \varepsilon^{-1})$ and ${\cal O}(\log \varepsilon^{-1})$ for finding an $\varepsilon$-residual solution of an unconstrained convex and strongly convex optimization problem, respectively. We then consider constrained convex optimization with LLCG and propose a first-order proximal augmented Lagrangian method for solving it by applying one of our proposed APG methods to approximately solve a sequence of proximal augmented Lagrangian subproblems. The resulting method is equipped with a verifiable termination criterion and enjoys an operation complexity of ${\cal O}(\varepsilon^{-1}\log \varepsilon^{-1})$ and ${\cal O}(\varepsilon^{-1/2}\log \varepsilon^{-1})$ for finding an $\varepsilon$-KKT solution of a constrained convex and strongly convex optimization problem, respectively. All the proposed methods in this paper are parameter-free or almost parameter-free, except that knowledge of the convexity parameter is required. To the best of our knowledge, no prior studies were conducted to investigate accelerated first-order methods with complexity guarantees for convex optimization with LLCG. All the complexity results obtained in this paper are entirely new.  ( 2 min )
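    Outside the paper's specific methods, the standard way to cope with a merely locally Lipschitz gradient is a backtracking line search, which adapts the step size to local curvature instead of requiring a global constant. A minimal, non-accelerated, prox-free sketch on a toy objective whose gradient is not globally Lipschitz (the paper's APG methods add momentum and a proximal step on top of this idea):

        import numpy as np

        def f(x):          # toy objective; grad is locally but not globally Lipschitz
            return 0.25 * np.sum(x ** 4)

        def grad_f(x):
            return x ** 3

        def gd_backtracking(x0, n_iter=500, t0=1.0, shrink=0.5):
            x = x0.astype(float)
            for _ in range(n_iter):
                g, t = grad_f(x), t0
                # Backtrack until the sufficient-decrease condition holds, which
                # certifies the step size against the *local* Lipschitz constant.
                while f(x - t * g) > f(x) - 0.5 * t * (g @ g):
                    t *= shrink
                x = x - t * g
            return x

        print(gd_backtracking(np.array([3.0, -2.0])))  # approaches the minimizer 0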
    Positive Unlabeled Contrastive Learning. (arXiv:2206.01206v1 [cs.LG])
    Self-supervised pretraining on unlabeled data followed by supervised finetuning on labeled data is a popular paradigm for learning from limited labeled examples. In this paper, we investigate and extend this paradigm to the classical positive unlabeled (PU) setting - the weakly supervised task of learning a binary classifier using only a few labeled positive examples and a set of unlabeled samples. We propose a novel PU learning objective, positive unlabeled Noise Contrastive Estimation (puNCE), that leverages the available explicit (from labeled samples) and implicit (from unlabeled samples) supervision to learn useful representations from positive unlabeled input data. The underlying idea is to assign each training sample an individual weight: labeled positives are given unit weight; unlabeled samples are duplicated, one copy labeled positive and the other negative, with weights $\pi$ and $(1-\pi)$, where $\pi$ denotes the class prior. Extensive experiments across vision and natural language tasks reveal that puNCE consistently improves over existing unsupervised and supervised contrastive baselines under limited supervision.  ( 2 min )
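    The weighting scheme stated in the abstract is concrete enough to sketch directly; the batch-construction details beyond the abstract are assumptions:

        import numpy as np

        def punce_weights(labels, class_prior):
            """Expand a PU batch into weighted pseudo-labeled samples.

            labels: 1 for labeled positives, 0 for unlabeled samples.
            Returns (indices into the batch, pseudo-labels, per-sample weights).
            """
            idx, pseudo, w = [], [], []
            for i, l in enumerate(labels):
                if l == 1:                      # labeled positive: unit weight
                    idx.append(i); pseudo.append(1); w.append(1.0)
                else:                           # unlabeled: duplicate both ways
                    idx.append(i); pseudo.append(1); w.append(class_prior)
                    idx.append(i); pseudo.append(0); w.append(1.0 - class_prior)
            return np.array(idx), np.array(pseudo), np.array(w)

        idx, pseudo, w = punce_weights([1, 0, 0], class_prior=0.4)
        print(list(zip(idx, pseudo, w)))
        # [(0, 1, 1.0), (1, 1, 0.4), (1, 0, 0.6), (2, 1, 0.4), (2, 0, 0.6)]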
    Is an encoder within reach? (arXiv:2206.01552v1 [cs.LG])
    The encoder network of an autoencoder is an approximation of the nearest point projection onto the manifold spanned by the decoder. A concern with this approximation is that, while the output of the encoder is always unique, the projection can possibly have infinitely many values. This implies that the latent representations learned by the autoencoder can be misleading. Borrowing from geometric measure theory, we introduce the idea of using the reach of the manifold spanned by the decoder to determine if an optimal encoder exists for a given dataset and decoder. We develop a local generalization of this reach and propose a numerical estimator thereof. We demonstrate that this allows us to determine which observations can be expected to have a unique, and thereby trustworthy, latent representation. As our local reach estimator is differentiable, we investigate its usage as a regularizer and show that this leads to learned manifolds for which projections are more often unique than without regularization.  ( 2 min )
    Detecting the Severity of Major Depressive Disorder from Speech: A Novel HARD-Training Methodology. (arXiv:2206.01542v1 [cs.SD])
    Major Depressive Disorder (MDD) is a common worldwide mental health issue with high associated socioeconomic costs. The prediction and automatic detection of MDD can, therefore, have a huge impact on society. Speech, as a non-invasive and easy-to-collect signal, is a promising marker to aid the diagnosis and assessment of MDD. In this regard, speech samples were collected as part of the Remote Assessment of Disease and Relapse in Major Depressive Disorder (RADAR-MDD) research programme. RADAR-MDD was an observational cohort study in which speech and other digital biomarkers were collected from a cohort of individuals with a history of MDD in Spain, the United Kingdom and the Netherlands. In this paper, the RADAR-MDD speech corpus is taken as an experimental framework to test the efficacy of a Sequence-to-Sequence model with a local attention mechanism in a two-class depression severity classification paradigm. Additionally, a novel training method, HARD-Training, is proposed. It is a methodology based on selecting more ambiguous samples for model training, inspired by the curriculum learning paradigm. HARD-Training was found to consistently improve the performance of our classifiers, with an average increment of 8.6%, for both speech elicitation tasks used and for each collection site of the RADAR-MDD speech corpus. With this novel methodology, our Sequence-to-Sequence model was able to effectively detect MDD severity regardless of language. Finally, recognising the need for greater awareness of potential algorithmic bias, we conduct an additional analysis of our results separately for each gender.  ( 2 min )
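    The sample-selection step at the heart of HARD-Training can be sketched generically: rank a batch by an ambiguity score and keep the hardest fraction. Predictive entropy below is a stand-in for the paper's ambiguity criterion, and the fraction kept is an arbitrary choice:

        import torch

        def hard_training_subset(logits, frac=0.5):
            """Indices of the most ambiguous samples, ranked by predictive entropy."""
            probs = torch.softmax(logits, dim=1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            k = max(1, int(frac * len(logits)))
            return torch.topk(entropy, k).indices

        logits = torch.randn(16, 2)              # two-class severity logits
        print(hard_training_subset(logits))      # hardest half of the batch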
    A High-Performance Customer Churn Prediction System based on Self-Attention. (arXiv:2206.01523v1 [cs.LG])
    Customer churn prediction is a challenging domain of research that contributes to customer retention strategy. The predictive performance of existing machine learning models, which are often adopted by churn communities, appears to be at a bottleneck, partly due to the models' poor feature extraction capability. Therefore, a novel algorithm, a hybrid neural network with self-attention enhancement (HNNSAE), is proposed in this paper to improve the efficiency of feature screening and feature extraction, and consequently the model's predictive performance. The model consists of three main blocks. The first block is the entity embedding layer, which processes the categorical variables transformed into 0-1 codes. The second block is the feature extractor, which extracts significant features through a multi-head self-attention mechanism; to improve the feature extraction effect, we stack a residual-connection neural network on the multi-head self-attention modules. The third block is the classifier, a three-layer multilayer perceptron. This work conducts experiments on a publicly available dataset of commercial bank customers. The results demonstrate that HNNSAE significantly outperforms the other Individual Machine Learning (IML), Ensemble Machine Learning (EML), and Deep Learning (DL) methods tested in this paper. Furthermore, we compare the performance of the proposed feature extractor with that of three other feature extractors and find that the method proposed in this paper significantly outperforms them. In addition, four hypotheses about model prediction performance and overfitting risk are tested on the publicly available dataset.  ( 2 min )
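    A hedged PyTorch sketch of the three-block layout described follows; all sizes and the single attention layer are illustrative assumptions, not the paper's hyperparameters:

        import torch
        import torch.nn as nn

        class ChurnNet(nn.Module):
            def __init__(self, n_cats, cat_cards, embed_dim=16, heads=4):
                super().__init__()
                # Block 1: one embedding table per categorical variable.
                self.embeds = nn.ModuleList(nn.Embedding(c, embed_dim)
                                            for c in cat_cards)
                # Block 2: multi-head self-attention with a residual connection.
                self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
                self.norm = nn.LayerNorm(embed_dim)
                # Block 3: three-layer MLP classifier.
                self.mlp = nn.Sequential(
                    nn.Linear(n_cats * embed_dim, 64), nn.ReLU(),
                    nn.Linear(64, 32), nn.ReLU(),
                    nn.Linear(32, 1),
                )

            def forward(self, cats):            # cats: (batch, n_cats) int codes
                x = torch.stack([e(cats[:, i]) for i, e in enumerate(self.embeds)],
                                dim=1)
                attn_out, _ = self.attn(x, x, x)
                x = self.norm(x + attn_out)     # residual around attention
                return self.mlp(x.flatten(1))   # churn logit

        net = ChurnNet(n_cats=3, cat_cards=[5, 10, 4])
        print(net(torch.randint(0, 4, (8, 3))).shape)   # torch.Size([8, 1])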
    Rethinking and Scaling Up Graph Contrastive Learning: An Extremely Efficient Approach with Group Discrimination. (arXiv:2206.01535v1 [cs.LG])
    Graph contrastive learning (GCL) alleviates the heavy reliance on label information for graph representation learning (GRL) via self-supervised learning schemes. The core idea is to learn by maximising mutual information for similar instances, which requires similarity computation between two node instances. However, this operation can be computationally expensive. For example, the time complexity of two commonly adopted contrastive loss functions (i.e., InfoNCE and the JSD estimator) for a node is $O(ND)$ and $O(D)$, respectively, where $N$ is the number of nodes and $D$ is the embedding dimension. Additionally, GCL normally requires a large number of training epochs to be well-trained on large-scale datasets. Inspired by an observation of a technical defect (i.e., inappropriate usage of the Sigmoid function) in two representative GCL works, DGI and MVGRL, we revisit GCL and introduce a new learning paradigm for self-supervised GRL, namely Group Discrimination (GD), and propose a novel GD-based method called Graph Group Discrimination (GGD). Instead of similarity computation, GGD directly discriminates two groups of summarised node instances with a simple binary cross-entropy loss. As such, GGD only requires $O(1)$ for the loss computation of a node. In addition, GGD requires far fewer training epochs to obtain competitive performance on large-scale datasets. These two advantages make GGD highly efficient. Extensive experiments show that GGD outperforms state-of-the-art self-supervised methods on 8 datasets. In particular, GGD can be trained in 0.18 seconds (6.44 seconds including data preprocessing) on ogbn-arxiv, which is orders of magnitude (10,000+ times) faster than GCL baselines, while consuming much less memory. Trained for 9 hours on ogbn-papers100M, which has billions of edges, GGD outperforms its GCL counterparts in both accuracy and efficiency.  ( 2 min )
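    The claimed O(1)-per-node loss is easy to state: summarize each node embedding to one logit and classify which group (real or corrupted graph) it came from with binary cross-entropy. A simplified sketch, with the summarizer and the corruption reduced to stand-ins:

        import torch
        import torch.nn.functional as F

        def group_discrimination_loss(h_real, h_corrupt):
            """h_real, h_corrupt: (n_nodes, dim) node embeddings for the two groups."""
            # Summarize each node to a single logit (here: a plain sum over dims).
            logits = torch.cat([h_real.sum(dim=1), h_corrupt.sum(dim=1)])
            labels = torch.cat([torch.ones(len(h_real)),
                                torch.zeros(len(h_corrupt))])
            # One BCE term per node: O(1) per node, no pairwise similarities.
            return F.binary_cross_entropy_with_logits(logits, labels)

        h_real = torch.randn(100, 64, requires_grad=True)
        h_corrupt = torch.randn(100, 64)   # stand-in; in practice the corruption
        print(group_discrimination_loss(h_real, h_corrupt))  # shuffles node features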
    Understanding deep learning via decision boundary. (arXiv:2206.01515v1 [cs.LG])
    This paper shows that neural networks with lower decision boundary (DB) variability have better generalizability. Two new notions, algorithm DB variability and $(\epsilon, \eta)$-data DB variability, are proposed to measure decision boundary variability from the algorithm and data perspectives. Extensive experiments show significant negative correlations between decision boundary variability and generalizability. From the theoretical view, two lower bounds based on algorithm DB variability are proposed; they do not explicitly depend on the sample size. We also prove an upper bound of order $\mathcal{O}\left(\frac{1}{\sqrt{m}}+\epsilon+\eta\log\frac{1}{\eta}\right)$ based on data DB variability. The bound is convenient to estimate without requiring labels, and does not explicitly depend on the network size, which is usually prohibitively large in deep learning.  ( 2 min )
    Latent Topology Induction for Understanding Contextualized Representations. (arXiv:2206.01512v1 [cs.CL])
    In this work, we study the representation space of contextualized embeddings and gain insight into the hidden topology of large language models. We show there exists a network of latent states that summarize linguistic properties of contextualized representations. Instead of seeking alignments to existing well-defined annotations, we infer this latent network in a fully unsupervised way using a structured variational autoencoder. The induced states not only serve as anchors that mark the topology (neighbors and connectivity) of the representation manifold but also reveal the internal mechanism of encoding sentences. With the induced network, we: (1) decompose the representation space into a spectrum of latent states which encode fine-grained word meanings with lexical, morphological, syntactic and semantic information; (2) show that state-state transitions encode rich phrase constructions and serve as the backbones of the latent space. Putting the two together, we show that sentences are represented as a traversal over the latent network where state-state transition chains encode syntactic templates and state-word emissions fill in the content. We demonstrate these insights with extensive experiments and visualizations.  ( 2 min )
    Canonical convolutional neural networks. (arXiv:2206.01509v1 [cs.LG])
    We introduce canonical weight normalization for convolutional neural networks. Inspired by the canonical tensor decomposition, we express the weight tensors in so-called canonical networks as scaled sums of outer vector products. In particular, we train network weights in the decomposed form, where scale weights are optimized separately for each mode. Additionally, similarly to weight normalization, we include a global scaling parameter. We study the initialization of the canonical form by running the power method and by drawing randomly from Gaussian or uniform distributions. Our results indicate that we can replace the power method with cheaper initializations drawn from standard distributions. The canonical re-parametrization leads to competitive normalization performance on the MNIST, CIFAR10, and SVHN data sets. Moreover, the formulation simplifies network compression. Once training has converged, the canonical form allows convenient model-compression by truncating the parameter sums.  ( 2 min )
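    The decomposed parameterization can be sketched directly: a convolution whose weight tensor is a scaled sum of rank-one (outer-product) terms, trained in decomposed form. The rank, kernel size, and initialization below are illustrative assumptions, and the paper's per-mode scale weights and global scaling are collapsed into a single per-term scale here:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class CanonicalConv2d(nn.Module):
            """Conv whose weight is a scaled sum of R rank-one terms over the
            output, input, and two spatial modes."""
            def __init__(self, in_c, out_c, k=3, rank=8):
                super().__init__()
                self.a = nn.Parameter(torch.randn(rank, out_c) * 0.1)
                self.b = nn.Parameter(torch.randn(rank, in_c) * 0.1)
                self.c = nn.Parameter(torch.randn(rank, k) * 0.1)
                self.d = nn.Parameter(torch.randn(rank, k) * 0.1)
                self.scale = nn.Parameter(torch.ones(rank))  # per-term scale

            def forward(self, x):
                w = torch.einsum("r,ro,ri,rh,rw->oihw",
                                 self.scale, self.a, self.b, self.c, self.d)
                return F.conv2d(x, w, padding=1)

        layer = CanonicalConv2d(3, 16)
        print(layer(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 16, 32, 32])

    Compression by truncation then amounts to keeping only the terms with the largest scales.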
    Can Hybrid Geometric Scattering Networks Help Solve the Maximal Clique Problem?. (arXiv:2206.01506v1 [cs.LG])
    We propose a geometric scattering-based graph neural network (GNN) for approximating solutions of the NP-hard maximal clique (MC) problem. We construct a loss function with two terms, one which encourages the network to find a large set of nodes and the other which acts as a surrogate for the constraint that the nodes form a clique. We then use this loss to train a novel GNN architecture that outputs a vector representing the probability for each node to be part of the MC and apply a rule-based decoder to make our final prediction. The incorporation of the scattering transform alleviates the so-called oversmoothing problem that is often encountered in GNNs and would degrade the performance of our proposed setup. Our empirical results demonstrate that our method outperforms representative GNN baselines in terms of solution accuracy and inference speed as well as conventional solvers like GUROBI with limited time budgets.  ( 2 min )
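    One standard two-term surrogate for this kind of loss rewards total probability mass while penalizing mass placed on non-adjacent pairs; the paper's exact loss, GNN architecture, and decoder may differ. A toy sketch that optimizes node probabilities directly on a small graph:

        import torch

        def clique_loss(p, adj, beta=2.0):
            """p: (n,) node probabilities; adj: (n, n) 0/1 adjacency, zero diagonal."""
            size_term = -p.sum()                        # favor large node sets
            non_edges = (1 - adj) - torch.eye(len(p))   # pairs violating cliqueness
            violation = (p[:, None] * p[None, :] * non_edges).sum() / 2
            return size_term + beta * violation         # surrogate clique constraint

        # Toy graph: a triangle {0, 1, 2} plus a pendant node 3 attached to 2.
        adj = torch.tensor([[0, 1, 1, 0],
                            [1, 0, 1, 0],
                            [1, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=torch.float32)
        logits = torch.zeros(4, requires_grad=True)
        opt = torch.optim.Adam([logits], lr=0.1)
        for _ in range(300):
            opt.zero_grad()
            loss = clique_loss(torch.sigmoid(logits), adj)
            loss.backward()
            opt.step()
        print(torch.sigmoid(logits).round())  # mass settles on the triangle {0, 1, 2}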
    Transferring Studies Across Embodiments: A Case Study in Confusion Detection. (arXiv:2206.01493v1 [cs.HC])
    Human-robot studies are expensive to conduct and difficult to control, and as such researchers sometimes turn to human-avatar interaction in the hope of faster and cheaper data collection that can be transferred to the robot domain. In our work, we are particularly interested in the challenge of detecting and modelling user confusion in interaction, and as part of this research programme, we conducted situated dialogue studies to investigate users' reactions to confusing scenarios presented in both physical and virtual environments. In this paper, we present a combined review of these studies and the results we observed across the two embodiments. For the physical embodiment, we used a Pepper robot, while for the virtual modality, we used a 3D avatar. Our study shows that despite attitudinal differences and technical control limitations, a number of similarities were detected in user behaviour and self-reporting results across the embodiment options. This work suggests that, while avatar interaction is no true substitute for robot interaction studies, sufficient care in study design may allow well-executed human-avatar studies to supplement more challenging human-robot studies.  ( 2 min )
    Causality Learning With Wasserstein Generative Adversarial Networks. (arXiv:2206.01496v1 [cs.LG])
    Conventional methods for causal structure learning from data face significant challenges due to the combinatorial search space. Recently, the problem has been formulated into a continuous optimization framework with an acyclicity constraint to learn Directed Acyclic Graphs (DAGs). Such a framework allows the utilization of deep generative models for causal structure learning to better capture the relations between data sample distributions and DAGs. However, so far no study has experimented with the use of the Wasserstein distance in the context of causal structure learning. Our model, named DAG-WGAN, combines a Wasserstein-based adversarial loss with an acyclicity constraint in an auto-encoder architecture. It simultaneously learns causal structures while improving its data generation capability. We compare the performance of DAG-WGAN with other models that do not involve the Wasserstein metric in order to identify its contribution to causal structure learning. According to our experiments, our model performs better on high-cardinality data.  ( 2 min )
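    The acyclicity constraint shared by this line of continuous DAG learners is compact enough to show on its own. A sketch of the widely used trace-of-matrix-exponential characterization (from the NOTEARS line of work), which vanishes exactly when the weighted adjacency matrix encodes a DAG:

        import torch

        def acyclicity(W):
            """h(W) = tr(exp(W * W)) - d; h(W) = 0 iff W encodes a DAG."""
            d = W.shape[0]
            return torch.trace(torch.matrix_exp(W * W)) - d

        dag = torch.tensor([[0.0, 1.0], [0.0, 0.0]])     # edge 0 -> 1 only
        cyclic = torch.tensor([[0.0, 1.0], [1.0, 0.0]])  # 0 -> 1 -> 0
        print(acyclicity(dag))     # ~0
        print(acyclicity(cyclic))  # > 0, penalized during training

    In practice this term is added to the reconstruction or adversarial loss, e.g. via an augmented Lagrangian, so that gradient-based training steers the learned graph toward acyclicity.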
    Finding Rule-Interpretable Non-Negative Data Representation. (arXiv:2206.01483v1 [cs.LG])
    Non-negative Matrix Factorization (NMF) is an intensively used technique for obtaining parts-based, lower-dimensional and non-negative representations of non-negative data. It is a popular method in different research fields. Scientists performing research in the fields of biology, medicine and pharmacy often prefer NMF over other dimensionality reduction approaches (such as PCA) because the non-negativity of the approach naturally fits the characteristics of the domain problem and its results are easier to analyze and understand. Despite these advantages, it can still be hard to obtain an exact characterization and interpretation of NMF's resulting latent factors due to their numerical nature. Rule-based approaches, on the other hand, are often considered more interpretable but lack the parts-based interpretation. In this work, we present a version of the NMF approach that merges rule-based descriptions with the advantages of parts-based representation offered by NMF. Given numerical input data with non-negative entries and a set of rules with high entity coverage, the approach creates a lower-dimensional non-negative representation of the input data in such a way that its factors are described by the appropriate subset of the input rules. In addition to revealing important attributes for latent factors, it allows analyzing relations between these attributes and provides the exact numerical intervals or categorical values they take. The proposed approach provides numerous advantages in tasks such as focused embedding or performing supervised multi-label NMF.  ( 2 min )
    Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. (arXiv:2206.01246v1 [cond-mat.dis-nn])
    Generalization is one of the most important problems in deep learning (DL). In the overparameterized regime in neural networks, there exist many low-loss solutions that fit the training data equally well. The key question is which solution is more generalizable. Empirical studies showed a strong correlation between the flatness of the loss landscape at a solution and its generalizability, and stochastic gradient descent (SGD) is crucial in finding the flat solutions. To understand how SGD drives the learning system to flat solutions, we construct a simple model whose loss landscape has a continuous set of degenerate (or near-degenerate) minima. By solving the Fokker-Planck equation of the underlying stochastic learning dynamics, we show that due to its strong anisotropy the SGD noise introduces an additional effective loss term that decreases with flatness and has an overall strength that increases with the learning rate and batch-to-batch variation. We find that the additional landscape-dependent SGD loss breaks the degeneracy and serves as an effective regularization for finding flat solutions. Furthermore, a stronger SGD noise shortens the convergence time to the flat solutions. However, we identify an upper bound for the SGD noise beyond which the system fails to converge. Our results not only elucidate the role of SGD in generalization; they may also have important implications for hyperparameter selection, enabling efficient learning without divergence.  ( 2 min )
    Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages. (arXiv:2206.01205v1 [eess.AS])
    Automatic Speech Recognition (ASR) has increasing utility in the modern world. Many ASR models are available for languages with large amounts of training data, like English, but low-resource languages are poorly represented. In response, we create and release an open-licensed and formatted dataset of audio recordings of the Bible in low-resource northern Indian languages. We set up multiple experimental splits, and train and analyze two competitive ASR models to serve as a baseline for future research using this data.  ( 2 min )
    Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks. (arXiv:2206.01278v1 [cs.LG])
    A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that can be trained to the same accuracy as the dense network. However, the same does not hold at step 0, i.e. at random initialization. In this work, we seek to understand how this early phase of pre-training leads to a good initialization for IMP, both through the lens of the data distribution and the loss landscape geometry. Empirically we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pre-training only on "easy" training data, we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or a randomly chosen subset. Finally, we identify novel properties of the loss landscape of dense networks that are predictive of IMP performance, showing in particular that more examples being linearly mode connected in the dense network correlates well with good initializations for IMP. Combined, these results provide new insight into the role played by the early phase of training in IMP.  ( 2 min )
    Exponential Separations in Symmetric Neural Networks. (arXiv:2206.01266v1 [cs.LG])
    In this work we demonstrate a novel separation between symmetric neural network architectures. Specifically, we consider the Relational Network (Santoro et al., 2017) architecture as a natural generalization of the DeepSets (Zaheer et al., 2017) architecture, and study their representational gap. Under the restriction to analytic activation functions, we construct a symmetric function acting on sets of size $N$ with elements in dimension $D$, which can be efficiently approximated by the former architecture, but provably requires width exponential in $N$ and $D$ for the latter.  ( 2 min )
    Entangled Residual Mappings. (arXiv:2206.01261v1 [cs.LG])
    Residual mappings have been shown to perform representation learning in the first layers and iterative feature refinement in higher layers. This interplay, combined with their stabilizing effect on the gradient norms, enables them to train very deep networks. In this paper, we take a step further and introduce entangled residual mappings to generalize the structure of the residual connections and evaluate their role in iterative learning representations. An entangled residual mapping replaces the identity skip connections with specialized entangled mappings, such as orthogonal, sparse, and structural correlation matrices, that share key attributes (eigenvalues, structure, and Jacobian norm) with identity mappings. We show that while entangled mappings can preserve the iterative refinement of features across various deep models, they influence the representation learning process in convolutional networks differently than in attention-based models and recurrent neural networks. In general, we find that for CNNs and Vision Transformers, entangled sparse mappings can help generalization, while orthogonal mappings hurt performance. For recurrent networks, orthogonal residual mappings form an inductive bias for time-variant sequences, which degrades accuracy on time-invariant tasks.  ( 2 min )
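    The construction can be sketched by swapping the identity skip for a fixed structured matrix that shares its key spectral attributes. Below, a random orthogonal skip (eigenvalues on the unit circle, unit Jacobian norm) in a simple linear block; the sparse and correlation-structured variants would follow the same pattern, and the block layout is an illustrative assumption:

        import torch
        import torch.nn as nn

        class EntangledResidualBlock(nn.Module):
            def __init__(self, dim=64):
                super().__init__()
                self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))
                # Orthogonal skip mapping: eigenvalues on the unit circle and a
                # unit Jacobian norm, like the identity it replaces.
                q, _ = torch.linalg.qr(torch.randn(dim, dim))
                self.register_buffer("skip", q)

            def forward(self, x):
                return x @ self.skip.T + self.f(x)  # entangled skip, not x + f(x)

        block = EntangledResidualBlock()
        print(block(torch.randn(8, 64)).shape)   # torch.Size([8, 64])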
    Deep Learning Architecture Based Approach For 2D-Simulation of Microwave Plasma Interaction. (arXiv:2206.01263v1 [physics.comp-ph])
    This paper presents a convolutional neural network (CNN)-based deep learning model, inspired by UNet, with a series of encoder and decoder units with skip connections, for the simulation of microwave-plasma interaction. The microwave propagation characteristics in a complex plasma medium, pertaining to transmission, absorption and reflection, primarily depend on the ratio of the electromagnetic (EM) wave frequency to the electron plasma frequency, and on the plasma density profile. The scattering of a plane EM wave with fixed frequency (1 GHz) and amplitude incident on a plasma medium with different Gaussian density profiles (in the range $1\times 10^{17}$-$1\times 10^{22}\,m^{-3}$) has been considered. The training data associated with microwave-plasma interaction was generated using 2D-FDTD (Finite Difference Time Domain) based simulations. The trained deep learning model is then used to reproduce the scattered electric field values for the 1 GHz incident microwave on different plasma profiles with an error margin of less than 2%. We propose a complete deep learning (DL) based pipeline to train, validate and evaluate the model. We compare the results of the network, using various metrics like the SSIM index, average percent error and mean square error, with the physical data obtained from well-established FDTD-based EM solvers. To the best of our knowledge, this is the first effort towards exploring a DL-based approach for the simulation of complex microwave-plasma interaction. The deep learning technique proposed in this work is significantly faster than existing computational techniques, and can serve as a new, prospective and alternative computational approach for investigating microwave-plasma interaction in real time.  ( 2 min )
  • Open

    Rashomon Capacity: A Metric for Predictive Multiplicity in Probabilistic Classification. (arXiv:2206.01295v1 [cs.LG])
    Predictive multiplicity occurs when classification models with nearly indistinguishable average performances assign conflicting predictions to individual samples. When used for decision-making in applications of consequence (e.g., lending, education, criminal justice), models developed without regard for predictive multiplicity may result in unjustified and arbitrary decisions for specific individuals. We introduce a new measure of predictive multiplicity in probabilistic classification called Rashomon Capacity. Prior metrics for predictive multiplicity focus on classifiers that output thresholded (i.e., 0-1) predicted classes. In contrast, Rashomon Capacity applies to probabilistic classifiers, capturing more nuanced score variations for individual samples. We provide a rigorous derivation for Rashomon Capacity, argue its intuitive appeal, and demonstrate how to estimate it in practice. We show that Rashomon Capacity yields principled strategies for disclosing conflicting models to stakeholders. Our numerical experiments illustrate how Rashomon Capacity captures predictive multiplicity in various datasets and learning models, including neural networks. The tools introduced in this paper can help data scientists measure, report, and ultimately resolve predictive multiplicity prior to model deployment.
    Regularization-wise double descent: Why it occurs and how to eliminate it. (arXiv:2206.01378v1 [cs.LG])
    The risk of overparameterized models, in particular deep neural networks, is often double-descent shaped as a function of the model size. Recently, it was shown that the risk as a function of the early-stopping time can also be double-descent shaped, and this behavior can be explained as a superposition of bias-variance tradeoffs. In this paper, we show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength, both in theory and practice. We find that for linear regression, a double descent shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately. Motivated by this result, we study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer. Lastly, we study a 5-layer CNN and ResNet-18 trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and demonstrate that all exhibit double descent behavior as a function of the regularization strength.
    Rate-Optimal Online Convex Optimization in Adaptive Linear Control. (arXiv:2206.01426v1 [cs.LG])
    We consider the problem of controlling an unknown linear dynamical system under adversarially changing convex costs and full feedback of both the state and cost function. We present the first computationally-efficient algorithm that attains an optimal $\smash{\sqrt{T}}$-regret rate compared to the best stabilizing linear controller in hindsight, while avoiding stringent assumptions on the costs such as strong convexity. Our approach is based on a careful design of non-convex lower confidence bounds for the online costs, and uses a novel technique for computationally-efficient regret minimization of these bounds that leverages their particular non-convex structure.
    Optimal Activation Functions for the Random Features Regression Model. (arXiv:2206.01332v1 [stat.ML])
    The asymptotic mean squared test error and sensitivity of the Random Features Regression model (RFR) have been recently studied. We build on this work and identify in closed-form the family of Activation Functions (AFs) that minimize a combination of the test error and sensitivity of the RFR under different notions of functional parsimony. We find scenarios under which the optimal AFs are linear, saturated linear functions, or expressible in terms of Hermite polynomials. Finally, we show how using optimal AFs impacts well-established properties of the RFR model, such as its double descent curve, and the dependency of its optimal regularization parameter on the observation noise level.
    Excess risk analysis for epistemic uncertainty with application to variational inference. (arXiv:2206.01606v1 [stat.ML])
    We analyze the epistemic uncertainty (EU) of supervised learning in Bayesian inference by focusing on the excess risk. Existing analysis is limited to the Bayesian setting, which assumes a correct model and exact Bayesian posterior distribution. Thus we cannot apply the existing theory to modern Bayesian algorithms, such as variational inference. To address this, we present a novel EU analysis in the frequentist setting, where data is generated from an unknown distribution. We show a relation between the generalization ability and the widely used EU measurements, such as the variance and entropy of the predictive distribution. Then we show their convergence behaviors theoretically. Finally, we propose new variational inference that directly controls the prediction and EU evaluation performances based on the PAC-Bayesian theory. Numerical experiments show that our algorithm significantly improves the EU evaluation over the existing methods.
    The geometry of integration in text classification RNNs. (arXiv:2010.15114v2 [cs.LG] UPDATED)
    Despite the widespread application of recurrent neural networks (RNNs) across a variety of tasks, a unified understanding of how RNNs solve these tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of a specific natural language processing task: text classification. Using tools from dynamical systems analysis, we study recurrent networks trained on a battery of both natural and synthetic text classification tasks. We find the dynamics of these trained RNNs to be both interpretable and low-dimensional. Specifically, across architectures and datasets, RNNs accumulate evidence for each class as they process the text, using a low-dimensional attractor manifold as the underlying mechanism. Moreover, the dimensionality and geometry of the attractor manifold are determined by the structure of the training dataset; in particular, we describe how simple word-count statistics computed on the training dataset can be used to predict these properties. Our observations span multiple architectures and datasets, reflecting a common mechanism RNNs employ to perform text classification. To the degree that integration of evidence towards a decision is a common computational primitive, this work lays the foundation for using dynamical systems techniques to study the inner workings of RNNs.
    The price of unfairness in linear bandits with biased feedback. (arXiv:2203.09784v2 [math.ST] UPDATED)
    In this paper, we study the problem of fair sequential decision making with biased linear bandit feedback. At each round, a player selects an action described by a covariate and by a sensitive attribute. The perceived reward is a linear combination of the covariates of the chosen action, but the player only observes a biased evaluation of this reward, depending on the sensitive attribute. To characterize the difficulty of this problem, we design a phased elimination algorithm that corrects the unfair evaluations, and establish upper bounds on its regret. We show that the worst-case regret is smaller than $\mathcal{O}(\kappa_*^{1/3}\log(T)^{1/3}T^{2/3})$, where $\kappa_*$ is an explicit geometrical constant characterizing the difficulty of bias estimation. We prove lower bounds on the worst-case regret for some sets of actions showing that this rate is tight up to a possible sub-logarithmic factor. We also derive gap-dependent upper bounds on the regret, and matching lower bounds for some problem instance. Interestingly, these results reveal a transition between a regime where the problem is as difficult as its unbiased counterpart, and a regime where it can be much harder.
    Equipping Black-Box Policies with Model-Based Advice for Stable Nonlinear Control. (arXiv:2206.01341v1 [cs.LG])
    Machine-learned black-box policies are ubiquitous for nonlinear control problems. Meanwhile, crude model information is often available for these problems from, e.g., linear approximations of nonlinear dynamics. We study the problem of equipping a black-box control policy with model-based advice for nonlinear control on a single trajectory. We first show a general negative result that a naive convex combination of a black-box policy and a linear model-based policy can lead to instability, even if the two policies are both stabilizing. We then propose an adaptive $\lambda$-confident policy, with a coefficient $\lambda$ indicating the confidence in a black-box policy, and prove its stability. With bounded nonlinearity, in addition, we show that the adaptive $\lambda$-confident policy achieves a bounded competitive ratio when a black-box policy is near-optimal. Finally, we propose an online learning approach to implement the adaptive $\lambda$-confident policy and verify its efficacy in case studies about the CartPole problem and a real-world electric vehicle (EV) charging problem with data bias due to COVID-19.
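    The combination rule itself is a one-liner; what the paper adds is the stability analysis and an online update of the confidence coefficient, which this toy sketch replaces with a fixed value on made-up dynamics:

        import numpy as np

        def lambda_confident_action(x, black_box, model_based, lam):
            """Blend a black-box policy with model-based advice; lam in [0, 1]
            is the current confidence in the black-box policy."""
            return lam * black_box(x) + (1.0 - lam) * model_based(x)

        # Toy 1-D system: model-based advice is a stabilizing linear gain, and
        # the black-box policy is a stand-in for a learned controller.
        model_based = lambda x: -0.8 * x
        black_box = lambda x: -1.5 * np.tanh(x)

        x, lam = 2.0, 0.5
        for t in range(20):
            u = lambda_confident_action(x, black_box, model_based, lam)
            x = 1.1 * x + u        # unstable open loop, stabilized by the blend
        print(abs(x) < 1e-2)       # True: the blended controller drives x to 0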
    Pay attention to your loss: understanding misconceptions about 1-Lipschitz neural networks. (arXiv:2104.05097v5 [cs.LG] UPDATED)
    Lipschitz constrained networks have gathered considerable attention in the deep learning community, with usages ranging from Wasserstein distance estimation to the training of certifiably robust classifiers. However, they are still commonly considered less accurate, and their properties in learning are not fully understood. In this paper we clarify the matter: when it comes to classification, 1-Lipschitz neural networks enjoy several advantages over their unconstrained counterparts. First, we show that these networks are as accurate as classical ones, and can fit arbitrarily difficult boundaries. Then, relying on a robustness metric that reflects operational needs, we characterize the most robust classifier: the WGAN discriminator. Next, we show that 1-Lipschitz neural networks generalize well under milder assumptions. Finally, we show that hyper-parameters of the loss are crucial for controlling the accuracy-robustness trade-off. We conclude that they exhibit appealing properties to pave the way toward provably accurate, and provably robust, neural networks.
    Sequential Permutation Testing of Random Forest Variable Importance Measures. (arXiv:2206.01284v1 [stat.ME])
    Hypothesis testing of random forest (RF) variable importance measures (VIMP) remains the subject of ongoing research. Among recent developments, heuristic approaches to parametric testing have been proposed whose distributional assumptions are based on empirical evidence. Other formal tests under regularity conditions were derived analytically. However, these approaches can be computationally expensive or even practically infeasible. This problem also occurs with non-parametric permutation tests, which are, however, distribution-free and can generically be applied to any type of RF and VIMP. Embracing this advantage, it is proposed here to use sequential permutation tests and sequential p-value estimation to reduce the high computational costs associated with conventional permutation tests. The popular and widely used permutation VIMP serves as a practical and relevant application example. The results of simulation studies confirm that the theoretical properties of the sequential tests apply, that is, the type-I error probability is controlled at a nominal level and a high power is maintained with considerably fewer permutations needed in comparison to conventional permutation testing. The numerical stability of the methods is investigated in two additional application studies. In summary, theoretically sound sequential permutation testing of VIMP is possible at greatly reduced computational costs. Recommendations for application are given. A respective implementation is provided through the accompanying R package rfvimptest. The approach can also be easily applied to any kind of prediction model.
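    The computational saving comes from stopping the permutations early. The paper's sequential tests are formally justified; the generic Python sketch below shows only the mechanics, using the classic sample-until-h-exceedances scheme for sequential p-value estimation:

        import numpy as np

        def sequential_perm_pvalue(statistic, x, y, max_perm=10000, h=10, seed=0):
            """Estimate P(perm stat >= observed), stopping after h exceedances."""
            rng = np.random.default_rng(seed)
            observed = statistic(x, y)
            hits = 0
            for m in range(1, max_perm + 1):
                if statistic(x, rng.permutation(y)) >= observed:
                    hits += 1
                    if hits == h:          # early stop: p-hat = h / m
                        return h / m
            return (hits + 1) / (max_perm + 1)

        corr = lambda x, y: abs(np.corrcoef(x, y)[0, 1])
        rng = np.random.default_rng(1)
        x = rng.normal(size=200)
        y = rng.normal(size=200)           # independent: expect a large p-value
        print(sequential_perm_pvalue(corr, x, y))

    Large p-values (unimportant variables) trigger the stopping rule after only a handful of permutations, which is where the bulk of the savings comes from.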
    Generalization for multiclass classification with overparameterized linear models. (arXiv:2206.01399v1 [cs.LG])
    Via an overparameterized linear model with Gaussian features, we provide conditions for good generalization for multiclass classification of minimum-norm interpolating solutions in an asymptotic setting where both the number of underlying features and the number of classes scale with the number of training points. The survival/contamination analysis framework for understanding the behavior of overparameterized learning problems is adapted to this setting, revealing that multiclass classification qualitatively behaves like binary classification in that, as long as there are not too many classes (made precise in the paper), it is possible to generalize well even in some settings where the corresponding regression tasks would not generalize. Besides various technical challenges, it turns out that the key difference from the binary classification setting is that there are relatively fewer positive training examples of each class in the multiclass setting as the number of classes increases, making the multiclass problem "harder" than the binary one.
    HEX: Human-in-the-loop Explainability via Deep Reinforcement Learning. (arXiv:2206.01343v1 [cs.LG])
    The use of machine learning (ML) models in decision-making contexts, particularly those used in high-stakes decision-making, is fraught with issues and peril, since a person - not a machine - must ultimately be held accountable for the consequences of decisions made using such systems. Machine learning explainability (MLX) promises to provide decision-makers with prediction-specific rationale, assuring them that model-elicited predictions are made for the right reasons and are thus reliable. Few works explicitly consider this key human-in-the-loop (HITL) component, however. In this work we propose HEX, a human-in-the-loop deep reinforcement learning approach to MLX. HEX incorporates 0-distrust projection to synthesize decider-specific explanation-providing policies from any arbitrary classification model. HEX is also constructed to operate in limited or reduced training data scenarios, such as those employing federated learning. Our formulation explicitly considers the decision boundary of the ML model in question, rather than the underlying training data, which is a shortcoming of many model-agnostic MLX methods. Our proposed methods thus synthesize HITL MLX policies that explicitly capture the decision boundary of the model in question for use in limited data scenarios.
    Hybrid Models for Mixed Variables in Bayesian Optimization. (arXiv:2206.01409v1 [cs.LG])
    We systematically describe the problem of simultaneous surrogate modeling of mixed variables (i.e., continuous, integer and categorical variables) in the Bayesian optimization (BO) context. We provide a unified hybrid model using both Monte-Carlo tree search (MCTS) and Gaussian processes (GP) that encompasses and generalizes multiple state-of-the-art mixed BO surrogates. Based on this architecture, we propose applying a new dynamic model selection criterion among novel candidate families of covariance kernels, including non-stationary kernels and associated families. Different benchmark problems are studied and presented to support the superiority of our model, along with results highlighting the effectiveness of our method compared to most state-of-the-art mixed-variable methods in BO.
    Approximate Network Motif Mining Via Graph Learning. (arXiv:2206.01008v1 [cs.LG] CROSS LISTED)
    Frequent and structurally related subgraphs, also known as network motifs, are valuable features of many graph datasets. However, the high computational complexity of identifying motif sets in arbitrary datasets (motif mining) has limited their use in many real-world datasets. By automatically leveraging statistical properties of datasets, machine learning approaches have shown promise in several tasks with combinatorial complexity and are therefore a promising candidate for network motif mining. In this work we seek to facilitate the development of machine learning approaches aimed at motif mining. We propose a formulation of the motif mining problem as a node labelling task. In addition, we build benchmark datasets and evaluation metrics which test the ability of models to capture different aspects of motif discovery such as motif number, size, topology, and scarcity. Next, we propose MotiFiesta, a first attempt at solving this problem in a fully differentiable manner with promising results on challenging baselines. Finally, we demonstrate through MotiFiesta that this learning setting can be applied simultaneously to general-purpose data mining and interpretable feature extraction for graph classification tasks.
    Sample-Efficient Reinforcement Learning of Partially Observable Markov Games. (arXiv:2206.01315v1 [cs.LG])
    This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions that reveal incomplete information about the underlying state of system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when being compared against the optimal maximin policies. To our best knowledge, this work provides the first line of sample-efficient results for learning POMGs.
    Reinforcement Learning with Fast Stabilization in Linear Dynamical Systems. (arXiv:2007.12291v2 [cs.LG] UPDATED)
    In this work, we study model-based reinforcement learning (RL) in unknown stabilizable linear dynamical systems. When learning a dynamical system, one needs to stabilize the unknown dynamics in order to avoid system blow-ups. We propose an algorithm that certifies fast stabilization of the underlying system by effectively exploring the environment with an improved exploration strategy. We show that the proposed algorithm attains $\tilde{\mathcal{O}}(\sqrt{T})$ regret after $T$ time steps of agent-environment interaction. We also show that the regret of the proposed algorithm has only a polynomial dependence in the problem dimensions, which gives an exponential improvement over the prior methods. Our improved exploration method is simple, yet efficient, and it combines a sophisticated exploration policy in RL with an isotropic exploration strategy to achieve fast stabilization and improved regret. We empirically demonstrate that the proposed algorithm outperforms other popular methods in several adaptive control tasks.
    On the Benefits of Large Learning Rates for Kernel Methods. (arXiv:2202.13733v2 [stat.ML] UPDATED)
    This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. First observed in the deep learning literature, we show that this phenomenon can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates can be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why the benefit already appears in classification tasks without assuming any particular mismatch between the train and test data distributions.
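    The spectral effect described above can be reproduced on a toy quadratic. The sketch below (my own construction, not the paper's setup) runs gradient descent with early stopping in the Hessian eigenbasis, where the error along each eigenvector contracts as $(1 - \mathrm{lr}\cdot\lambda)^t$, so the learning rate decides which spectral components are fit first.

```python
# Toy version: gradient descent with early stopping on f(x) = 0.5 x'Hx - b'x.
import numpy as np

rng = np.random.default_rng(0)
eigvals = np.array([10.0, 1.0, 0.1])   # Hessian spectrum
H = np.diag(eigvals)                   # work directly in the eigenbasis
b = rng.normal(size=3)
x_star = b / eigvals                   # exact minimizer H^{-1} b

for lr in [0.02, 0.19]:                # both stable: lr < 2 / max(eigvals)
    x = np.zeros(3)
    for _ in range(50):                # "early stopping" after 50 steps
        x -= lr * (H @ x - b)
    print(f"lr={lr}: per-eigendirection error {np.abs(x - x_star)}")
```

    With the small learning rate, the low-curvature direction (eigenvalue 0.1) is barely fit after 50 steps; the large learning rate fits it much further.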
    Prescriptive maintenance with causal machine learning. (arXiv:2206.01562v1 [econ.GN])
    Machine maintenance is a challenging operational problem, where the goal is to plan sufficient preventive maintenance to avoid machine failures and overhauls. Maintenance is often imperfect in reality and does not make the asset as good as new. Although a variety of imperfect maintenance policies have been proposed in the literature, these rely on strong assumptions regarding the effect of maintenance on the machine's condition, assuming the effect is (1) deterministic or governed by a known probability distribution, and (2) machine-independent. This work proposes to relax both assumptions by learning the effect of maintenance conditional on a machine's characteristics from observational data on similar machines using existing methodologies for causal inference. By predicting the maintenance effect, we can estimate the number of overhauls and failures for different levels of maintenance and, consequently, optimize the preventive maintenance frequency to minimize the total estimated cost. We validate our proposed approach using real-life data on more than 4,000 maintenance contracts from an industrial partner. Empirical results show that our novel, causal approach accurately predicts the maintenance effect and results in individualized maintenance schedules that are more accurate and cost-effective than supervised or non-individualized approaches.
    Instance-dependent Label-noise Learning under a Structural Causal Model. (arXiv:2109.02986v3 [stat.ML] UPDATED)
    Label noise will degenerate the performance of deep learning algorithms because deep neural networks easily overfit label errors. Let X and Y denote the instance and clean label, respectively. When Y is a cause of X, according to which many datasets have been constructed, e.g., SVHN and CIFAR, the distributions of P(X) and P(Y|X) are entangled. This means that the unsupervised instances are helpful to learn the classifier and thus reduce the side effect of label noise. However, it remains elusive how to exploit the causal information to handle the label noise problem. In this paper, by leveraging a structural causal model, we propose a novel generative approach for instance-dependent label-noise learning. In particular, we show that properly modeling the instances will contribute to the identifiability of the label noise transition matrix and thus lead to a better classifier. Empirically, our method outperforms all state-of-the-art methods on both synthetic and real-world label-noise datasets.
    Understanding deep learning via decision boundary. (arXiv:2206.01515v1 [cs.LG])
    This paper finds that neural networks with lower decision boundary (DB) variability have better generalizability. Two new notions, algorithm DB variability and $(\epsilon, \eta)$-data DB variability, are proposed to measure the decision boundary variability from the algorithm and data perspectives. Extensive experiments show significant negative correlations between the decision boundary variability and the generalizability. From the theoretical view, two lower bounds based on algorithm DB variability are proposed; they do not explicitly depend on the sample size. We also prove an upper bound of order $\mathcal{O}\left(\frac{1}{\sqrt{m}}+\epsilon+\eta\log\frac{1}{\eta}\right)$ based on data DB variability. The bound is convenient to estimate since it requires no labels, and does not explicitly depend on the network size, which is usually prohibitively large in deep learning.
    Beyond Tabula Rasa: Reincarnating Reinforcement Learning. (arXiv:2206.01626v1 [cs.LG])
    Learning tabula rasa, that is without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems, when applied to large-scale settings, rarely operate tabula rasa. Such large-scale systems undergo multiple design or algorithmic changes during their development cycle and use ad hoc approaches for incorporating these changes without re-training from scratch, which would have been prohibitively expensive. Additionally, the inefficiency of deep RL typically excludes researchers without access to industrial-scale resources from tackling computationally-demanding problems. To address these issues, we present reincarnating RL as an alternative workflow, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another. As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Overall, this work argues for an alternative approach to RL research, which we believe could significantly improve real-world RL adoption and help democratize it further.
    Learning with convolution and pooling operations in kernel methods. (arXiv:2111.08308v2 [stat.ML] UPDATED)
    Recent empirical work has shown that hierarchical convolutional kernels inspired by convolutional neural networks (CNNs) significantly improve the performance of kernel methods in image classification tasks. A widely accepted explanation for their success is that these architectures encode hypothesis classes that are suitable for natural images. However, understanding the precise interplay between approximation and generalization in convolutional architectures remains a challenge. In this paper, we consider the stylized setting of covariates (image pixels) uniformly distributed on the hypercube, and characterize exactly the RKHS of kernels composed of single layers of convolution, pooling, and downsampling operations. We use this characterization to compute sharp asymptotics of the generalization error for any given function in high-dimension. In particular, we quantify the gain in sample complexity brought by enforcing locality with the convolution operation and approximate translation invariance with average pooling. Notably, these results provide a precise description of how convolution and pooling operations trade off approximation with generalization power in one layer convolutional kernels.
    Rethinking Class-Prior Estimation for Positive-Unlabeled Learning. (arXiv:2002.03673v2 [cs.LG] UPDATED)
    Given only positive (P) and unlabeled (U) data, PU learning can train a binary classifier without any negative data. It has two building blocks: PU class-prior estimation (CPE) and PU classification; the latter has been well studied while the former has received less attention. Hitherto, the distributional-assumption-free CPE methods rely on a critical assumption that the support of the positive data distribution cannot be contained in the support of the negative data distribution. If this is violated, those CPE methods will systematically overestimate the class prior; worse still, the assumption cannot be verified from the data. In this paper, we rethink CPE for PU learning: can we remove the assumption to make CPE always valid? We show an affirmative answer by proposing Regrouping CPE (ReCPE) that builds an auxiliary probability distribution such that the support of the positive data distribution is never contained in the support of the negative data distribution. ReCPE can work with any CPE method by treating it as the base method. Theoretically, ReCPE does not affect its base if the assumption already holds for the original probability distribution; otherwise, it reduces the positive bias of its base. Empirically, ReCPE improves all state-of-the-art CPE methods on various datasets, implying that the assumption is indeed violated on these datasets.
    Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games. (arXiv:2206.01588v1 [cs.LG])
    We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions due to the varying opponent pose a significant challenge. In light of a recent hardness result \citep{liu2022learning}, we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, \underline{D}ecentralized \underline{O}ptimistic hype\underline{R}policy m\underline{I}rror de\underline{S}cent (DORIS), which achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a \textit{hyperpolicy} which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.
    An alternative approach to train neural networks using monotone variational inequality. (arXiv:2202.08876v2 [stat.ML] UPDATED)
    Despite the vast empirical success of neural networks, theoretical understanding of the training procedures remains limited, especially in providing guarantees on testing performance, due to the non-convex nature of the optimization problem. The current paper investigates an alternative approach to neural network training by reducing it to another problem with convex structure -- solving a monotone variational inequality (MVI) -- inspired by recent work of Juditsky & Nemirovsky (2019). The solution to MVI can be found by computationally efficient procedures, and importantly, this leads to performance guarantees in the form of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery and prediction accuracy under the theoretical setting of training a single-layer linear neural network. In addition, we study the use of MVI for training multi-layer neural networks and propose a practical algorithm called \textit{stochastic variational inequality} (SVI), and demonstrate its applicability in training fully-connected neural networks and graph neural networks (GNN) (SVI is completely general and can be used to train other types of neural networks). We demonstrate the competitive or better performance of SVI compared to widely-used stochastic gradient descent methods on both synthetic and real network data prediction tasks across various performance metrics, especially the improved efficiency in the early stage of training.
    BaCaDI: Bayesian Causal Discovery with Unknown Interventions. (arXiv:2206.01665v1 [cs.LG])
    Learning causal structures from observation and experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, a key challenge is that often the targets of the interventions are uncertain or unknown. Thus, standard causal discovery methods can no longer be used. To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering the causal structure that underlies data generated under various unknown experimental/interventional conditions. BaCaDI is fully differentiable and operates in the continuous space of latent probabilistic representations of both causal structures and interventions. This enables us to approximate complex posteriors via gradient-based variational inference and to reason about the epistemic uncertainty in the predicted structure. In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets. Finally, we demonstrate that, thanks to its rigorous Bayesian approach, our method provides well-calibrated uncertainty estimates.
    Robust Multi-Objective Bayesian Optimization Under Input Noise. (arXiv:2202.07549v4 [cs.LG] UPDATED)
    Bayesian optimization (BO) is a sample-efficient approach for tuning design parameters to optimize expensive-to-evaluate, black-box performance metrics. In many manufacturing processes, the design parameters are subject to random input noise, resulting in a product that is often less performant than expected. Although BO methods have been proposed for optimizing a single objective under input noise, no existing method addresses the practical scenario where there are multiple objectives that are sensitive to input perturbations. In this work, we propose the first multi-objective BO method that is robust to input noise. We formalize our goal as optimizing the multivariate value-at-risk (MVaR), a risk measure of the uncertain objectives. Since directly optimizing MVaR is computationally infeasible in many settings, we propose a scalable, theoretically-grounded approach for optimizing MVaR using random scalarizations. Empirically, we find that our approach significantly outperforms alternative methods and efficiently identifies optimal robust designs that will satisfy specifications across multiple metrics with high probability.
    MCD: Marginal Contrastive Discrimination for conditional density estimation. (arXiv:2206.01592v1 [stat.ML])
    We consider the problem of conditional density estimation, a major topic of interest in statistics and machine learning. Our method, called Marginal Contrastive Discrimination (MCD), reformulates the conditional density function as the product of two factors: the marginal density function of the target variable and a ratio of density functions which can be estimated through binary classification. Like noise-contrastive methods, MCD can leverage state-of-the-art supervised learning techniques, including neural networks, to perform conditional density estimation. Our benchmark reveals that our method significantly outperforms existing methods in practice on most density models and regression datasets.
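    The classification trick underlying this factorization can be sketched in a few lines: train a classifier to distinguish true pairs $(x, y)$ from pairs with $y$ shuffled; with balanced classes, its log-odds estimate $\log p(x,y)/(p(x)p(y))$, which combined with the marginal of $y$ gives the conditional density. The code below is a minimal illustration of the idea, not the authors' implementation.

```python
# Density-ratio estimation by binary classification (minimal sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)                     # toy data: y | x ~ N(2x, 1)

joint = np.column_stack([x, y])                      # label 1: true pairs
product = np.column_stack([x, rng.permutation(y)])   # label 0: y shuffled

X = np.vstack([joint, product])
z = np.r_[np.ones(n), np.zeros(n)]
clf = make_pipeline(PolynomialFeatures(2), LogisticRegression(max_iter=1000))
clf.fit(X, z)

def log_density_ratio(x0, y0):
    """log p(x,y)/(p(x)p(y)) ~= classifier log-odds (classes are balanced)."""
    p = clf.predict_proba([[x0, y0]])[0, 1]
    return np.log(p / (1 - p))

print(log_density_ratio(1.0, 2.0))    # high: y ~ 2x is a plausible pair
print(log_density_ratio(1.0, -2.0))   # low: implausible pair
```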
    Efficient Mean Estimation with Pure Differential Privacy via a Sum-of-Squares Exponential Mechanism. (arXiv:2111.12981v2 [cs.DS] UPDATED)
    We give the first polynomial-time algorithm to estimate the mean of a $d$-variate probability distribution with bounded covariance from $\tilde{O}(d)$ independent samples subject to pure differential privacy. Prior algorithms for this problem either incur exponential running time, require $\Omega(d^{1.5})$ samples, or satisfy only the weaker concentrated or approximate differential privacy conditions. In particular, all prior polynomial-time algorithms require $d^{1+\Omega(1)}$ samples to guarantee small privacy loss with "cryptographically" high probability, $1-2^{-d^{\Omega(1)}}$, while our algorithm retains $\tilde{O}(d)$ sample complexity even in this stringent setting. Our main technique is a new approach to using the powerful Sum of Squares method (SoS) to design differentially private algorithms. Turning SoS proofs into algorithms is a key theme in numerous recent works in high-dimensional algorithmic statistics -- estimators which apparently require exponential running time but whose analysis can be captured by low-degree Sum of Squares proofs can be automatically turned into polynomial-time algorithms with the same provable guarantees. We demonstrate a similar proofs-to-private-algorithms phenomenon: instances of the workhorse exponential mechanism which apparently require exponential time but which can be analyzed with low-degree SoS proofs can be automatically turned into polynomial-time differentially private algorithms. We prove a meta-theorem capturing this phenomenon, which we expect to be of broad use in private algorithm design. Our techniques also draw new connections between differentially private and robust statistics in high dimensions. In particular, viewed through our proofs-to-private-algorithms lens, several well-studied SoS proofs from recent works in algorithmic robust statistics directly yield key components of our differentially private mean estimation algorithm.
    Nonstationary Bandit Learning via Predictive Sampling. (arXiv:2205.01970v2 [cs.LG] UPDATED)
    Although Thompson sampling is widely used in stationary environments, it does not effectively account for nonstationarities. To address this limitation, we propose predictive sampling, a policy that balances between exploration and exploitation in nonstationary bandit environments. It is equivalent to Thompson sampling when specialized to stationary environments, but much more effective across a range of nonstationary environments because it deprioritizes investment in acquiring information that will quickly lose relevance. To offer insight into the efficacy of predictive sampling, we establish a regret bound. This bound highlights dependence on the rate at which new information arrives to alter the environment. In addition, we conduct experiments on bandit environments with varying rates of information arrival and observe that predictive sampling outperforms Thompson sampling.
    Truly Mesh-free Physics-Informed Neural Networks. (arXiv:2206.01545v1 [cs.LG])
    Physics-informed Neural Networks (PINNs) have recently emerged as a principled way to include prior physical knowledge in form of partial differential equations (PDEs) into neural networks. Although generally viewed as being mesh-free, current approaches still rely on collocation points obtained within a bounded region, even in settings with spatially sparse signals. Furthermore, if the boundaries are not known, the selection of such a region may be arbitrary, resulting in a large proportion of collocation points being selected in areas of low relevance. To resolve this, we present a mesh-free and adaptive approach termed particle-density PINN (pdPINN), which is inspired by the microscopic viewpoint of fluid dynamics. Instead of sampling from a bounded region, we propose to sample directly from the distribution over the (fluid) particle positions, eliminating the need to introduce boundaries while adaptively focusing on the most relevant regions. This is achieved by reformulating the modeled fluid density as an unnormalized probability distribution from which we sample with dynamic Monte Carlo methods. We further generalize pdPINNs to different settings that allow interpreting a positive scalar quantity as a particle density, such as the evolution of the temperature in the heat equation. The utility of our approach is demonstrated on experiments for modeling (non-steady) compressible fluids in up to three dimensions and a two-dimensional diffusion problem, illustrating the high flexibility and sample efficiency compared to existing refinement methods for PINNs.
    ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero. (arXiv:1902.04522v5 [cs.AI] UPDATED)
    The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training and in the gameplay inference procedures. Our code, models, selfplay datasets, and auxiliary data are publicly available at https://ai.facebook.com/tools/elf-opengo/.
    Adaptive Learning for Discovery. (arXiv:2205.14829v2 [stat.ML] UPDATED)
    In this paper, we study a sequential decision-making problem, called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label the points with the goal to maximize the sum of responses. This problem has wide applications to real-world discovery problems, for example drug discovery with the help of machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma. The algorithm needs to choose points that yield information to improve model estimates but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.
    KCRL: Krasovskii-Constrained Reinforcement Learning with Guaranteed Stability in Nonlinear Dynamical Systems. (arXiv:2206.01704v1 [cs.LG])
    Learning a dynamical system requires stabilizing the unknown dynamics to avoid state blow-ups. However, current reinforcement learning (RL) methods lack stabilization guarantees, which limits their applicability for the control of safety-critical systems. We propose a model-based RL framework with formal stability guarantees, Krasovskii Constrained RL (KCRL), that adopts Krasovskii's family of Lyapunov functions as a stability constraint. The proposed method learns the system dynamics up to a confidence interval using feature representation, e.g. Random Fourier Features. It then solves a constrained policy optimization problem with a stability constraint based on Krasovskii's method using a primal-dual approach to recover a stabilizing policy. We show that KCRL is guaranteed to learn a stabilizing policy in a finite number of interactions with the underlying unknown system. We also derive the sample complexity upper bound for stabilization of unknown nonlinear dynamical systems via the KCRL framework.
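    For context, Random Fourier Features are the standard Rahimi-Recht construction: draw $W$ from the kernel's spectral density and set $z(x) = \sqrt{2/D}\,\cos(Wx + b)$, so that $z(x)^\top z(x') \approx k(x, x')$. A minimal sketch (not the paper's code):

```python
# Random Fourier Features approximating an RBF kernel exp(-gamma ||x - x'||^2).
import numpy as np

def rff(X, D=500, gamma=1.0, seed=0):
    """Map rows of X to D random Fourier features."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # spectral density N(0, 2*gamma*I)
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(5, 3))
Z = rff(X)
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(Z @ Z.T - exact).max())   # small, and shrinks as D grows
```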
    Offline Reinforcement Learning with Causal Structured World Models. (arXiv:2206.01474v1 [cs.LG])
    Model-based methods have recently shown promise for offline reinforcement learning (RL), aiming to learn good policies from historical data without interacting with the environment. Previous model-based offline RL methods learn fully connected nets as world-models that map the states and actions to the next-step states. However, it is sensible that a world-model should adhere to the underlying causal effect such that it will support learning an effective policy that generalizes well to unseen states. In this paper, we first provide theoretical results that causal world-models can outperform plain world-models for offline RL by incorporating the causal structure into the generalization error bound. We then propose a practical algorithm, oFfline mOdel-based reinforcement learning with CaUsal Structure (FOCUS), to illustrate the feasibility of learning and leveraging causal structure in offline RL. Experimental results on two benchmarks show that FOCUS reconstructs the underlying causal structure accurately and robustly. Consequently, it performs better than the plain model-based offline RL algorithms and other causal model-based RL algorithms.  ( 2 min )
    Optimal Weak to Strong Learning. (arXiv:2206.01563v1 [cs.LG])
    The classic algorithm AdaBoost allows one to convert a weak learner, that is, an algorithm producing a hypothesis only slightly better than chance, into a strong learner that achieves arbitrarily high accuracy when given enough training data. We present a new algorithm that constructs a strong learner from a weak learner but uses less training data than AdaBoost and all other weak-to-strong learners to achieve the same generalization bounds. A sample complexity lower bound shows that our new algorithm uses the minimum possible amount of training data and is thus optimal. Hence, this work settles the sample complexity of the classic problem of constructing a strong learner from a weak learner.
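    For reference, the classic weak-to-strong pipeline being improved on here is AdaBoost over decision stumps. A standard scikit-learn rendition follows (assuming scikit-learn >= 1.2, where the argument is named estimator); the paper's new sample-optimal algorithm is different and not shown.

```python
# AdaBoost: boosting a depth-1 decision stump (the weak learner) into a
# strong classifier. Standard textbook construction via scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)
strong = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)
strong.fit(X_tr, y_tr)

print("stump alone:", stump.fit(X_tr, y_tr).score(X_te, y_te))
print("boosted:    ", strong.score(X_te, y_te))
```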
    Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares. (arXiv:2206.01274v1 [stat.ML])
    Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
    Indirect Active Learning. (arXiv:2206.01454v1 [math.ST])
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.
    Causal Transformer for Estimating Counterfactual Outcomes. (arXiv:2204.07258v2 [cs.LG] UPDATED)
    Estimating counterfactual outcomes over time from observational data is relevant for many applications (e.g., personalized medicine). Yet, state-of-the-art methods build upon simple long short-term memory (LSTM) networks, thus rendering inferences for complex, long-range dependencies challenging. In this paper, we develop a novel Causal Transformer for estimating counterfactual outcomes over time. Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders. For this, we combine three transformer subnetworks with separate inputs for time-varying covariates, previous treatments, and previous outcomes into a joint network with in-between cross-attentions. We further develop a custom, end-to-end training procedure for our Causal Transformer. Specifically, we propose a novel counterfactual domain confusion loss to address confounding bias: it aims to learn adversarial balanced representations, so that they are predictive of the next outcome but non-predictive of the current treatment assignment. We evaluate our Causal Transformer based on synthetic and real-world datasets, where it achieves superior performance over current baselines. To the best of our knowledge, this is the first work proposing transformer-based architecture for estimating counterfactual outcomes from longitudinal data.
    Hypothesis testing for matched pairs with missing data by maximum mean discrepancy: An application to continuous glucose monitoring. (arXiv:2206.01590v1 [stat.ME])
    A frequent problem in statistical science is how to properly handle missing data in matched paired observations. There is a large body of literature addressing the univariate case. Yet, the ongoing technological progress in measuring biological systems raises the need for addressing more complex data, e.g., graphs, strings and probability distributions, among others. In order to fill this gap, this paper proposes new estimators of the maximum mean discrepancy (MMD) to handle complex matched pairs with missing data. These estimators can detect differences in data distributions under different missingness mechanisms. The validity of this approach is proven and statistical consistency results are provided; the approach is further studied in an extensive simulation study. Data from continuous glucose monitoring in a longitudinal population-based diabetes study are used to illustrate the application of this approach. By employing the new distributional representations together with cluster analysis, new clinical criteria on how glucose changes vary at the distributional level over five years can be explored.
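    For readers unfamiliar with MMD, the standard unbiased estimator of $\mathrm{MMD}^2$ with an RBF kernel takes a few lines of numpy (this is the classical Gretton et al. formula; the paper's new estimators for matched pairs with missing data build on it but are not reproduced here):

```python
# Unbiased MMD^2 between two samples, RBF kernel.
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """X: (n, d), Y: (m, d). Returns the unbiased MMD^2 estimate."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop i = j terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(1.0, 1.0, (200, 2)))
print(same, diff)   # near zero vs. clearly positive
```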
    Accelerated first-order methods for convex optimization with locally Lipschitz continuous gradient. (arXiv:2206.01209v1 [math.OC])
    In this paper we develop accelerated first-order methods for convex optimization with locally Lipschitz continuous gradient (LLCG), which is beyond the well-studied class of convex optimization with Lipschitz continuous gradient. In particular, we first consider unconstrained convex optimization with LLCG and propose accelerated proximal gradient (APG) methods for solving it. The proposed APG methods are equipped with a verifiable termination criterion and enjoy an operation complexity of ${\cal O}(\varepsilon^{-1/2}\log \varepsilon^{-1})$ and ${\cal O}(\log \varepsilon^{-1})$ for finding an $\varepsilon$-residual solution of an unconstrained convex and strongly convex optimization problem, respectively. We then consider constrained convex optimization with LLCG and propose a first-order proximal augmented Lagrangian method for solving it by applying one of our proposed APG methods to approximately solve a sequence of proximal augmented Lagrangian subproblems. The resulting method is equipped with a verifiable termination criterion and enjoys an operation complexity of ${\cal O}(\varepsilon^{-1}\log \varepsilon^{-1})$ and ${\cal O}(\varepsilon^{-1/2}\log \varepsilon^{-1})$ for finding an $\varepsilon$-KKT solution of a constrained convex and strongly convex optimization problem, respectively. All the proposed methods in this paper are parameter-free or almost parameter-free, except that knowledge of the convexity parameter is required. To the best of our knowledge, no prior studies were conducted to investigate accelerated first-order methods with complexity guarantees for convex optimization with LLCG. All the complexity results obtained in this paper are entirely new.
  • Open

    An idea to test AI safety is "a virtual world honey pot" (not fleshed out)
    I doubt this idea is unique. And I'm unsure it is workable, but hear me out. When an AI becomes smarter than humans, I (and others) see a potential safety problem. I propose we sandbox the AI in a small virtual world and somehow give the AI the impression that it is reality. I also think the AI should become smarter than the virtual humans (another AI, probably) over time and be given more and more decision-making power, like the scenario that could happen in the real world as AI produces more results/profit and is more trusted. This virtual world is a honeypot of sorts. You could potentially test different safety measures such as shut-off functionality, optimization functions, and adherence to ethical systems. You could put the AI you want to use in there without problems, but I see the fake humans, and how realistic the small world is, being a problem for the realism of the results. Does this concept have a name? Any ideas to make it more workable? Feel free to point out why it won't work. submitted by /u/TransracialAsian [link] [comments]  ( 1 min )
    Magic Spells - Neural-Art Experiment [4K60]
    submitted by /u/MLInsights [link] [comments]

  • Open

    "The big new idea for making self-driving cars that can go anywhere: The mainstream approach to driverless cars is slow and difficult. These startups think going all-in on AI will get there faster"
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Planning with Diffusion for Flexible Behavior Synthesis", Janner
    submitted by /u/gwern [link] [comments]
    "3RL: Task-Agnostic Continual Reinforcement Learning: In Praise of a Simple Baseline", Caccia et al 2022 {Amazon} (were complicated lifelong learning mechanisms ever necessary?)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Boosting Search Engines with Interactive Agents", Ciaramita et al 2022 {G} (MuZero & Decision-Transformer T5 for sequences of queries)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Some help in streaming large dataset for RL
    Hi, everyone! I'm struggling searching for information regarding a particular training setup. I would like to use reinforcement learning on a large historic dataset, i.e. offline RL. The data currently is between 1 and 5 GB per day and it goes back ~20 years. I am not able to store the full dataset locally as it will soon more than double in size. I can access the data via an API which can stream the entire history at 20 million messages per second. This means that retrieving the data on the fly and also replaying everything isn't an issue. How do I go about integrating the API call into my project? What are the terms used to describe what I'm trying to do, so I can find the relevant documentation? For now, either Tensorflow or PyTorch frameworks are fine for me. But if other frameworks are preferable to achieve what I want please feel free to suggest. Thanks a lot in advance for any response. submitted by /u/gigio_s [link] [comments]  ( 2 min )
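    The search terms you want are "offline RL", "experience replay from logged data", and, for the plumbing, "streaming" or "iterable" datasets. One common PyTorch pattern is to wrap the API in an IterableDataset so batches are pulled on demand and nothing is materialized on disk. In the sketch below, fetch_day and its message format are hypothetical stand-ins for your API client.

```python
# Streaming logged data into an offline-RL training loop (sketch).
import torch
from torch.utils.data import DataLoader, IterableDataset

def fetch_day(day):
    """Hypothetical placeholder: stream decoded messages for one day."""
    for _ in range(1000):          # stand-in for real API messages
        yield [0.0, 1.0, 2.0]

class ReplayStream(IterableDataset):
    def __init__(self, days):
        self.days = days           # e.g. a list of date strings

    def __iter__(self):
        for day in self.days:
            for msg in fetch_day(day):
                yield torch.tensor(msg, dtype=torch.float32)

loader = DataLoader(ReplayStream(["2022-06-01"]), batch_size=256)
for batch in loader:
    pass                           # feed each batch to the offline-RL learner
```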
    Hedging derivatives with Deep RL
    submitted by /u/SatoshiNotMe [link] [comments]  ( 1 min )
    Human based reward mechanisms
    I was wondering where I could find the latest research into human based reward systems. I’m more specifically curious to find whether this has any application to subjective domains such as music and art where a well defined reward is hard to establish and where direct human input seems necessary. submitted by /u/Lopside1 [link] [comments]  ( 2 min )
  • Open

    Comparing data fabrics, data meshes and knowledge graphs
    Vendors, consultants, and their clients have been talking in data fabric terms for close to a decade now, if not longer. If “big data” was the problem to solve, then a data fabric suggested a ready solution. John Mashey, then chief scientist at Silicon Graphics, used the term “big data” to describe the wave of… Read More » Comparing data fabrics, data meshes and knowledge graphs The post Comparing data fabrics, data meshes and knowledge graphs appeared first on Data Science Central.  ( 5 min )
  • Open

    LingHacks IV!!!
    Signups for LingHacks IV are now open! What: LingHacks is the world’s first high school computational linguistics hackathon (24-hour invention competition), where you come together in teams to build a software or hardware project that solves a scientific or social problem. No experience is needed, and we'll provide exciting swag, cool prizes, workshops, and mentorship for you to gain skills in computer science, artificial intelligence, machine learning, and the exciting field of natural language processing. We’re proud to announce that this hackathon is completely virtual! Attendees: sign up here! When & Where: LingHacks IV will take place from June 18th, 2022 at 9 am CST to June 21st, 2022 VIRTUALLY! Thanks to our sponsors, the event is COMPLETELY FREE! Learn more about the hackathon on our website here! submitted by /u/Broad_Way_554 [link] [comments]  ( 1 min )
    Hack3: The Leading Online Hackathon for High Schoolers!
    Attention, curious high schoolers! Hack3 is hosting an online hackathon for high schoolers for 24 hours on June 25-26. In 2020, we connected nearly 300 students of all skill levels to learn to build innovative projects that positively impacted the world. Over 100 attended our free classes led by industry professionals to learn new skills. Over twenty mentors were in our help desk to help participants when they needed help. That year, our judges, mentors, and workshop instructors were affiliated with the likes of Stanford, Harvard, Amazon, NetApp, and Wikipedia. In 2021, we connected over 350 students of all skill levels to learn to build innovative projects that positively impacted the world. Over 150 attended our free classes led by industry professionals to learn new skills. Over 30 mentors were in our help desk to help participants when they needed help. Last year, our judges, mentors, and workshop instructors were affiliated with the likes of Amazon, NetApp, Balsamiq, Nexus Bytes, Replit, Postman, and Wolfram Language. This year, with the lessons learned from 2021, we aim to host a competition consisting of over 500 participants, while targeting underprivileged communities around the world. To help achieve our goal of providing a learning opportunity for everyone, we will be sponsoring internet access to those who need it to truly level the playing field for all. Are you down? Register on DevPost here. submitted by /u/thegreatestgemini [link] [comments]  ( 1 min )
    A new study from Deepmind has found that Transformers can achieve few-shot learning without being explicitly trained for it. The research shows that FSL emerges only when the training data is distributed in particular ways that are also observed in natural domains like language.
    The capacity of large transformer-based language models to do few-shot learning is intriguing. These models can generalize from a few samples of a new topic that they haven’t been trained on before. Previous research in the field of meta-learning has shown how neural networks can execute few-shot learning from a few examples without the requirement for weight updates – this is also known as in-context learning because the output is conditioned on the context. To do this, the Deepmind researchers created a training program that specifically encourages in-context learning, a technique known as meta-training. The capacity for in-context learning in transformer language models, on the other hand, is emergent. Few-shot learning isn’t directly addressed in the model’s transformer architecture or learning aim. This idea was inspired by the discovery that many natural data sources, including natural language, deviate from standard supervised datasets in a few significant ways. Natural data, for example, is ‘bursty’ in terms of time. That is, a given entity (word, person, item, etc.) tends to appear in clusters rather than being distributed uniformly across time. These results give insight into why FSL is seen in large language models, and how we might achieve emergent FSL in other domains! Continue reading | Check out the paper submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Magic Battle Between Two Wizards (V2) - AI Experimental Story w/ GPT-3 [4K 60 FPS]
    submitted by /u/MLInsights [link] [comments]  ( 1 min )
    Scientific term papers: use GPT-3 as inspiration, to reformulate texts, to make connections between scientific thoughts.
    I still have to write some scientific term papers. For this I would like to use artificial intelligence: as inspiration, to reformulate texts, and to make connections between scientific thoughts. Some services charge $60-100 per month. I would like to implement the whole thing for free, though I can invest a little. Does anyone know a good workflow to make these three points happen? Who can inspire me in my approach? submitted by /u/Apfelbluetenstecher [link] [comments]  ( 2 min )
    How to make an ai generate a random sentence
    I’ve seen things where ai can generate things like pictures, and scripts, after being shown other pictures and scripts. I want to learn how this works, but I’d like to make something simple. I’d like to make an ai that can generate sentences, based off of the sentences it has been given. How would I go about doing this? submitted by /u/confusionPrice [link] [comments]  ( 2 min )
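    The simplest version of this is a Markov chain over words: record which word follows which in the training sentences, then walk the table. A self-contained sketch:

```python
# First-order Markov chain sentence generator.
import random
from collections import defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]

chain = defaultdict(list)   # word -> words observed to follow it
starts = []
for sentence in corpus:
    words = sentence.split()
    starts.append(words[0])
    for a, b in zip(words, words[1:]):
        chain[a].append(b)

def generate(max_len=10):
    word = random.choice(starts)
    out = [word]
    while word in chain and len(out) < max_len:
        word = random.choice(chain[word])
        out.append(word)
    return " ".join(out)

print(generate())   # e.g. "the cat chased the dog sat on the rug"
```

    Modern picture and script generators use neural networks rather than lookup tables, but the core idea of learning to predict what comes next from examples is the same.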
    A Drop of Water - .. And it Sparkles in The Sunlight! [4K 60 FPS] AI Latent-Space Experiment
    submitted by /u/MLInsights [link] [comments]
    5+ Best Computer Vision Courses to know 2022 | Learn Computer Vision
    submitted by /u/Lakshmireddys [link] [comments]
    Snowflake: Each One Is Unique! [4K 60 FPS] AI ART
    submitted by /u/MLInsights [link] [comments]
  • Open

    [D] Semantic Hashing: are there any learning resources about it?
    Is there any book or tutorial paper about this technique that also shows details on how to implement it? I've been diving into the original paper (https://www.cs.utoronto.ca/~rsalakhu/papers/semantic_final.pdf), but it is not only quite dense, it also says little about implementation. PS: tried YouTube too, but there's very little about this technique. submitted by /u/SimpatoYamasaki [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 139
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 131-140 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 131 Week 2 Week 12 Week 22 Week 32 Week 42 Week 52 Week 62 Week 72 Week 82 Week 92 Week 102 Week 112 Week 122 Week 132 Week 3 Week 13 Week 23 Week 33 Week 43 Week 53 Week 63 Week 73 Week 83 Week 93 Week 103 Week 113 Week 123 Week 133 Week 4 Week 14 Week 24 Week 34 Week 44 Week 54 Week 64 Week 74 Week 84 Week 94 Week 104 Week 114 Week 124 Week 134 Week 5 Week 15 Week 25 Week 35 Week 45 Week 55 Week 65 Week 75 Week 85 Week 95 Week 105 Week 115 Week 125 Week 135 Week 6 Week 16 Week 26 Week 36 Week 46 Week 56 Week 66 Week 76 Week 86 Week 96 Week 106 Week 116 Week 126 Week 136 Week 7 Week 17 Week 27 Week 37 Week 47 Week 57 Week 67 Week 77 Week 87 Week 97 Week 107 Week 117 Week 127 Week 137 Week 8 Week 18 Week 28 Week 38 Week 48 Week 58 Week 68 Week 78 Week 88 Week 98 Week 108 Week 118 Week 128 Week 138 Week 9 Week 19 Week 29 Week 39 Week 49 Week 59 Week 69 Week 79 Week 89 Week 99 Week 109 Week 119 Week 129 Week 10 Week 20 Week 30 Week 40 Week 50 Week 60 Week 70 Week 80 Week 90 Week 100 Week 110 Week 120 Week 130 Most upvoted papers two weeks ago: /u/master3243: https://imagen.research.google/ Besides that, there are no rules, have fun. submitted by /u/ML_WAYR_bot [link] [comments]  ( 1 min )
    [D] Data Science with Anaconda on 50% off today!
    submitted by /u/alimhabidi [link] [comments]
    Recent grad looking for career advice [discussion]
    Just recently graduated university and am looking to get into the AI field. I've had a few AI courses and have fairly basic coding skills. Finding it almost impossible to get any type of paid job related to the field. Wondering if anyone has advice on what steps I should take to increase my chances of getting a job. Thinking about doing some certifications, boot camps, or finishing up my master's. submitted by /u/NoMixs [link] [comments]  ( 1 min )
    Meta Gradient Descent [D]
    I am using a modified gradient descent algorithm that I invented, but it is very simple and I assume someone has already come up with it. I call it meta gradient descent because it adjusts the learning rate as it goes. It is the same as normal gradient descent except for one condition at the end of the loop: if the algorithm passes a local minimum/maximum it decreases the learning rate, else it increases the learning rate: if (slope < 0 && prevSlope > 0) || (slope > 0 && prevSlope < 0) { // algorithm passed the function min/max learningRate /= 2 } else { learningRate *= 1.05 } This helps when the algo has to minimize/maximize functions with massively different scales where a single constant learning rate would be detrimental. I also suspect that it is faster at honing in on the min/max in general, but I have not done any studies on it. Which is why I am here asking... Is anyone already familiar with this algorithm? Does it have a name? What are the pros/cons of using it instead of regular gradient descent? I am always using this algorithm instead of regular gradient descent and I have no problems at all, but my programs are very simple machine learning wise so I suspect I am missing something. I love u <3 submitted by /u/iFARTONMEN [link] [comments]  ( 2 min )
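    A runnable rendering of the rule described above (my sketch, for 1-D minimization, with the sign-change test applied before each step rather than after):

```python
# Gradient descent with the adaptive learning rate described in the post.
def meta_gd(grad, x0, lr=0.1, steps=100):
    x, prev_slope = x0, None
    for _ in range(steps):
        slope = grad(x)
        if prev_slope is not None and slope * prev_slope < 0:
            lr /= 2      # gradient sign flipped: we passed the minimum
        else:
            lr *= 1.05   # still heading the same way: speed up
        x -= lr * slope
        prev_slope = slope
    return x

# Minimize f(x) = (x - 3)^2, gradient 2(x - 3); converges to ~3 even from far away.
print(meta_gd(lambda x: 2 * (x - 3), x0=-50.0))
```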
    [D] Two quick questions about CNNs
    My goal is to detect an object in an image. The image is 960x720 and has 3 color channels. The object can be as small as 20 pixels. Does it make any sense to provide a learning model with additional channels that can be derived from the original image itself? For example let's say I provide the 960x720x3 image, but add an additional channel that's just the negative, or grayscale of the same image? I presume not, because the grayscale image would just be a linear combination of the 3 color channels, right? Or perhaps there's some significant computational advantage to that? Does it make sense to cut the image up into a bunch of (50%) overlapping squares and then feed these smaller images to my CNN? It's unlikely that the object will ever be larger than one of these squares, so that could make sense, right? It should be computationally much more efficient, because the CNN compute would be much lower, even despite the overhead of processing more (but smaller) squares. Thank you submitted by /u/tmuxed [link] [comments]  ( 1 min )
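    For question 2, the overlapping-tile split is nearly a one-liner with torch.Tensor.unfold: 50% overlap means a stride of half the tile size. The tile size of 120 below is an arbitrary illustrative choice.

```python
# Cut a 960x720 image into overlapping square tiles for a CNN.
import torch

img = torch.randn(3, 720, 960)   # C, H, W
tile, stride = 120, 60           # stride = tile // 2 gives 50% overlap

patches = img.unfold(1, tile, stride).unfold(2, tile, stride)
# shape: (3, n_rows, n_cols, tile, tile)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, tile, tile)
print(patches.shape)             # (165, 3, 120, 120): a CNN-ready batch of tiles
```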
    [P] Using OpenAI's CLIP repository as a support, I was able to create a software to detect anything in an image at its original resolution!
    submitted by /u/blevlabs [link] [comments]  ( 1 min )
    [D] What do you do when you are stuck on an ML problem?
    If you work on an ML problem that you know is solvable, but you are not able to solve it - what do you do? submitted by /u/keremidk0 [link] [comments]  ( 1 min )
    [D] Categorical sequence prediction for maximum positive impact
    Dataset 1: Contact records (call, email, face-to-face meet) with timestamp between doctors and medical representatives, based on brands and country and some other details. Contains around 60000 records. Dataset 2: Overall impact for each individual medical representative (continuous variable, also has negative values). Contains around 160 records; has a unique record for each medical rep. From these two datasets, I want to predict the next category in a contact sequence with the time gap between them (i.e., if the 1st contact method is call or email, what should the optimal second method be, and after how long should it be done?) so the medical representative has maximum positive impact. One rep can contact multiple doctors n number of times with n different methods. How should I frame this problem? What type of modeling can help me achieve higher than 90% accuracy? Thanks for the help in advance. submitted by /u/Aneervan [link] [comments]  ( 1 min )
    [R] Objective measurement of political bias using machine learning
    Sociologists frequently attempt to compare the degree of bias among left and right-wing voters. In a typical study setting, participants are asked a number of factual knowledge questions and the responses are compared to the correct answers. Then, the side whose partisans make the largest errors is declared to be the most biased. Unfortunately, this approach is extremely vulnerable to biases of the researchers themselves. Consciously or subconsciously, the researchers are likely to select questions whose answers favor their own side. For example, on the topic of Obama’s economic policies, a left-leaning researcher might prefer questions, such as “Did the unemployment rate decline under the Obama administration?” (correct answer: “yes”). In contrast, a right-leaning researcher working on the same topic, is more likely to ask “Did the adult employment rate rise under the Obama administration?” (correct answer: “no”). In both cases the study participants whose biases align with the biases of the researchers would be much more likely to answer the questions correctly. To address this problem, we built an SVD-based algorithm that measures partisan bias in question selection and corrects testing results based on the measurements. The algorithm’s accuracy improves with increasing data size, so if you have a few minutes to spare, please help our project by taking any of these tests: US: General politics US: Economic policies Environmental policies (Note: Participants are not expected to know the correct answers to all of the questions, but to use their intuition to give their best guess.) submitted by /u/omniverse71 [link] [comments]  ( 2 min )
    [R] It’s wild to see an AI literally eyeballing raytracing based on 100 photos to create a 3d scene you can step inside ☀️ Low key getting addicted to NeRF-ing imagery datasets🤩
    submitted by /u/imaginfinity [link] [comments]  ( 2 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]
    [D] Connecting musical and visual latent spaces in a "harmonic way"
    Hello everyone. I am currently working on an interdisciplinary project regarding the analysis and visualization of music. My initial idea was to visualize music in a "cool way" (psychedelic animations, etc.) using ML. The visuals should somehow represent the structure of the music, e.g. melody, beat, etc. My basic idea is to have an encoder, which brings the music into a latent representation, and a generative decoder that produces images out of the latent code at a given time. Now, there isn't much paired musical-visual training data available, so my thinking was to somehow connect a pre-trained music encoder with a visual decoder. Also, connecting two modalities like this in an unsupervised way seems like a really interesting research area. My problem is that I don't really know how to achieve this. I would somehow need to connect the two latent spaces with each other in a "natural" and "harmonic" way. I was thinking about stuff like finding the least required "complexity" to go from one distribution to another and I have indeed found some papers regarding that. There is also stuff like CycleGAN, which trains this mapping in an end-to-end way. I just wanted to ask about this here, because honestly it seems like I could go in various directions with this and it is a bit overwhelming. I'd appreciate any kind of help, ideas, recommendations, discussions and any kind of interest. Thanks! Btw, I have already successfully trained a music encoder in a self-supervised fashion with a relatively large amount of data (~2000 hours) submitted by /u/Tomsen1410 [link] [comments]  ( 2 min )
    [R] GatedTabTransformer: State-of-the-art for tabular classification
    Check out: https://arxiv.org/abs/2201.00199 Code: https://github.com/radi-cho/GatedTabTransformer Abstract: Some of the most common machine learning pipelines involve manipulation of tabular data. The current state-of-the-art solution for tabular modeling is the TabTransformer by Amazon from 2020. It incorporates a Transformer block to track relationships between categorical features and makes use of a standard multilayer perceptron to output its final logits. We propose modifications outperforming it on binary classification tasks for three benchmark datasets with more than 1% AUROC gains. We process categorical embeddings with an attention mechanism and then concatenate them with continuous values to be fed through multiple layers of gated MLP - a neural network originally introduced for language tasks. submitted by /u/radi-cho [link] [comments]  ( 1 min )
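    For intuition, a gated MLP block in the spirit of gMLP looks roughly like the PyTorch sketch below: the hidden activations are split in half, one half is mixed across the token dimension and used to gate the other elementwise. This is a minimal rendering of the idea, not the authors' implementation (see the linked repo for that).

```python
# Minimal gated-MLP-style block (sketch).
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    def __init__(self, dim, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.proj = nn.Linear(seq_len, seq_len)   # mixes across tokens

    def forward(self, x):                          # x: (batch, seq, dim)
        u, v = x.chunk(2, dim=-1)
        v = self.proj(self.norm(v).transpose(1, 2)).transpose(1, 2)
        return u * v                               # elementwise gate

class GatedMLPBlock(nn.Module):
    def __init__(self, dim, seq_len, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, hidden), nn.GELU(),
            SpatialGatingUnit(hidden, seq_len), nn.Linear(hidden // 2, dim),
        )

    def forward(self, x):
        return x + self.net(x)                     # residual connection

x = torch.randn(8, 16, 64)                         # batch, tokens, embedding dim
print(GatedMLPBlock(dim=64, seq_len=16)(x).shape)  # torch.Size([8, 16, 64])
```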
    [N][D][R]Alleged plagiarism of a paper in EMNLP2020 by the best paper in NAACL2021
    Hi, everyone. I found that a paper in EMNLP2020 is plagiarized by the best paper in NAACL2021. EMNLP2020: Visually Grounded Compound PCFGs http://aclanthology.lst.uni-saarland.de/2020.emnlp-main.354.pdf NAACL2021: Video-aided Unsupervised Grammar Induction https://aclanthology.org/2021.naacl-main.119.pdf Almost the same model with different input [side-by-side screenshots: EMNLP2020 vs. NAACL2021]. The same formulas and contents [screenshots]. Similar experiments and claims [screenshots]. The same core component in implementation: the public code in NAACL2021, vpcfg, is copied from EMNLP2020. Generally speaking, the paper in NAACL2021 shares the same method and task as the paper in EMNLP2020; the only difference is the input, i.e., text+image (EMNLP2020) vs. video (NAACL2021). Amazing. submitted by /u/NiM-HLT [link] [comments]  ( 2 min )

  • Open

    Inverse tetrahedral numbers
    The previous post looked at the tetrahedral numbers: 1, 4, 10, 20, 35, … We could invert the process of creating tetrahedral numbers and ask for what n a given number is the nth tetrahedral number. So the inverse of 1 is 1, the inverse of 4 is 2, the inverse of 10 is 3, etc. […] Inverse tetrahedral numbers first appeared on John D. Cook.  ( 2 min )
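    Since $T(n) = n(n+1)(n+2)/6 \approx n^3/6$, one brute-force way to invert it (a numerical sketch, not necessarily the post's derivation) is to start from a cube-root guess and verify nearby candidates:

```python
# Invert the tetrahedral numbers T(n) = n(n+1)(n+2)/6.
def inverse_tetrahedral(t):
    n = round((6 * t) ** (1 / 3)) - 1          # T(n) ~ n^3 / 6, so n ~ (6t)^(1/3)
    for cand in (n - 1, n, n + 1, n + 2):      # check neighbors of the guess
        if cand > 0 and cand * (cand + 1) * (cand + 2) // 6 == t:
            return cand
    return None                                 # t is not a tetrahedral number

print([inverse_tetrahedral(t) for t in (1, 4, 10, 20, 35)])   # [1, 2, 3, 4, 5]
```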
    General tetrahedral numbers
    Start with a list of ones: 1, 1, 1, 1, 1, … Taking the partial sums of this sequence gives consecutive numbers. That is, the nth number of the new series is the sum of the first n terms of the previous series. 1, 2, 3, 4, 5, … If we take partial sums again, […] General tetrahedral numbers first appeared on John D. Cook.  ( 1 min )
  • Open

    Nvidia AI Supercomputer Creates 100,000 Brain Images | Nvidia Omniverse Simulates Nuclear Fusion Reactor | AI Speeds Up Stroke Diagnosis & Treatment | Robot Hand To Automate Apple Picking
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    My neural network from scratch is finally doing something :)
    submitted by /u/-i-hate-this-place- [link] [comments]  ( 1 min )
  • Open

    Prism: Refract or Disperse a Beam of Light - [4K 60 FPS] Neural-Art Visualization
    submitted by /u/MLInsights [link] [comments]
    Amazon AI Researchers Propose A New Model, Called RescoreBERT, That Trains A BERT Rescoring Model With Discriminative Objective Functions And Improves ASR Rescoring
    👉 While BERT trained with MLM distillation can improve WER by 3%-6% relative to an LSTM, RescoreBERT, trained with a discriminative objective, can improve it by 7%-13% on the same test sets. The key component of the RescoreBERT model is a technique called rescoring: a second-pass language model re-ranks the first pass's hypotheses, which lets even a model trained from scratch on a small quantity of data prioritize and accurately rerank hypotheses containing rare words. Amazon's prior work on lowering the computational expense of computing PLL scores has also been integrated. This is accomplished by feeding the output of the BERT model through a neural network trained to mimic the PLL scores awarded by a larger "teacher" model. Because the distilled model is trained to match the teacher's predictions of masked inputs, this process is known as MLM (masked language model) distillation. The distilled model's output is interpolated with the original score to obtain a final score. This method minimizes latency by distilling PLL scores from a large BERT model into a much smaller one. Continue reading | Check out the paper submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
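    The interpolation step described above is simple to sketch; the numbers below are made up, and in practice the weight would be tuned on development data:

        # per-hypothesis scores from an n-best list (lower = better, e.g. negative log-likelihoods)
        first_pass = [12.3, 12.9, 13.1]   # hypothetical first-pass ASR scores
        rescorer   = [4.1, 3.2, 5.0]      # hypothetical distilled-BERT rescoring scores
        lam = 0.5                         # interpolation weight (assumed, tuned on dev data)

        final = [f + lam * r for f, r in zip(first_pass, rescorer)]
        best = min(range(len(final)), key=final.__getitem__)  # re-ranked top hypothesis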
    What does a Tesla car see? Tesla Autopilot Explained in 10 Minutes
    submitted by /u/OnlyProggingForFun [link] [comments]
    What are cool things I can do with AI?
    What could be funny when I mix it with AI? submitted by /u/xXNOdrugsForMEXx [link] [comments]
    6 Best Artificial Intelligence Courses for Healthcare You Should Learn in 2022
    submitted by /u/Lakshmireddys [link] [comments]
    Seastar: Creature of The Sea (V1: W) - [4K 60 FPS] AI Generative Art Experiment
    submitted by /u/MLInsights [link] [comments]
    Iterative launches machine learning engineering management (MLEM) tool - to bridge the gap between ML engineers and DevOps teams
    submitted by /u/cmstrump [link] [comments]
    Salesforce AI Research Propose ‘ALPRO’: A New Video-And-Language Representation Learning (Pre-Training) Framework
    Salesforce AI Research has proposed a new video-and-language representation learning framework called ALPRO. This framework can be used for pre-training models to achieve state-of-the-art performance on tasks such as video-text retrieval and question answering. ALPRO follows the "pre-training-then-fine-tuning" paradigm utilized in the VLP techniques described previously, but overcomes their drawbacks: the approach runs on sparsely sampled video frames and achieves more efficient cross-modal alignment without explicit object detectors. The ultimate objective of the novel strategy is to improve performance on downstream tasks such as video-text retrieval and video question answering (video QA). The enhanced pre-training technique proposed in ALPRO yields better video-language representations, which in turn improve performance on those downstream tasks. Continue reading | Check out the paper and github submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI in Australia?
    Which companies are doing leading AI in Australia? I'm a tech pm aiming to get involved with more AI/ML wherever I can, and have only found very few companies here that have anything like significant capabilities. Any pointers and tips sincerely appreciated. submitted by /u/Rufawana [link] [comments]  ( 1 min )
    Magic Battle Between Two Wizards - by GPT-3 Narrated and Visualized in [4K 60 FPS] w/ VQGAN + CLIP
    submitted by /u/MLInsights [link] [comments]
  • Open

    [R] Joint Abductive and Inductive Neural Logical Reasoning
    Paper: https://arxiv.org/abs/2205.14591 Abstract: " Neural logical reasoning (NLR) is a fundamental task in knowledge discovery and artificial intelligence. NLR aims at answering multi-hop queries with logical operations on structured knowledge bases based on distributed representations of queries and answers. While previous neural logical reasoners can give specific entity-level answers, i.e., perform inductive reasoning from the perspective of logic theory, they are not able to provide descriptive concept-level answers, i.e., perform abductive reasoning, where each concept is a summary of a set of entities. In particular, the abductive reasoning task attempts to infer the explanations of each query with descriptive concepts, which make answers comprehensible to users and is of great us…  ( 1 min )
    [D] Where do we currently stand at in lottery ticket hypothesis research?
    What is the most recent research around the lottery ticket hypothesis? Which are the best papers with new techniques for finding winning tickets, and are there open-source tools that "work"? Does anyone know of digestible, easy resources to get started with LTH? submitted by /u/sid_276 [link] [comments]  ( 1 min )
    [D] Deploying SOTA models into my own projects
    What is the most common approach to using networks developed by other researchers? Until now I have been using Hugging Face's pipeline to deploy pre-trained models. But in some other cases, the only available implementation is the GitHub repo published by the authors, where the code simply verifies that the results in the paper are reproducible, for example https://github.com/Turoad/CLRNet. The catch in this case is that there's no inference pipeline, but rather train/test functions with the benchmark dataset as input. How do you deploy their model? Do you copy the architecture layers into your own torch/tensorflow code and train it following their parameters, or do you tweak their repository code? submitted by /u/LanverYT [link] [comments]  ( 1 min )
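    For models that are published on the Hugging Face Hub, the pipeline route stays short; a sketch (the task and checkpoint are just examples):

        from transformers import pipeline

        # downloads the checkpoint and wraps pre/post-processing for you
        classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
        print(classifier("street.jpg"))  # list of {label, score} dicts for a local image

    For research repos without such a wrapper, a common middle ground is to keep the authors' model definition and checkpoint loading, but write your own thin preprocess/forward/postprocess function around them rather than rewriting the architecture.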
    [P] Information Retrieval Explainability Summer Project
    hey y'all! we're starting a summer project in collaboration between our community and some academic partners to create an open source library for explaining typical semantic search approaches (like vector search). There is also some chance that we might end up publishing some papers, depending on what kind of results we get and how far we can push it. you all are welcome to join. most of the work will be done async, but there will be weekly meetings to discuss questions and plans. there are some pre-reqs, but we are welcoming anyone who is ambitious and interested enough to contribute meaningfully. more details: https://community.ai.science/explainable-information-retrieval-xir info session: https://www.eventbrite.ca/e/explainable-information-retrieval-xir-tickets-350540755837 some starter material: https://ai.science/l/236a6202-3495-4a8e-bbad-aedeee4bd21d let me know if you have any questions, or suggestions for a project like this submitted by /u/tdls_to [link] [comments]  ( 1 min )
    [D] Imbalance: Metric to Loss functions
    In the case of class imbalance, the main suggestion seems to be to start with a clear, problem-specific metric, to make sure one is solving the correct problem. However, this means the optimizer is not affected at all and will remain subject to the class imbalance; the cost function is not adjusted. Metrics can at best affect the decision threshold, a single-parameter lever. Is this sufficient? What would be better? submitted by /u/darn321 [link] [comments]  ( 2 min )
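    If one does want the optimizer itself to feel the correction, the usual first step is to reweight the loss by inverse class frequency; a minimal PyTorch sketch with made-up class counts:

        import torch

        counts = torch.tensor([9000.0, 1000.0])          # hypothetical class counts
        weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weights
        loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

        logits = torch.randn(8, 2)
        targets = torch.randint(0, 2, (8,))
        loss = loss_fn(logits, targets)  # minority-class mistakes now cost ~9x more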
    [D] latent space: task specific?
    Usually, a latent space is just a compression of the feature space. This can be done with PCA, autoencoders, etc., and this approach is task/target independent. One can also learn task-dependent latent representations; in a vanilla fully connected NN, the last layer before the target prediction layer is probably one example. Is this a correct understanding? Is there different terminology to differentiate task-specific vs. label-free latent spaces? submitted by /u/darn321 [link] [comments]  ( 1 min )
    [D] Parameter optimisation as a language problem?
    Hello! I am thinking of an idea for research on the topic of parameter optimisation viewed as a language problem. Here is what I mean by that: there are already multiple big pre-trained language models, such as CodeBERT, which can generate good contextual embeddings for source code. So if they're used as a baseline and built upon, we can create a supervised learning pipeline that predicts code parameters which satisfy desired outcomes. For example, given the function def f(x): return 2 + 2*x - x*x we can ask the model to maximise it and find that the desired x is 1. At the beginning we expect to be able to solve such simple optimisation problems, but with time we may derive methods which are able to solve for more parameters and complicated functions and probably even have such a m…  ( 3 min )
    [Discussion] Influence of the number of classes on the performance of triplet loss
    Does the number of classes need to be huge (1000+) in order to get good performance with triplet loss? I am experimenting with a dataset which has 15 classes and 200 examples per class. I kept a few classes for the test set and trained on the rest by constructing triplets. The loss drops steadily on the train set but not so much on the test set; it overfits. I am wondering whether the reason for the bad performance is the low number of classes, which enables the model to "remember" the classes in the train set. Also, does the batch size play a big role here? Should the batch size be low so the model doesn't see all the train classes in one batch? submitted by /u/user89320 [link] [comments]  ( 1 min )
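    For reference, the objective being discussed is available off the shelf; a minimal sketch with arbitrary embedding and batch sizes:

        import torch

        loss_fn = torch.nn.TripletMarginLoss(margin=1.0)
        anchor   = torch.randn(32, 128)  # embeddings of anchor samples
        positive = torch.randn(32, 128)  # same class as the anchor
        negative = torch.randn(32, 128)  # different class
        loss = loss_fn(anchor, positive, negative)
        # with only 15 classes, using smaller batches (so one batch does not cover
        # every class) plus hard-negative mining is one commonly tried mitigation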
    [D] Universities with research in AI/ML for Music?
    Does anybody know of any universities doing research in AI for music? I know Queen Mary University of London appears to have some programs, but I haven't seen any other universities with similar initiatives, and it seems like every paper I see in that area is either from DeepMind or Magenta. submitted by /u/Redplatypus14 [link] [comments]  ( 1 min )
    [N] Stanford's Machine Learning; End of an era...
    After 10 years and nearly 5 million enrollments, Stanford will be closing new enrollments for the Machine Learning course on Coursera from June 14, 2022. It will be replaced by a more in-depth Machine Learning Specialization by Stanford Online and Deeplearning.ai and will be available in June. The most iconic MOOC to ever exist? submitted by /u/pro_user_for_good [link] [comments]  ( 2 min )
  • Open

    task oriented dialogue systems
    What different models are being used for task-oriented dialogue systems? How is end-to-end optimization different from modularized optimization? submitted by /u/Western-Age3148 [link] [comments]
    Question about Policy functions
    Suppose I have trained a reinforcement learning model, and now want to make a prediction using this model, which uses continuous control to predict two numbers, say X and Y. Does the policy function return the values of X and Y that yield the highest possible reward from the Q function given the particular observation, over all possible values of X and Y? Or do different implementations use different policy functions? I am mostly interested in CQL, IQL and CRR, for example. submitted by /u/uom_questions [link] [comments]  ( 1 min )

  • Open

    [R] Deep Learning Opacity in Scientific Discovery
    This paper argues that the uninterpretability of deep neural networks need not diminish AI's capacity to lead scientists to significant and justifiable breakthroughs. https://arxiv.org/abs/2206.00520 submitted by /u/learning_by_looking [link] [comments]
    [D] Titan V vs. RTX 3090 for deep learning (2022)
    Hi guys, I was wondering which GPU is better for DL and which features you need to look for when buying GPUs for DL purposes. I guess the Titan V might be better than the RTX 3090 because it is far more expensive (why would NVIDIA sell a worse GPU at a higher price?), but I may be wrong. submitted by /u/apssg96 [link] [comments]  ( 1 min )
    [D] We need a distributed platform to train and use huge free machine learning models.
    Free machine learning models like gpt-j-6b and gpt-neox are very small compared with Gopher and Gato. The only way we are going to get a comparable model that is free is to train it and run it in a distributed way, and for that we need a distributed platform that makes it easier to train and run these models in a distributed fashion. submitted by /u/ConsistentSense4760 [link] [comments]  ( 1 min )
    [D] How to combine Linear Regression & NLP result into one score?
    Hi guys, sorry if this is a daft question, I'm still new to this. Say I wanted to analyse an email for legitimacy. One component would be using the metadata (e.g. the 'from' address, sender name, and several other header fields) as features for linear regression (scam or no scam). A second, important part is parsing the body text, checking for spelling errors and pushy scammer sentiment. How best would I go about combining these analyses, with the overall end goal of classifying the email as 'scam' or 'no scam'? Cheers! submitted by /u/eddiewastaken [link] [comments]  ( 1 min )
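    One standard answer is stacking: treat each sub-model's score as a feature and train a small meta-classifier on top. A sketch with made-up arrays, where p_meta and p_text stand in for the metadata model's and the text model's scam probabilities:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        p_meta = rng.random(200)   # stand-in: metadata model's P(scam)
        p_text = rng.random(200)   # stand-in: NLP/body-text model's P(scam)
        y = ((0.5 * p_meta + 0.5 * p_text + 0.1 * rng.standard_normal(200)) > 0.5).astype(int)

        X = np.column_stack([p_meta, p_text])
        meta_clf = LogisticRegression().fit(X, y)  # learns how to weigh the two signals
        print(meta_clf.predict_proba(X[:3]))

    To avoid leakage, the meta-classifier should be fit on scores the sub-models produced for held-out data, not on their training-set outputs.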
    [D] Research papers/suggestions for x-ray defect analysis
    I have a LOT (several million) of unlabeled x-ray images of aluminium parts with small defects in them, like air bubbles, cracks, or voids. These are thousands of images which are basically identical except for the defects (apart from slight variations in position, rotation, noise, etc.), across thousands of different view directions/parts. All image dimensions are identical. The problem is that I do not have a single image with labels and would hence have to label them manually, which I'd like to avoid as much as possible. I want to highlight/segment all the positions where defects occur, i.e. a segmentation into defect or no defect; a kind of probability map on the image. What kind of method would be best to do this without labels? I was thinking that the biggest entropy/change in a certain part/view direction would be the defect, since it will be in a random spot while everything else is basically identical. So I was thinking of using an autoencoder for a specific view and using the points of highest reconstruction error as a starting point for labeling. But this won't work universally. Maybe a contrastive learning method? submitted by /u/bluuerp [link] [comments]  ( 1 min )
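    The autoencoder idea in the post can be prototyped quickly; a sketch under the stated assumptions (a model already trained on mostly defect-free images of one view direction; the 3-sigma threshold is just one heuristic):

        import torch

        def defect_map(autoencoder, img):          # img: (1, 1, H, W), grayscale
            """Per-pixel reconstruction error as a crude defect probability map."""
            with torch.no_grad():
                recon = autoencoder(img)
            err = (img - recon).abs().squeeze()    # high where the image deviates
            thresh = err.mean() + 3 * err.std()    # flag statistically unusual pixels
            return err, err > thresh               # error map + binary defect mask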
    [P] This is the worst AI ever. (GPT-4chan model, trained on 3.5 years worth of /pol/ posts)
    https://youtu.be/efPrtcLdcdM GPT-4chan was trained on over 3 years of posts from 4chan's "politically incorrect" (/pol/) board. Website (try the model here): https://gpt-4chan.com Model: https://huggingface.co/ykilcher/gpt-4chan Code: https://github.com/yk/gpt-4chan-public Dataset: https://zenodo.org/record/3606810#.YpjGgexByDU OUTLINE: 0:00 - Intro 0:30 - Disclaimers 1:20 - Elon, Twitter, and the Seychelles 4:10 - How I trained a language model on 4chan posts 6:30 - How good is this model? 8:55 - Building a 4chan bot 11:00 - Something strange is happening 13:20 - How the bot got unmasked 15:15 - Here we go again 18:00 - Final thoughts submitted by /u/ykilcher [link] [comments]  ( 3 min )
    [D] Can we talk about how Jeff Dean casually spent a graduate student's annual take-home salary for 0.03% improvement on Cifar-10
    I just read https://www.reddit.com/r/MachineLearning/comments/uyratt/d_i_dont_really_trust_papers_out_of_top_labs/ and I am disappointed with the lack of pushback by researchers and students across the field against the encroachment of industry in research, and I would like to double down on the "lack of trust" point raised by u/MrAcurite. u/MrAcurite points out two things: One, the big number they cite as the success metric is 99.43 on CIFAR-10, against a SotA of 99.40, so woop-de-fucking-doo in the grand scheme of things. The sum total is 17,810 core-hours. Let's assume that for someone who doesn't work at Google, you'd have to use on-demand pricing of $3.22/hr. This means that these trained models cost $57,348. And one of the authors replied in a comment: I would also contend that …  ( 4 min )
    [D] class imbalance: over/under sampling and class reweight
    If the dataset is unbalanced, what's the way to proceed? The canonical answer seems to be over/under-sampling and class reweighting (is there anything more?), but have these things really worked in practice for you? What's your actual experience and practical suggestion? When should one be used over the other? submitted by /u/darn321 [link] [comments]  ( 3 min )
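    For concreteness, the resampling half of the canonical answer is a two-liner with imbalanced-learn (a sketch on synthetic data; whether it actually helps is exactly the empirical question being asked):

        from sklearn.datasets import make_classification
        from imblearn.over_sampling import RandomOverSampler

        X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
        X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
        # minority-class rows are duplicated until the classes balance out;
        # RandomUnderSampler (imblearn.under_sampling) drops majority rows instead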
    [D] Good communities/newsletters/mailing lists/twitter accs for ML in healthcare/medical applications?
    Hi. ​ I'm looking for communities/mailing lists/newsletters or even good Twitter accounts that are dedicated about the topics of ML in medical applications and healthcare. Any recommendations? In most other applications there seems to be a ton of good options for communities etc, but I honestly could not find one for these topics. submitted by /u/feryet [link] [comments]  ( 1 min )
    [P] Hands on diffusion models
    A minimal example of the forward and reverse flow of diffusion models, with equations from the paper and visualizations alongside the code: https://github.com/InFoCusp/diffusion_models I coded it up since I wanted to familiarize myself with the end-to-end flow. It uses a simple 2D dataset that can train within minutes. Hope others on this subreddit find it useful. submitted by /u/optimistdit [link] [comments]  ( 1 min )
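    The forward flow such repos visualize has a convenient closed form; a minimal sketch of sampling x_t directly from x_0 under the common linear-beta DDPM schedule (hyperparameters are the usual textbook choices, not necessarily the repo's):

        import torch

        T = 1000
        betas = torch.linspace(1e-4, 0.02, T)
        alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative noise schedule

        def q_sample(x0, t):
            # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
            eps = torch.randn_like(x0)
            return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps

        x0 = torch.randn(64, 2)      # a toy 2D point cloud, as in the linked repo
        x_mid = q_sample(x0, 500)    # halfway through the forward diffusion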
    [D] Should non-replicable, costly, exa-scale machine learning models coming out of industry be seen as a products rather than research?
    View Poll submitted by /u/fromnighttilldawn [link] [comments]
  • Open

    Text to Image Question: Different images?
    Hi all, I've been experimenting with several text-to-image programs. If the prompt/seed is the same, and there isn't a "deviantart" or "trending on Artstation" modifier, shouldn't the image be the same? For example, I ran this prompt twice on NightCafe and got two different images: "Mountain Lake with birds in the style of John Howe by Greg Rutkowski, Ted Nasmith, Daarken, Caspar David Friedrich, Louis Comfort Tiffany, John Stephens, Ivan Shishkin, and Albert Bierstadt" - weight: 1 "strange landscape detailed 8k resolution concept art digital illustration beautiful" - weight: 1 "noisy, dirty, unclear, Watermark, blur, blurry, bokeh, unbalanced undeveloped high contrast, soft edges" - weight: -1 submitted by /u/longtailwriting [link] [comments]  ( 1 min )
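    Short answer: most text-to-image systems start each run from a freshly drawn random latent, so an identical prompt gives a different image unless the seed (and every sampler setting) is pinned, and hosted services typically draw a new seed per run. A sketch of the idea in PyTorch terms, with illustrative shapes:

        import torch

        gen = torch.Generator().manual_seed(1234)           # pin the seed...
        latent = torch.randn(1, 4, 64, 64, generator=gen)   # ...so this draw repeats
        # same prompt + same starting latent + same sampler settings should
        # reproduce the image; a new draw gives a different starting point and
        # hence a different picture, even for an identical prompt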
    Introducing LIHQ - High Quality Artificial Speaker (Open source in google colab)
    submitted by /u/johnGettings [link] [comments]
    AI versus corporate logos
    submitted by /u/magenta_placenta [link] [comments]
    What skills are important if I want to be a ML software engineer?
    Currently I'm a second-year undergrad CS major. I want to pursue a job in applied artificial intelligence at a company like OpenAI or Tesla, because I realized AI/ML actually excites me, unlike the full-stack app development that has made up most of my side projects so far. So far I've trained and deployed a simple object detection neural network with Tensorflow. What other side projects/skills can I and people like me work on to be a more attractive hire when we go on the job market? submitted by /u/ValerianMoonRunner [link] [comments]  ( 1 min )
    Meet ‘Codeball’ – A Deep Learning-based Automated Code Reviewer That Will Help Maintainers Review Github Pull Requests
    Data in all forms, whether photos, music, or code, is being created on a large scale. Many GitHub code submissions require reviewers to read over the code and recommend modifications. Reviewing code requires a significant amount of time and effort on the developer's part, and Codeball fills this void. Codeball is a code review AI that approves Pull Requests that a human would have authorized. The AI detects and accepts safe contributions, allowing reviewers to focus their efforts on the difficult ones. Shorter wait times also save a significant amount of money during the evaluation process. Metadata from over a million Pull Requests across thousands of different repositories was used to train Codeball. Codeball extracts features for each PR using a proprietary derivation technique, constructing the bigger context in which the PR was filed: for example, how frequently and by whom the impacted files were modified, the semantics of the diffs, and, of course, whether the Pull Request was accepted and merged without objections or further comments. As a prediction model, Codeball uses a deep learning model that has been trained on over 1M public and private contributions from different organizations, employing a multi-layer perceptron classifier neural network. The model takes hundreds of inputs, has two hidden layers, and has a single output that assesses the chance of a Pull Request being approved. Each Pull Request yields hundreds of signals (the input layer), broadly classified into three types: basic, derived, and categorical. Continue reading | Checkout the Github action and FAQs submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
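    The classifier shape described above (hundreds of inputs, two hidden layers, one output) is easy to mock up; a sketch on fake data, where the layer sizes and features are invented rather than Codeball's actual configuration:

        from sklearn.datasets import make_classification
        from sklearn.neural_network import MLPClassifier

        # stand-in for "hundreds of signals per Pull Request"
        X, y = make_classification(n_samples=5000, n_features=300, random_state=0)
        clf = MLPClassifier(hidden_layer_sizes=(128, 64))  # two hidden layers
        clf.fit(X, y)
        print(clf.predict_proba(X[:1]))  # rough analogue of "chance the PR is approved"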
    Could an AI pull off the following nearly impossible thing?
    submitted by /u/BeginningInfluence55 [link] [comments]
    18 Mindblowing AI Art Images
    submitted by /u/kbf_ [link] [comments]
    Mythical Dragon - by GPT-3 Narrated and Visualized in [4K 60 FPS] w/ VQGAN + CLIP
    submitted by /u/MLInsights [link] [comments]
    Qualitative humanities research is crucial to AI · fast.ai
    submitted by /u/estasfuera [link] [comments]
    Researcher Says an Image Generating AI Invented Its Own Language
    submitted by /u/estasfuera [link] [comments]
    recruiting beta testers
    Based on artificial intelligence, BLOONY is a chatbot that can communicate and interact with you. BLOONY is NOT a chatbot that gives predefined answers. After running beta testing programs, we have received a lot of feedback from BUBBLES. The feedback helped us to identify service improvement points and create a better BLOONY! Through the 3rd beta testing program, we hope to find issues that were not detected and receive your opinions again. We recommend BLOONY to those: - who want to study English by chatting with a friend - who have concerns that are difficult to tell their family and friends - who want to talk to someone early in the morning or late at night - who are looking for something new. Anyone who has a Facebook account and a smartphone can apply! Sign up now (link in bio -> click ADD TO FRIENDS LIST). Details - Number of Testers Needed: limited to the first 000 applicants, recruiting testers on a first-come first-served basis (FCFS) - Who You Are: anyone who is at least 14 years old and has a Facebook account and the Facebook Messenger mobile app - Required Activity: use of the beta service and submission of feedback - How to Apply: submission of a Google form. Notes - BLOONY can deliver information that differs from the facts. - Once you are invited as a beta tester, our team will reach out to you individually. submitted by /u/Necessary-Narwhal-57 [link] [comments]  ( 1 min )
    Microsoft and AWS Collaborate To Develop ‘PyWhy’: A New Github Home For ‘DoWhy’ (A Causal Machine Learning Library From Microsoft)
    As computing systems become more actively involved in societally essential areas such as healthcare, education, and government, it is crucial to accurately forecast and comprehend the causal repercussions of these interventions. Traditional machine learning algorithms based on pattern recognition and correlational analyses are insufficient for decision-making without an A/B test. To fill this gap, Microsoft researchers created a platform that executes the process of causal inference analysis from start to finish, to assist data scientists in better understanding and applying causal inference; they developed DoWhy in 2018. Since then, the library has been doing precisely that, cultivating a community committed to applying causal inference principles in data science. DoWhy is a Python package that attempts to encourage causal thinking and analysis, much as machine learning libraries have done for prediction. DoWhy provides a four-step interface for causal inference that focuses on modeling and confirming causal assumptions as explicitly as feasible. Traditional machine learning approaches aim to anticipate an outcome. Consider a public utility business that wants to minimize its customers' water use through a marketing and incentives campaign. The success of a rewards program is difficult to assess, since any drop in water consumption by participating consumers is confounded by their decision to engage in the program. Continue reading | Research Articles from Microsoft and Amazon submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
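    DoWhy's four-step interface (model, identify, estimate, refute) looks roughly like this on one of its bundled synthetic datasets; a sketch using method names the library documents, though the PyWhy repo should be checked for current APIs:

        import dowhy.datasets
        from dowhy import CausalModel

        data = dowhy.datasets.linear_dataset(beta=10, num_common_causes=3,
                                             num_samples=1000, treatment_is_binary=True)

        model = CausalModel(data=data["df"], treatment=data["treatment_name"],
                            outcome=data["outcome_name"], graph=data["gml_graph"])  # 1. model
        estimand = model.identify_effect()                                           # 2. identify
        estimate = model.estimate_effect(estimand,
                                         method_name="backdoor.linear_regression")  # 3. estimate
        refutation = model.refute_estimate(estimand, estimate,
                                           method_name="random_common_cause")       # 4. refute
        print(estimate.value, refutation)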
    With so many new Text to Image "AI" emerging lately, is it not crazy to speculate about Text to Video?
    I imagine it would be a feat much more challenging than text-to-image; however, with a massive enough dataset and enough training, it surely would be possible one day in the future. The most I have seen at this point has been morphing VQGAN images and those DALL-E 2 slideshow-style animations. Has anyone seen anything interesting pop up in this area? I ask out of general curiosity, so excuse me if I have missed some glaring neural network. submitted by /u/BeginningRealistic49 [link] [comments]  ( 2 min )
    Chillies.
    submitted by /u/cookingandcraft [link] [comments]
    7 Best Natural Language Processing Courses (2022) | Best NLP Courses
    submitted by /u/Lakshmireddys [link] [comments]
    Origami Dragon - by GPT-3 Narrated and Visualized in [4K 60 FPS] w/ VQGAN + CLIP
    submitted by /u/MLInsights [link] [comments]
  • Open

    What is the best way for my agent to extract a query from the hidden states of its recurrent policy?
    I have an RL agent that has a recurrent policy. I would like the hidden states of the LSTM at timestep t to give the query for timestep t+1. I was thinking that through the step function, the agent could send the action and the hidden states of the LSTM to the environment, which could then at the next timestep send back the observation as well as the hidden states of the previous timestep. Would this make sense or is there a better way to do this? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
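    A sketch of one way to wire this up (names and sizes are hypothetical): keep the LSTM state inside the policy and expose h_t, so it can serve as the query for the next timestep.

        import torch
        import torch.nn as nn

        class RecurrentPolicy(nn.Module):
            def __init__(self, obs_dim=16, act_dim=4, hidden=128):
                super().__init__()
                self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
                self.head = nn.Linear(hidden, act_dim)
                self.state = None                     # (h, c), carried across timesteps

            def step(self, obs):                      # obs: (obs_dim,)
                out, self.state = self.lstm(obs.view(1, 1, -1), self.state)
                h_t = self.state[0].squeeze()         # hidden state = query for t+1
                return self.head(out.squeeze()), h_t  # action logits + query

    Note that with the state held inside the policy, the environment never needs to see h_t at all; routing it through env.step only makes sense if the environment itself must react to the query.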
    "You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments", Paster et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    How do transformers or very deep models "plan" ahead?
    I was watching this amazing lecture by Oriol Vinyals. On one slide, there is a question asking whether very deep models plan. Transformer models, or models employed in applications like dialogue generation, do not have a planning component but behave as if they already have the dialogue planned. Dr. Vinyals mentioned that there are papers on "how transformers are building up knowledge to answer questions or do all sorts of very interesting analyses". Can anyone please point me to a few such works? submitted by /u/yoctotoyotta [link] [comments]  ( 4 min )
    Question about SARSA and Q-Learning
    Hey everyone! I just finished coding my homework assignment and tried to run SARSA and Q-learning on the Cliff problem (my university implemented this itself, not the OpenAI gym environment). I was kind of surprised to see the resulting Q(s,a) functions from the SARSA algorithm. While Q-learning quickly converged to the solution, which makes sense to me intuitively, SARSA does lead to some, at least to me currently, weird results. Sometimes, especially when using a small epsilon, the resulting values in the Q-table "seem" wrong to me. Q-learning always comes up with a Q function showing, for each state and action, the minimum number of steps it takes to reach the goal (for actions that lead to falling down the cliff, this number is the same as from the initial state, -100). Is this just an artifact of the number of episodes, and, if run infinitely often, would it lead to the same results? Or is this due to the max operator in Q-learning, leading to more "stable" results? submitted by /u/Garbage-Shoddy [link] [comments]  ( 1 min )
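    The asymmetry observed here follows directly from the two update rules: SARSA bootstraps from the action actually taken by the epsilon-greedy behavior policy (so cells near the cliff inherit the occasional -100 falls), while Q-learning bootstraps from the greedy action. A sketch of the one-line difference, with Q as a state-by-action table and s2/a2 the next state and next chosen action:

        import numpy as np

        n_states, n_actions, alpha, gamma = 48, 4, 0.1, 1.0
        Q = np.zeros((n_states, n_actions))

        def sarsa_update(s, a, r, s2, a2):
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])    # on-policy target

        def q_learning_update(s, a, r, s2):
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])  # greedy target

    So the SARSA values are not wrong; they evaluate the epsilon-greedy policy rather than the greedy one, and as epsilon is annealed toward zero the two tables coincide in the limit.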
    Requesting Help Using GNN With PyTorch Geometric from NetworkX. Sample data provided to recreate error.
    Hello all, I'm trying to get a system going for a GNN where an agent will move along a network of nodes and edges. At each state the agent has traveled to a new node, and the total distance traveled goes up. My problem is in getting the network loaded in from a networkx graph. Here's some code that reproduces the error: from torch_geometric.utils import from_networkx import networkx as nx nodes = [ (0, {'y': 37.3348363, 'x': -121.888113}), (1, {'y': 37.3353111, 'x': -121.887118}), (2, {'y': 37.3358288, 'x': -121.8860567}), ] edges = [ (0, 1, {'osmid': 358475012, 'oneway': False, 'highway': 'residential', 'length': 72.482, 'geometry': '', 'speed_kph': 25.0, 'bearing': 149.1}), (0, 2, {'osmid': [416909272, 680787590], 'oneway': False…  ( 2 min )
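    One likely culprit in the sample above: from_networkx tries to collect each attribute across all edges into a tensor, and here 'osmid' is an int on one edge but a list on another, while 'geometry' and 'highway' are non-numeric. A hedged sketch of cleaning the attributes before conversion, assuming the full nodes/edges lists from the post:

        import networkx as nx
        from torch_geometric.utils import from_networkx

        G = nx.Graph()
        G.add_nodes_from(nodes)   # the (id, attrs) tuples defined in the post
        G.add_edges_from(edges)   # the (u, v, attrs) tuples defined in the post

        for _, _, d in G.edges(data=True):
            if isinstance(d.get("osmid"), list):   # unify mixed int/list values
                d["osmid"] = d["osmid"][0]
            d.pop("geometry", None)                # drop non-numeric attributes
            d.pop("highway", None)

        data = from_networkx(G, group_node_attrs=["x", "y"],
                             group_edge_attrs=["length", "speed_kph"])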
    "SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", Ahn et al 2022 {G} (language models powering robots)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    Train machine learning models using Amazon Keyspaces as a data source
    Many applications meant for industrial equipment maintenance, trade monitoring, fleet management, and route optimization are built using open-source Cassandra APIs and drivers to process data at high speeds and low latency. Managing Cassandra tables yourself can be time-consuming and expensive. Amazon Keyspaces (for Apache Cassandra) lets you set up, secure, and scale Cassandra tables […]  ( 10 min )
    Improve organizational diversity, equity, and inclusion initiatives with Amazon Polly
    Organizational diversity, equity and inclusion (DEI) initiatives are at the forefront of companies across the globe. By constructing inclusive spaces with individuals from diverse backgrounds and experiences, businesses can better represent our mutual societal needs and deliver on objectives. In the article How Diversity Can Drive Innovation, Harvard Business Review states that companies that focus […]  ( 6 min )
    Use Serverless Inference to reduce testing costs in your MLOps pipelines
    Amazon SageMaker Serverless Inference is an inference option that enables you to easily deploy machine learning (ML) models for inference without having to configure or manage the underlying infrastructure. SageMaker Serverless Inference is ideal for applications with intermittent or unpredictable traffic. In this post, you’ll see how to use SageMaker Serverless Inference to reduce cost when […]  ( 5 min )
    Accelerate and improve recommender system training and predictions using Amazon SageMaker Feature Store
    Many companies must tackle the difficult use case of building a highly optimized recommender system. The challenge comes from processing large volumes of data to train and tune the model daily with new data and then make predictions based on user behavior during an active engagement. In this post, we show you how to use […]  ( 15 min )
    Translate, redact and analyze streaming data using SQL functions with Amazon Kinesis Data Analytics, Amazon Translate, and Amazon Comprehend
    You may have applications that generate streaming data that is full of records containing customer case notes, product reviews, and social media messages, in many languages. Your task is to identify the products that people are talking about, determine if they’re expressing positive or negative sentiment, translate their comments into a common language, and create […]  ( 15 min )
  • Open

    Visualizing C operator precedence
    Here’s an idea for visualizing C operator precedence. You snake your way through the diagram starting from left to right. Operators at the same precedence level are on the same horizontal level. Following the arrows for changing directions, you move from left-to-right through the operators that associate left-to-right and you move right-to-left through the operators […] Visualizing C operator precedence first appeared on John D. Cook.  ( 1 min )
  • Open

    AI versus corporate logos
    I recently started playing with DALL-E 2, which will attempt to generate an image to go with whatever text prompt you give it. Like its predecessor DALL-E, it uses CLIP, which OpenAI trained on a huge collection of internet images and nearby text. I've experimented with a few  ( 2 min )
    Bonus: More AI-generated logos
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    Shopee — Price Match Guarantee
    Shopee is the leading e-commerce platform in Southeast Asia and Taiwan. Customers appreciate its easy, secure, and fast online shopping…  ( 7 min )
  • Open

    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/Lakshmireddys [link] [comments]
  • Open

    How Infinitely Wide Neural Networks Benefit from Multi-task Learning -- an Exact Macroscopic Characterization. (arXiv:2112.15577v3 [cs.LG] UPDATED)
    In practice, multi-task learning (through learning features shared among tasks) is an essential property of deep neural networks (NNs). While infinite-width limits of NNs can provide a good intuition for their generalization behavior, the well-known infinite-width limits of NNs in the literature (e.g., neural tangent kernels) assume specific settings in which wide ReLU-NNs behave like shallow Gaussian Processes with a fixed kernel. Consequently, in such settings, these NNs lose their ability to benefit from multi-task learning in the infinite-width limit. In contrast, we prove that optimizing wide ReLU neural networks with at least one hidden layer using L2-regularization on the parameters enforces multi-task learning due to representation-learning - also in the limiting regime where the network width tends to infinity. We present an exact quantitative characterization of this infinite width limit in an appropriate function space that neatly describes multi-task learning.  ( 2 min )
    Masked Bayesian Neural Networks : Computation and Optimality. (arXiv:2206.00853v1 [stat.ML])
    As data size and computing power increase, the architectures of deep neural networks (DNNs) have been getting more complex and huge, and thus there is a growing need to simplify such complex and huge DNNs. In this paper, we propose a novel sparse Bayesian neural network (BNN) which searches a good DNN with an appropriate complexity. We employ the masking variables at each node which can turn off some nodes according to the posterior distribution to yield a nodewise sparse DNN. We devise a prior distribution such that the posterior distribution has theoretical optimalities (i.e. minimax optimality and adaptiveness), and develop an efficient MCMC algorithm. By analyzing several benchmark datasets, we illustrate that the proposed BNN performs well compared to other existing methods in the sense that it discovers well condensed DNN architectures with similar prediction accuracy and uncertainty quantification compared to large DNNs.
    Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization. (arXiv:2206.00743v1 [math.OC])
    Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability -- requiring no a priori knowledge about problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide such an example proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balance the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda can achieve the near-optimal $\tilde{O}(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-4})$ gradient complexities respectively in the deterministic and stochastic settings, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
    A Fair Comparison of Two Popular Flat Minima Optimizers: Stochastic Weight Averaging vs. Sharpness-Aware Minimization. (arXiv:2202.00661v3 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low loss neighborhoods, have been shown to improve upon stochastic and adaptive gradient-based optimizers for training neural networks. Two methods have received significant attention due to their impressive generalization performance and scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them. Previous work mainly evaluated SWA and SAM on different architectures and datasets. We fill this gap here by comparing the loss surfaces of the models trained with each method and through a broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover a number of surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.
    Retrospective Approximation for Smooth Stochastic Optimization. (arXiv:2103.04392v2 [math.OC] UPDATED)
    Stochastic Gradient (SG) is the defacto iterative technique to solve stochastic optimization (SO) problems with a smooth (non-convex) objective $f$ and a stochastic first-order oracle. SG's attractiveness is due in part to its simplicity of executing a single step along the negative subsampled gradient direction to update the incumbent iterate. In this paper, we question SG's choice of executing a single step as opposed to multiple steps between subsample updates. Our investigation leads naturally to generalizing SG into Retrospective Approximation (RA) where, during each iteration, a "deterministic solver" executes possibly multiple steps on a subsampled deterministic problem and stops when further solving is deemed unnecessary from the standpoint of statistical efficiency. RA thus rigorizes what is appealing for implementation -- during each iteration, "plug in" a solver, e.g., L-BFGS line search or Newton-CG, as is, and solve only to the extent necessary. We develop a complete theory using relative error of the observed gradients as the principal object, demonstrating that almost sure and $L_1$ consistency of RA are preserved under especially weak conditions when sample sizes are increased at appropriate rates. We also characterize the iteration and oracle complexity (for linear and sub-linear solvers) of RA, and identify a practical termination criterion leading to optimal complexity rates. To subsume non-convex $f$, we present a certain "random central limit theorem" that incorporates the effect of curvature across all first-order critical points, demonstrating that the asymptotic behavior is described by a certain mixture of normals. The message from our numerical experiments is that the ability of RA to incorporate existing second-order deterministic solvers in a strategic manner might be important from the standpoint of dispensing with hyper-parameter tuning.
    ZOOpt: Toolbox for Derivative-Free Optimization. (arXiv:1801.00329v3 [cs.LG] UPDATED)
    Recent advances in derivative-free optimization allow efficient approximation of the global-optimal solutions of sophisticated functions, such as functions with many local optima, non-differentiable and non-continuous functions. This article describes the ZOOpt (Zeroth Order Optimization) toolbox that provides efficient derivative-free solvers and is designed to be easy to use. ZOOpt provides single-machine parallel optimization on the basis of a Python core, and multi-machine distributed optimization for time-consuming tasks by incorporating the Ray framework -- a famous platform for building distributed applications. ZOOpt particularly focuses on optimization problems in machine learning, addressing high-dimensional and noisy problems such as hyper-parameter tuning and direct policy search. The toolbox is maintained toward a ready-to-use tool in real-world machine learning tasks.
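    For flavor, ZOOpt's documented quick-start pattern is only a few lines; a sketch minimizing a sphere function (see the project docs for the current API):

        from zoopt import Dimension, Objective, Parameter, Opt

        def sphere(solution):                 # function to minimize
            return sum(v * v for v in solution.get_x())

        dim = Dimension(10, [[-1, 1]] * 10, [True] * 10)  # 10 continuous dims in [-1, 1]
        sol = Opt.min(Objective(sphere, dim), Parameter(budget=1000))
        print(sol.get_x(), sol.get_value())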
    Multi-source Domain Adaptation via Weighted Joint Distributions Optimal Transport. (arXiv:2006.12938v2 [cs.LG] UPDATED)
    The problem of domain adaptation on an unlabeled target dataset using knowledge from multiple labelled source datasets is becoming increasingly important. A key challenge is to design an approach that overcomes the covariate and target shift both among the sources, and between the source and target domains. In this paper, we address this problem from a new perspective: instead of looking for a latent representation invariant between source and target domains, we exploit the diversity of source distributions by tuning their weights to the target task at hand. Our method, named Weighted Joint Distribution Optimal Transport (WJDOT), aims at finding simultaneously an Optimal Transport-based alignment between the source and target distributions and a re-weighting of the sources distributions. We discuss the theoretical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves state-of-the-art performance on simulated and real-life datasets.
    Finite-Time Analysis of Entropy-Regularized Neural Natural Actor-Critic Algorithm. (arXiv:2206.00833v1 [cs.LG])
    Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization and optimization techniques (e.g., gradient clipping and averaging) to achieve provably good performance in terms of sample complexity, iteration complexity and overparametrization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and averaging ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies and (ii) regularization leads to sharp sample complexity and network width bounds in the regularized MDPs, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of uniform approximation power of the actor neural network to achieve global optimality in policy optimization due to distributional shift.
    Bayesian Inference of Stochastic Dynamical Networks. (arXiv:2206.00858v1 [stat.ML])
    Network inference has been extensively studied in several fields, such as systems biology and social sciences. Learning network topology and internal dynamics is essential to understand mechanisms of complex systems. In particular, sparse topologies and stable dynamics are fundamental features of many real-world continuous-time networks. Given that usually only a partial set of nodes can be observed, in this paper, we consider linear continuous-time systems to depict networks since they can model unmeasured nodes via transfer functions. Additionally, measurements tend to be noisy and with low and varying sampling frequencies. For this reason, we consider continuous-time models (CT) since discrete-time approximations often require fine-grained measurements and uniform sampling steps. The developed method applies dynamical structure functions (DSFs) derived from linear stochastic differential equations (SDEs) to describe networks of measured nodes. Further, a numerical sampling method, preconditioned Crank-Nicolson (pCN), is used to refine coarse-grained trajectories to improve inference accuracy. The simulation conducted on random and ring networks, and a synthetic biological network illustrate that our method achieves state-of-the-art performance compared with group sparse Bayesian learning (GSBL), BINGO, kernel-based methods, dynGENIE3, GENIE3 and ARNI. In particular, these are challenging networks, suggesting that the developed method can be applied under a wide range of contexts.
    Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks. (arXiv:2202.00293v2 [stat.ML] UPDATED)
    Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.  ( 2 min )
    The Phenomenon of Policy Churn. (arXiv:2206.00730v1 [cs.LG])
    We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts $\epsilon$-greedy exploration in a fresh light, namely that $\epsilon$-noise plays a much smaller role than expected.
    Metrizing Fairness. (arXiv:2205.15049v2 [cs.LG] UPDATED)
    We study supervised learning problems for predicting properties of individuals who belong to one of two demographic groups, and we seek predictors that are fair according to statistical parity. This means that the distributions of the predictions within the two groups should be close with respect to the Kolmogorov distance, and fairness is achieved by penalizing the dissimilarity of these two distributions in the objective function of the learning problem. In this paper, we showcase conceptual and computational benefits of measuring unfairness with integral probability metrics (IPMs) other than the Kolmogorov distance. Conceptually, we show that the generator of any IPM can be interpreted as a family of utility functions and that unfairness with respect to this IPM arises if individuals in the two demographic groups have diverging expected utilities. We also prove that the unfairness-regularized prediction loss admits unbiased gradient estimators if unfairness is measured by the squared $\mathcal L^2$-distance or by a squared maximum mean discrepancy. In this case, the fair learning problem is susceptible to efficient stochastic gradient descent (SGD) algorithms. Numerical experiments on real data show that these SGD algorithms outperform state-of-the-art methods for fair learning in that they achieve superior accuracy-unfairness trade-offs -- sometimes orders of magnitude faster. Finally, we identify conditions under which statistical parity can improve prediction accuracy.
    Revisiting the General Identifiability Problem. (arXiv:2206.01081v1 [cs.LG])
    We revisit the problem of general identifiability originally introduced in [Lee et al., 2019] for causal inference and note that it is necessary to add positivity assumption of observational distribution to the original definition of the problem. We show that without such an assumption the rules of do-calculus and consequently the proposed algorithm in [Lee et al., 2019] are not sound. Moreover, adding the assumption will cause the completeness proof in [Lee et al., 2019] to fail. Under positivity assumption, we present a new algorithm that is provably both sound and complete. A nice property of this new algorithm is that it establishes a connection between general identifiability and classical identifiability by Pearl [1995] through decomposing the general identifiability problem into a series of classical identifiability sub-problems.  ( 2 min )
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v3 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    Invertible Neural Networks for Graph Prediction. (arXiv:2206.01163v1 [stat.ML])
    In this work, we address conditional generation using deep invertible neural networks. This is a type of problem where one aims to infer the most probable inputs $X$ given outcomes $Y$. We call our method \textit{invertible graph neural network} (iGNN) due to the primary focus on generating node features on graph data. A notable feature of our proposed methods is that during network training, we revise the typically-used loss objective in normalizing flow and consider Wasserstein-2 regularization to facilitate the training process. Algorithmic-wise, we adopt an end-to-end training approach since our objective is to address prediction and generation in the forward and backward processes at once through a single model. Theoretically, we characterize the conditions for identifiability of a true mapping, the existence and invertibility of the mapping, and the expressiveness of iGNN in learning the mapping. Experimentally, we verify the performance of iGNN on both simulated and real-data datasets. We demonstrate through extensive numerical experiments that iGNN shows clear improvement over competing conditional generation benchmarks on high-dimensional and/or non-convex data.
    Indeterminacy in Latent Variable Models: Characterization and Strong Identifiability. (arXiv:2206.00801v1 [stat.ML])
    Most modern latent variable and probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Recent applications of such models have indicated the need for \textit{strongly} identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, most notably by the iVAE (arXiv:1907.04809 [stat.ML]), which excludes many -- but not all -- indeterminacies. We construct a full theoretical framework for analyzing the indeterminacies of latent variable models, and characterize them precisely in terms of properties of the generator functions and the latent variable prior distributions. To illustrate, we apply the framework to better understand the structure of recent identifiability results. We then investigate how we might specify strongly identifiable latent variable models, and construct two such classes of models. One is a straightforward modification of iVAE; the other uses ideas from optimal transport and leads to novel models and connections to recent work.
    Weakly Supervised Representation Learning with Sparse Perturbations. (arXiv:2206.01101v1 [cs.LG])
    The theory of representation learning aims to build methods that provably invert the data generating process with minimal domain knowledge or any source of supervision. Most prior approaches require strong distributional assumptions on the latent variables and weak supervision (auxiliary information such as timestamps) to provide provable identification guarantees. In this work, we show that if one has weak supervision from observations generated by sparse perturbations of the latent variables--e.g. images in a reinforcement learning environment where actions move individual sprites--identification is achievable under unknown continuous latent distributions. We show that if the perturbations are applied only on mutually exclusive blocks of latents, we identify the latents up to those blocks. We also show that if these perturbation blocks overlap, we identify latents up to the smallest blocks shared across perturbations. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling. We propose a natural estimation procedure based on this theory and illustrate it on low-dimensional synthetic and image-based experiments.
    Faster Rates of Convergence to Stationary Points in Differentially Private Optimization. (arXiv:2206.00846v1 [cs.LG])
    We study the problem of approximating stationary points of Lipschitz and smooth functions under $(\varepsilon,\delta)$-differential privacy (DP) in both the finite-sum and stochastic settings. A point $\widehat{w}$ is called an $\alpha$-stationary point of a function $F:\mathbb{R}^d\rightarrow\mathbb{R}$ if $\|\nabla F(\widehat{w})\|\leq \alpha$. We provide a new efficient algorithm that finds an $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{2/3}\big)$-stationary point in the finite-sum setting, where $n$ is the number of samples. This improves on the previous best rate of $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$. We also give a new construction that improves over the existing rates in the stochastic optimization setting, where the goal is to find approximate stationary points of the population risk. Our construction finds a $\tilde{O}\big(\frac{1}{n^{1/3}} + \big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$-stationary point of the population risk in time linear in $n$. Furthermore, under the additional assumption of convexity, we completely characterize the sample complexity of finding stationary points of the population risk (up to polylog factors) and show that the optimal rate on population stationarity is $\tilde \Theta\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\big)$. Finally, we show that our methods can be used to provide dimension-independent rates of $O\big(\frac{1}{\sqrt{n}}+\min\big(\big[\frac{\sqrt{rank}}{n\varepsilon}\big]^{2/3},\frac{1}{(n\varepsilon)^{2/5}}\big)\big)$ on population stationarity for Generalized Linear Models (GLM), where $rank$ is the rank of the design matrix, which improves upon the previous best known rate.
    The effective noise of Stochastic Gradient Descent. (arXiv:2112.10852v3 [cond-mat.dis-nn] UPDATED)
    Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini-batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces stochastic dynamics into the gradient descent, with a non-trivial state-dependent noise. We characterize the stochasticity of SGD and a recently-introduced variant, \emph{persistent} SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as a function of the problem parameters. In the over-parametrized regime, where the training error vanishes, we measure the noise magnitude of SGD by computing the average distance between two replicas of the system with the same initialization and two different realizations of SGD noise. We find that the two noise measures behave similarly as a function of the problem parameters. Moreover, we observe that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
    Counterfactual Phenotyping with Censored Time-to-Events. (arXiv:2202.11089v2 [cs.LG] UPDATED)
    Estimation of treatment efficacy of real-world clinical interventions involves working with continuous outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Counterfactual reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of a set of latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and helps determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response on multiple large randomized clinical trials originally conducted to assess appropriate treatments to reduce cardiovascular risk.
    Sequential Bayesian Neural Subnetwork Ensembles. (arXiv:2206.00794v1 [stat.ML])
    Deep neural network ensembles that appeal to model diversity have been used successfully to improve predictive performance and model robustness in several applications. Meanwhile, it has recently been shown that sparse subnetworks of dense models can match the performance of their dense counterparts and increase their robustness while effectively decreasing the model complexity. However, most ensembling techniques require multiple parallel and costly evaluations and have been proposed primarily for deterministic models, whereas sparsity induction has mostly been done through ad-hoc pruning. We propose sequential ensembling of dynamic Bayesian neural subnetworks, which systematically reduces model complexity through sparsity-inducing priors and generates diverse ensembles in a single forward pass of the model. The ensembling strategy consists of an exploration phase that finds high-performing regions of the parameter space and multiple exploitation phases that effectively exploit the compactness of the sparse model to quickly converge to different minima in the energy landscape corresponding to high-performing subnetworks, yielding diverse ensembles. We empirically demonstrate that our proposed approach surpasses the baselines of dense frequentist and Bayesian ensemble models in prediction accuracy, uncertainty estimation, and out-of-distribution (OoD) robustness on the CIFAR10 and CIFAR100 datasets and their corruption-induced out-of-distribution variants, CIFAR10-C and CIFAR100-C. Furthermore, we found that our approach produced the most diverse ensembles compared to approaches with a single forward pass, and, in some cases, even compared to approaches with multiple forward passes.
    Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers. (arXiv:2009.01367v3 [cs.LG] UPDATED)
    While neural network binary classifiers are often evaluated on metrics such as Accuracy and $F_1$-Score, they are commonly trained with a cross-entropy objective. How can this training-evaluation gap be addressed? While specific techniques have been adopted to optimize certain confusion matrix based metrics, it is challenging or impossible in some cases to generalize the techniques to other metrics. Adversarial learning approaches have also been proposed to optimize networks via confusion matrix based metrics, but they tend to be much slower than common training methods. In this work, we propose a unifying approach to training neural network binary classifiers that combines a differentiable approximation of the Heaviside function with a probabilistic view of the typical confusion matrix values using soft sets. Our theoretical analysis shows the benefit of using our method to optimize for a given evaluation metric, such as $F_1$-Score, with soft sets, and our extensive experiments show the effectiveness of our approach in several domains.
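    As a rough illustration of the idea above (a sketch, not the authors' exact construction), a temperature-scaled sigmoid can stand in for the Heaviside function, turning confusion-matrix counts into differentiable soft-set quantities; the names and the temperature parameter tau below are illustrative.

        import torch

        def soft_f1_loss(logits, targets, tau=1.0):
            # Sigmoid as a differentiable surrogate for the Heaviside step;
            # the resulting "soft" confusion-matrix entries act as soft sets.
            probs = torch.sigmoid(logits / tau)
            tp = (probs * targets).sum()
            fp = (probs * (1.0 - targets)).sum()
            fn = ((1.0 - probs) * targets).sum()
            soft_f1 = 2.0 * tp / (2.0 * tp + fp + fn + 1e-8)
            return 1.0 - soft_f1  # minimizing this maximizes the soft F1-Score

    Minimizing such a loss targets the evaluation metric directly rather than cross-entropy, which is exactly the training-evaluation gap the abstract describes.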
    On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting. (arXiv:2206.00761v1 [cs.LG])
    The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM instead prescribes first making explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of a baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability, and sample efficiency.
    Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective. (arXiv:2205.07320v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis (LTH) has attracted attention because it can explain why over-parameterized models often show high generalization ability. It is known that when using iterative magnitude pruning (IMP), an algorithm that finds sparse subnetworks with high generalization ability that can be trained in isolation from the initial weights (so-called winning tickets), a large initial learning rate does not work well in deep neural networks such as ResNet. However, since a large initial learning rate generally helps the optimizer converge to flatter minima, we hypothesize that winning tickets have relatively sharp minima, which is considered a disadvantage in terms of generalization ability. In this paper, we confirm this hypothesis and show that PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior. On the basis of our experimental findings that flatness is useful for improving accuracy and robustness to label noise, and that the distance from the initial weights is deeply involved in winning tickets, we offer a PAC-Bayes bound using a spike-and-slab distribution to analyze winning tickets. Finally, we revisit existing algorithms for finding winning tickets from a PAC-Bayesian perspective and provide new insights into these methods.  ( 2 min )
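    For context, a minimal sketch of the IMP-with-rewinding loop referred to above; train_fn, the pruning fraction, and the per-tensor treatment of layers are placeholders rather than the authors' setup.

        import copy
        import torch

        def iterative_magnitude_pruning(model, train_fn, rounds=5, frac=0.2):
            init_state = copy.deepcopy(model.state_dict())  # kept for rewinding
            masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
            for _ in range(rounds):
                train_fn(model, masks)  # train with masks applied to the weights
                for n, p in model.named_parameters():
                    surviving = (p * masks[n]).abs()[masks[n] > 0]
                    k = int(frac * surviving.numel())  # prune a fraction per round
                    if k > 0:
                        thresh = surviving.kthvalue(k).values
                        masks[n] = masks[n] * (p.abs() > thresh).float()
                model.load_state_dict(init_state)  # rewind survivors to init
            return masks  # the masked initial network is the winning ticket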
    Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features. (arXiv:1911.05350v2 [stat.ML] UPDATED)
    Although kernel methods are widely used in many learning problems, they have poor scalability to large datasets. To address this problem, sketching and stochastic gradient methods are the most commonly used techniques to derive efficient large-scale learning algorithms. In this study, we consider solving a binary classification problem using random features and stochastic gradient descent. In recent research, an exponential convergence rate of the expected classification error under the strong low-noise condition has been shown. We extend these analyses to the random-features setting, bounding the error induced by the random-features approximation in terms of the distance between the learned hypothesis and the population and empirical risk minimizers for general Lipschitz loss functions, and show that an exponential convergence rate of the expected classification error is achieved even when the random-features approximation is applied. Additionally, we demonstrate that the convergence rate does not depend on the number of features and that there is a significant computational benefit to using random features in classification problems because of the strong low-noise condition.
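    To fix ideas, a minimal sketch of the setting studied: random Fourier features approximating an RBF kernel, trained with SGD on the logistic loss. Labels are assumed in {-1, +1}; the feature dimension, bandwidth, and learning rate are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)

        def random_fourier_features(X, D=500, gamma=1.0):
            # z(x) = sqrt(2/D) * cos(W x + b) approximates the RBF kernel
            W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
            b = rng.uniform(0, 2 * np.pi, size=D)
            return np.sqrt(2.0 / D) * np.cos(X @ W + b)

        def sgd_logistic(Z, y, lr=0.1, epochs=10):
            # plain SGD on the logistic loss in the random-feature space
            w = np.zeros(Z.shape[1])
            for _ in range(epochs):
                for i in rng.permutation(len(y)):
                    margin = y[i] * (Z[i] @ w)
                    w += lr * y[i] * Z[i] / (1.0 + np.exp(margin))
            return w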
    Deep neural networks can stably solve high-dimensional, noisy, non-linear inverse problems. (arXiv:2206.00934v1 [math.NA])
    We study the problem of reconstructing solutions of inverse problems with neural networks when only noisy data is available. We assume the problem can be modeled with an infinite-dimensional forward operator that is not continuously invertible. Then, we restrict this forward operator to finite-dimensional spaces so that the inverse is Lipschitz continuous. For the inverse operator, we demonstrate that there exists a neural network which is a robust-to-noise approximation of the function. In addition, we show that these neural networks can be learned from appropriately perturbed training data. We demonstrate the applicability of this approach to a wide range of inverse problems of practical interest. Numerical examples are given that support the theoretical findings.
    Score-Based Generative Models Detect Manifolds. (arXiv:2206.01018v1 [stat.ML])
    Score-based generative models (SGMs) need to approximate the scores $\nabla \log p_t$ of the intermediate distributions as well as the final distribution $p_T$ of the forward process. The theoretical underpinnings of the effects of these approximations are still lacking. We find precise conditions under which SGMs are able to produce samples from an underlying (low-dimensional) data manifold $\mathcal{M}$. This assures us that SGMs are able to generate the "right kind of samples". For example, taking $\mathcal{M}$ to be the subset of images of faces, we find conditions under which the SGM robustly produces an image of a face, even though the relative frequencies of these images might not accurately represent the true data generating distribution. Moreover, this analysis is a first step towards understanding the generalization properties of SGMs: Taking $\mathcal{M}$ to be the set of all training samples, our results provide a precise description of when the SGM memorizes its training data.
    Evaluating Modules in Graph Contrastive Learning. (arXiv:2106.08171v2 [cs.LG] UPDATED)
    The recent emergence of contrastive learning approaches facilitates the application on graph representation learning (GRL), introducing graph contrastive learning (GCL) into the literature. These methods contrast semantically similar and dissimilar sample pairs to encode the semantics into node or graph embeddings. However, most existing works only performed \textbf{model-level} evaluation, and did not explore the combination space of modules for more comprehensive and systematic studies. For effective \textbf{module-level} evaluation, we propose a framework that decomposes GCL models into four modules: (1) a \textbf{sampler} to generate anchor, positive and negative data samples (nodes or graphs); (2) an \textbf{encoder} and a \textbf{readout} function to get sample embeddings; (3) a \textbf{discriminator} to score each sample pair (anchor-positive and anchor-negative); and (4) an \textbf{estimator} to define the loss function. Based on this framework, we conduct controlled experiments over a wide range of architectural designs and hyperparameter settings on node and graph classification tasks. Specifically, we manage to quantify the impact of a single module, investigate the interaction between modules, and compare the overall performance with current model architectures. Our key findings include a set of module-level guidelines for GCL, e.g., simple samplers from LINE and DeepWalk are strong and robust; an MLP encoder associated with Sum readout could achieve competitive performance on graph classification. Finally, we release our implementations and results as OpenGCL, a modularized toolkit that allows convenient reproduction, standard model and module evaluation, and easy extension. OpenGCL is available at \url{https://github.com/thunlp/OpenGCL}.
    On the Global Convergence Rates of Softmax Policy Gradient Methods. (arXiv:2005.06392v3 [cs.LG] UPDATED)
    We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a \L{}ojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-c \cdot t})$ toward softmax optimal policy $(c > 0)$. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new $\Omega(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform \L{}ojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
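    To make the convergence statements concrete, here is the true-gradient softmax policy gradient on a single-state MDP, with an optional entropy regularizer tau (a sketch with illustrative rewards and step size): with tau = 0 the suboptimality decays at an $O(1/t)$ rate, while tau > 0 yields linear convergence towards the softmax optimal policy.

        import numpy as np

        r = np.array([1.0, 0.8, 0.1])   # rewards for each action
        theta = np.zeros(3)             # softmax logits
        tau, eta = 0.0, 0.4             # entropy strength, step size
        for t in range(2000):
            pi = np.exp(theta - theta.max()); pi /= pi.sum()
            adv = r - pi @ r                              # advantage
            if tau > 0:                                   # entropy-regularized
                adv -= tau * (np.log(pi) - pi @ np.log(pi))
            theta += eta * pi * adv   # exact gradient: dV/dtheta_a = pi_a * adv_a
        print(pi)                     # mass concentrates on the optimal action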
    Feature Space Particle Inference for Neural Network Ensembles. (arXiv:2206.00944v1 [cs.LG])
    Ensembles of deep neural networks demonstrate improved performance over single models. For enhancing the diversity of ensemble members while keeping their performance, particle-based inference methods offer a promising approach from a Bayesian perspective. However, the best way to apply these methods to neural networks is still unclear: seeking samples from the weight-space posterior suffers from inefficiency due to the over-parameterization issues, while seeking samples directly from the function-space posterior often results in serious underfitting. In this study, we propose optimizing particles in the feature space where the activation of a specific intermediate layer lies to address the above-mentioned difficulties. Our method encourages each member to capture distinct features, which is expected to improve ensemble prediction robustness. Extensive evaluation on real-world datasets shows that our model significantly outperforms the gold-standard Deep Ensembles on various metrics, including accuracy, calibration, and robustness. Code is available at https://github.com/DensoITLab/featurePI .
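    The particle-based inference the abstract builds on is typified by Stein variational gradient descent (SVGD); below is a generic SVGD step, which the paper's proposal would apply to particles living in an intermediate feature space rather than in weight or function space. The RBF bandwidth and step size are illustrative placeholders.

        import torch

        def svgd_step(particles, log_prob, lr=0.1, bw=1.0):
            x = particles.detach().requires_grad_(True)
            score = torch.autograd.grad(log_prob(x).sum(), x)[0]   # grad log p
            diff = x.unsqueeze(1) - x.unsqueeze(0)                 # pairwise x_j - x_i
            k = torch.exp(-(diff ** 2).sum(-1) / (2 * bw ** 2))    # RBF kernel matrix
            grad_k = -(k.unsqueeze(-1) * diff / bw ** 2).sum(0)    # repulsive term
            phi = (k @ score + grad_k) / len(x)                    # Stein direction
            return (x + lr * phi).detach()

    The repulsive term is what keeps the particles (here, ensemble members) from collapsing onto one another, encouraging the diversity the abstract emphasizes.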
    Boosting Independent Component Analysis. (arXiv:2112.06920v3 [stat.ML] UPDATED)
    Independent component analysis aims to recover mutually independent source components from their linear mixtures. This technique has been widely used in many fields, such as data analysis, signal processing, and machine learning. To alleviate the dependency on prior knowledge concerning unknown sources, many nonparametric methods have been proposed. In this paper, we present a novel boosting-based algorithm for independent component analysis. Our algorithm combines maximum likelihood estimation via boosting with a fixed-point search for the unmixing matrix. A variety of experiments validate its performance compared with many of the presently known algorithms.
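    For reference, the fixed-point component mentioned above, in its classical FastICA form rather than the paper's boosting-based variant: a one-unit update on whitened data X of shape d x n, using tanh as the contrast nonlinearity.

        import numpy as np

        def fastica_one_unit(X, iters=200, seed=0):
            d, _ = X.shape
            w = np.random.default_rng(seed).normal(size=d)
            w /= np.linalg.norm(w)
            for _ in range(iters):
                wx = w @ X
                g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2   # contrast fn
                w_new = (X * g).mean(axis=1) - g_prime.mean() * w  # fixed point
                w = w_new / np.linalg.norm(w_new)
            return w  # one row of the unmixing matrix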
    Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. (arXiv:2206.00939v1 [stat.ML])
    The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.
    Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions. (arXiv:2206.01029v1 [math.OC])
    We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular performing as well as a full-batch but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance.
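    For concreteness, the object under analysis is plain mini-batch SGD with heavy-ball momentum on least squares; the problem sizes, batch size, and hyperparameters below are arbitrary placeholders.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d, batch = 2000, 500, 256
        A = rng.normal(size=(n, d)) / np.sqrt(d)
        b = A @ rng.normal(size=d)

        x, v = np.zeros(d), np.zeros(d)
        lr, momentum = 0.5, 0.9
        for step in range(500):
            idx = rng.choice(n, size=batch, replace=False)  # sample a mini-batch
            grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch
            v = momentum * v - lr * grad                    # heavy-ball update
            x += v
        print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))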
    Bayesian Inference for the Multinomial Probit Model under Gaussian Prior Distribution. (arXiv:2206.00720v1 [stat.ME])
    Multinomial probit (MNP) models are fundamental and widely-applied regression models for categorical data. Fasano and Durante (2022) proved that the class of unified skew-normal distributions is conjugate to several MNP sampling models. This makes it possible to develop Monte Carlo samplers and accurate variational methods to perform Bayesian inference. In this paper, we adapt the above-mentioned results to a popular special case: the discrete-choice MNP model under zero-mean and independent Gaussian priors. This yields simplified expressions for the parameters of the posterior distribution and an alternative derivation of the variational algorithm, which gives a novel understanding of the fundamental results in Fasano and Durante (2022) as well as computational advantages in our special settings.
    Estimating the Arc Length of the Optimal ROC Curve and Lower Bounding the Maximal AUC. (arXiv:2110.09651v2 [math.ST] UPDATED)
    We show that when the data likelihood ratio is used as the score function, the arc length of the corresponding ROC curve gives rise to a novel $f$-divergence which measures differences between the positive and negative data distributions. This $f$-divergence can be expressed using a variational objective and estimated only using samples from the positive and negative \emph{data} distributions. We show the empirical version of this variational objective is also a consistent estimator for the arctangent likelihood ratio with a non-parametric convergence rate $O_p(n^{-\beta/4})$ ($\beta \in (0,1]$ depends on the smoothness). Moreover, we show the surface area below the optimal ROC curve can be expressed as a similar variational objective depending on the arctangent likelihood ratio. These new insights lead to a novel two-step procedure for finding a good score function by lower bounding the maximal AUC. Experiments on CIFAR-10 datasets show the proposed two-step procedure achieves good AUC performance in imbalanced binary classification tasks while being less computationally demanding.
    Compositional Coding Capsule Network with K-Means Routing for Text Classification. (arXiv:1810.09177v5 [cs.LG] UPDATED)
    Text classification is a challenging problem that aims to identify the category of a given text. During training, word embeddings account for a large fraction of the model parameters, which, under limited computing resources, indirectly restricts the capacity of subsequent network designs. The compositional coding mechanism was recently proposed to reduce the number of parameters. Building on it, this paper further explores compositional coding and proposes a compositional weighted coding method. We also apply a capsule network to model the relationships between word embeddings, and propose a new routing algorithm based on k-means clustering to fully exploit these relationships. Combining our compositional weighted coding method with this routing algorithm, we design a neural network for text classification. Experiments conducted on eight challenging text classification datasets show that the proposed method achieves competitive accuracy compared to the state-of-the-art approach with significantly fewer parameters.
    An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation. (arXiv:2206.01182v1 [stat.ML])
    Subsampling methods aim to select a subsample as a surrogate for the observed sample. Such methods have been used pervasively in large-scale data analytics, active learning, and privacy-preserving analysis in recent decades. Instead of model-based methods, in this paper, we study model-free subsampling methods, which aim to identify a subsample that is not confined by model assumptions. Existing model-free subsampling methods are usually built upon clustering techniques or kernel tricks. Most of these methods suffer from either a large computational burden or a theoretical weakness. In particular, the theoretical weakness is that the empirical distribution of the selected subsample may not necessarily converge to the population distribution. Such computational and theoretical limitations hinder the broad applicability of model-free subsampling methods in practice. We propose a novel model-free subsampling method by utilizing optimal transport techniques. Moreover, we develop an efficient subsampling algorithm that is adaptive to the unknown probability density function. Theoretically, we show the selected subsample can be used for efficient density estimation by deriving the convergence rate for the proposed subsample kernel density estimator. We also provide the optimal bandwidth for the proposed estimator. Numerical studies on synthetic and real-world datasets demonstrate that the performance of the proposed method is superior.
    Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning. (arXiv:2206.01162v1 [cs.LG])
    In this work, we propose a novel ${\bf K}$ernelized ${\bf S}$tein Discrepancy-based Posterior Sampling for ${\bf RL}$ algorithm (named $\texttt{KSRL}$) which extends model-based RL based upon posterior sampling (PSRL) in several ways: we (i) relax the need for any smoothness or Gaussian assumptions, allowing for complex mixture models; (ii) ensure it is applicable to large-scale training by incorporating a compression step such that the posterior consists of a \emph{Bayesian coreset} of only statistically significant past state-action pairs; and (iii) develop a novel regret analysis of PSRL based upon integral probability metrics, which, under a smoothness condition on the constructed posterior, can be evaluated in closed form as the kernelized Stein discrepancy (KSD). Consequently, we are able to improve the $\mathcal{O}(H^{3/2}d\sqrt{T})$ regret of PSRL to $\mathcal{O}(H^{3/2}\sqrt{T})$, where $d$ is the input dimension, $H$ is the episode length, and $T$ is the total number of episodes experienced, alleviating a linear dependence on $d$. Moreover, we theoretically establish a trade-off between regret rate and posterior representational complexity by introducing a compression budget parameter $\epsilon$ based on KSD, and establish a lower bound on the required complexity for consistency of the model. Experimentally, we observe that this approach is competitive with several state-of-the-art RL methodologies, with substantive improvements in computation time, achieving up to a $50\%$ reduction in wall-clock time in some continuous control environments.
    Sparse Mixed Linear Regression with Guarantees: Taming an Intractable Problem with Invex Relaxation. (arXiv:2206.01167v1 [cs.LG])
    In this paper, we study the problem of sparse mixed linear regression on an unlabeled dataset that is generated from linear measurements from two different regression parameter vectors. Since the data is unlabeled, our task is not only to figure out a good approximation of the regression parameter vectors but also to label the dataset correctly. In its original form, this problem is NP-hard. The most popular algorithms for solving this problem (such as Expectation-Maximization) tend to get stuck in local minima. We provide a novel invex relaxation for this intractable problem which leads to a solution with provable theoretical guarantees. This relaxation enables exact recovery of the data labels. Furthermore, we recover a close approximation of the regression parameter vectors which match the true parameter vectors in support and sign. Our formulation uses a carefully constructed primal-dual witness framework for the invex problem. Furthermore, we show that the sample complexity of our method is only logarithmic in the dimension of the regression parameter vectors.
    Efficient $\Phi$-Regret Minimization in Extensive-Form Games via Online Mirror Descent. (arXiv:2205.15294v2 [cs.LG] UPDATED)
    A conceptually appealing approach for learning Extensive-Form Games (EFGs) is to convert them to Normal-Form Games (NFGs). This approach enables us to directly translate state-of-the-art techniques and analyses in NFGs to learning EFGs, but typically suffers from computational intractability due to the exponential blow-up of the game size introduced by the conversion. In this paper, we address this problem in natural and important setups for the \emph{$\Phi$-Hedge} algorithm -- a generic algorithm capable of learning a large class of equilibria for NFGs. We show that $\Phi$-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs. We prove that, in those settings, the \emph{$\Phi$-Hedge} algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for EFGs with suitable dilated regularizers, and run in polynomial time. This new connection further allows us to design and analyze a new class of OMD algorithms based on modifying its log-partition function. In particular, we design an improved algorithm with balancing techniques that achieves a sharp $\widetilde{\mathcal{O}}(\sqrt{XAT})$ EFCE-regret under bandit-feedback in an EFG with $X$ information sets, $A$ actions, and $T$ episodes. To the best of our knowledge, this is the first such rate and matches the information-theoretic lower bound.
    Primal-dual extrapolation methods for monotone inclusions under local Lipschitz continuity with applications to variational inequality, conic constrained saddle point, and convex conic optimization problems. (arXiv:2206.00973v1 [math.OC])
    In this paper we consider a class of structured monotone inclusion (MI) problems that consist of finding a zero in the sum of two monotone operators, in which one is maximal monotone while another is locally Lipschitz continuous. In particular, we first propose a primal-dual extrapolation (PDE) method for solving a structured strongly MI problem by modifying the classical forward-backward splitting method by using a point and operator extrapolation technique, in which the parameters are adaptively updated by a backtracking line search scheme. The proposed PDE method is almost parameter-free, equipped with a verifiable termination criterion, and enjoys an operation complexity of ${\cal O}(\log \epsilon^{-1})$, measured by the amount of fundamental operations consisting only of evaluations of one operator and resolvent of another operator, for finding an $\epsilon$-residual solution of the structured strongly MI problem. We then propose another PDE method for solving a structured non-strongly MI problem by applying the above PDE method to approximately solve a sequence of structured strongly MI problems. The resulting PDE method is parameter-free, equipped with a verifiable termination criterion, and enjoys an operation complexity of ${\cal O}(\epsilon^{-1}\log \epsilon^{-1})$ for finding an $\epsilon$-residual solution of the structured non-strongly MI problem. As a consequence, we apply the latter PDE method to convex conic optimization, conic constrained saddle point, and variational inequality problems, and obtain complexity results for finding an $\epsilon$-KKT or $\epsilon$-residual solution of them under local Lipschitz continuity. To the best of our knowledge, no prior studies were conducted to investigate methods with complexity guarantees for solving the aforementioned problems under local Lipschitz continuity. All the complexity results obtained in this paper are entirely new.
    DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. (arXiv:2206.00927v1 [cs.LG])
    Diffusion probabilistic models (DPMs) are emerging powerful generative models. Despite their high-quality generation performance, DPMs still suffer from their slow sampling as they generally need hundreds or thousands of sequential function evaluations (steps) of large neural networks to draw a sample. Sampling from DPMs can be viewed alternatively as solving the corresponding diffusion ordinary differential equations (ODEs). In this work, we propose an exact formulation of the solution of diffusion ODEs. The formulation analytically computes the linear part of the solution, rather than leaving all terms to black-box ODE solvers as adopted in previous works. By applying change-of-variable, the solution can be equivalently simplified to an exponentially weighted integral of the neural network. Based on our formulation, we propose DPM-Solver, a fast dedicated high-order solver for diffusion ODEs with the convergence order guarantee. DPM-Solver is suitable for both discrete-time and continuous-time DPMs without any further training. Experimental results show that DPM-Solver can generate high-quality samples in only 10 to 20 function evaluations on various datasets. We achieve 4.70 FID in 10 function evaluations and 2.87 FID in 20 function evaluations on the CIFAR10 dataset, and a $4\sim 16\times$ speedup compared with previous state-of-the-art training-free samplers on various datasets.
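    A heavily hedged sketch of the resulting first-order update (DPM-Solver-1): assuming a noise-prediction network eps_model and discrete schedules alphas, sigmas indexed by integer timesteps, with lambda_t = log(alpha_t / sigma_t), each step handles the linear part of the ODE exactly and approximates only the neural-network integral. Consult the paper and its released code for the exact higher-order solvers.

        import torch

        def dpm_solver_1(x, eps_model, alphas, sigmas, timesteps):
            lam = torch.log(alphas / sigmas)     # half-log-SNR schedule
            for s, t in zip(timesteps[:-1], timesteps[1:]):  # decreasing times
                h = lam[t] - lam[s]
                eps = eps_model(x, s)            # noise prediction at time s
                x = (alphas[t] / alphas[s]) * x - sigmas[t] * torch.expm1(h) * eps
            return x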
    A Log-Linear Time Sequential Optimal Calibration Algorithm for Quantized Isotonic L2 Regression. (arXiv:2206.00744v1 [cs.LG])
    We study the sequential calibration of estimations in a quantized isotonic L2 regression setting. We start by showing that the optimal calibrated quantized estimations can be acquired from the traditional isotonic L2 regression solution. We modify the traditional PAVA algorithm to create calibrators for both batch and sequential optimization of the quantized isotonic regression problem. Our algorithm can update the optimal quantized monotone mapping for the samples observed so far in linear space and logarithmic time per new unordered sample.  ( 2 min )
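    For reference, the classical batch PAVA that the paper modifies (unweighted L2, increasing fit); the sequential and quantized extensions described above build on this primitive.

        def pava(y):
            # Pool Adjacent Violators: each block stores (mean, weight, count)
            blocks = []
            for v in y:
                blocks.append([v, 1.0, 1])
                while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
                    m2, w2, c2 = blocks.pop()   # merge violating blocks
                    m1, w1, c1 = blocks.pop()
                    w = w1 + w2
                    blocks.append([(m1 * w1 + m2 * w2) / w, w, c1 + c2])
            fit = []
            for m, _, c in blocks:
                fit.extend([m] * c)
            return fit

        print(pava([1, 3, 2, 2, 5]))  # -> [1, 2.333..., 2.333..., 2.333..., 5]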
    Improving Diffusion Models for Inverse Problems using Manifold Constraints. (arXiv:2206.00941v1 [cs.LG])
    Recently, diffusion models have been used to solve various inverse problems in an unsupervised manner with appropriate modifications to the sampling process. However, the current solvers, which recursively apply a reverse diffusion step followed by a measurement consistency step, often produce sub-optimal results. By studying the generative sampling path, here we show that current solvers throw the sample path off the data manifold, and hence the error accumulates. To address this, we propose an additional correction term inspired by the manifold constraint, which can be used synergistically with the previous solvers to make the iterations close to the manifold. The proposed manifold constraint is straightforward to implement within a few lines of code, yet boosts the performance by a surprisingly large margin. With extensive experiments, we show that our method is superior to the previous methods both theoretically and empirically, producing promising results in many applications such as image inpainting, colorization, and sparse-view computed tomography.  ( 2 min )
    Discovery of interpretable structural model errors by combining Bayesian sparse regression and data assimilation: A chaotic Kuramoto-Sivashinsky test case. (arXiv:2110.00546v2 [physics.comp-ph] UPDATED)
    Models of many engineering and natural systems are imperfect. The discrepancy between the mathematical representations of a true physical system and its imperfect model is called the model error. These model errors can lead to substantial differences between the numerical solutions of the model and the state of the system, particularly in those involving nonlinear, multi-scale phenomena. Thus, there is increasing interest in reducing model errors, particularly by leveraging the rapidly growing observational data to understand their physics and sources. Here, we introduce a framework named MEDIDA: Model Error Discovery with Interpretability and Data Assimilation. MEDIDA only requires a working numerical solver of the model and a small number of noise-free or noisy sporadic observations of the system. In MEDIDA, first the model error is estimated from differences between the observed states and model-predicted states (the latter are obtained from a number of one-time-step numerical integrations from the previous observed states). If observations are noisy, a data assimilation (DA) technique such as ensemble Kalman filter (EnKF) is employed to provide the analysis state of the system, which is then used to estimate the model error. Finally, an equation-discovery technique, here the relevance vector machine (RVM), a sparsity-promoting Bayesian method, is used to identify an interpretable, parsimonious, and closed-form representation of the model error. Using the chaotic Kuramoto-Sivashinsky (KS) system as the test case, we demonstrate the excellent performance of MEDIDA in discovering different types of structural/parametric model errors, representing different types of missing physics, using noise-free and noisy observations.  ( 2 min )
    Bayesian Model Selection, the Marginal Likelihood, and Generalization. (arXiv:2202.11678v2 [cs.LG] UPDATED)
    How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.  ( 2 min )
    Coordinated Double Machine Learning. (arXiv:2206.00885v1 [stat.ML])
    Double machine learning is a statistical method for leveraging complex black-box models to construct approximately unbiased treatment effect estimates given observational data with high-dimensional covariates, under the assumption of a partially linear model. The idea is to first fit on a subset of the samples two non-linear predictive models, one for the continuous outcome of interest and one for the observed treatment, and then to estimate a linear coefficient for the treatment using the remaining samples through a simple orthogonalized regression. While this methodology is flexible and can accommodate arbitrary predictive models, typically trained independently of one another, this paper argues that a carefully coordinated learning algorithm for deep neural networks may reduce the estimation bias. The improved empirical performance of the proposed method is demonstrated through numerical experiments on both simulated and real data.  ( 2 min )
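    A minimal sketch of the standard (uncoordinated) cross-fitted double machine learning estimator for the partially linear model y = theta * d + g(X) + noise, which is the baseline the paper improves on; the random-forest nuisance models are illustrative stand-ins for any black-box predictors.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import KFold

        def double_ml_plm(X, d, y, n_folds=2):
            res_y, res_d = np.zeros_like(y), np.zeros_like(d)
            for tr, te in KFold(n_folds, shuffle=True, random_state=0).split(X):
                m_y = RandomForestRegressor().fit(X[tr], y[tr])  # E[y | X]
                m_d = RandomForestRegressor().fit(X[tr], d[tr])  # E[d | X]
                res_y[te] = y[te] - m_y.predict(X[te])           # residualize
                res_d[te] = d[te] - m_d.predict(X[te])
            return (res_d @ res_y) / (res_d @ res_d)  # orthogonalized regression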
    Offline Reinforcement Learning with Differential Privacy. (arXiv:2206.00810v1 [cs.LG])
    The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), and is thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-size dataset.  ( 2 min )
    Learning Disentangled Representations for Counterfactual Regression via Mutual Information Minimization. (arXiv:2206.01022v1 [cs.LG])
    Learning individual-level treatment effects is a fundamental problem in causal inference and has received increasing attention in many areas, especially the user growth area, which is of concern to many internet companies. Recently, disentangled representation learning methods that decompose covariates into three latent factors, including instrumental, confounding and adjustment factors, have witnessed great success in treatment effect estimation. However, it remains an open problem how to learn the underlying disentangled factors precisely. Specifically, previous methods fail to obtain independent disentangled factors, which is a necessary condition for identifying treatment effects. In this paper, we propose Disentangled Representations for Counterfactual Regression via Mutual Information Minimization (MIM-DRCFR), which uses a multi-task learning framework to share information when learning the latent factors and incorporates MI minimization learning criteria to ensure the independence of these factors. Extensive experiments including public benchmarks and real-world industrial user growth datasets demonstrate that our method performs much better than state-of-the-art methods.  ( 2 min )
    A Confirmation of a Conjecture on the Feldman's Two-armed Bandit Problem. (arXiv:2206.00821v1 [math.ST])
    The myopic strategy is one of the most important strategies in the study of bandit problems. In this paper, we consider the two-armed bandit problem proposed by Feldman. With general distributions and utility functions, we obtain a necessary and sufficient condition for the optimality of the myopic strategy. As an application, we resolve Nouiehed and Ross's conjecture for Bernoulli two-armed bandit problems, namely that the myopic strategy stochastically maximizes the number of wins.  ( 2 min )
    Defense Against Gradient Leakage Attacks via Learning to Obscure Data. (arXiv:2206.00769v1 [cs.LG])
    Federated learning is considered an effective privacy-preserving learning mechanism that separates the client's data from the model training process. However, federated learning still risks privacy leakage because attackers can deliberately conduct gradient leakage attacks to reconstruct the client data. Recently, popular strategies such as gradient perturbation methods and input encryption methods have been proposed to defend against gradient leakage attacks. Nevertheless, these defenses either greatly sacrifice model performance or can be evaded by more advanced attacks. In this paper, we propose a new defense method to protect the privacy of clients' data by learning to obscure data. Our defense method can generate synthetic samples that are totally distinct from the original samples, while maximally preserving their predictive features and guaranteeing the model performance. Furthermore, our defense strategy makes the gradient leakage attack and its variants extremely difficult to reconstruct the client data. Through extensive experiments, we show that our proposed defense method obtains better privacy protection while preserving high accuracy compared with state-of-the-art methods.  ( 2 min )
    Collaborative Learning of Distributions under Heterogeneity and Communication Constraints. (arXiv:2206.00707v1 [stat.ML])
    In modern machine learning, users often have to collaborate to learn distributions that generate the data. Communication can be a significant bottleneck. Prior work has studied homogeneous users -- i.e., whose data follow the same discrete distribution -- and has provided optimal communication-efficient methods. However, these methods rely heavily on homogeneity, and are less applicable in the common case when users' discrete distributions are heterogeneous. Here we consider a natural and tractable model of heterogeneity, where users' discrete distributions only vary sparsely, on a small number of entries. We propose a novel two-stage method named SHIFT: First, the users collaborate by communicating with the server to learn a central distribution, relying on methods from robust statistics. Then, the learned central distribution is fine-tuned to estimate the individual distributions of users. We show that SHIFT is minimax optimal in our model of heterogeneity and under communication constraints. Further, we provide experimental results using both synthetic data and $n$-gram frequency estimation in the text domain, which corroborate its efficiency.  ( 2 min )
    Split-kl and PAC-Bayes-split-kl Inequalities. (arXiv:2206.00706v1 [stat.ML])
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality combines the combinatorial power of the kl inequality with the ability to exploit low variance. While for Bernoulli random variables the kl inequality is tighter than the Empirical Bernstein, for random variables taking values inside a bounded interval and having low variance the Empirical Bernstein inequality is tighter than the kl. The proposed split-kl inequality yields the best of both worlds. We discuss an application of the split-kl inequality to bounding excess losses. We also derive a PAC-Bayes-split-kl inequality and use a synthetic example and several UCI datasets to compare it with the PAC-Bayes-kl, PAC-Bayes Empirical Bernstein, PAC-Bayes Unexpected Bernstein, and PAC-Bayes Empirical Bennett inequalities.  ( 2 min )
  • Open

    Robust Feature-Level Adversaries are Interpretability Tools. (arXiv:2110.03605v4 [cs.LG] UPDATED)
    The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying the representations in models. Second, we show that these adversaries are versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results indicate that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations.
    WebGPT: Browser-assisted question-answering with human feedback. (arXiv:2112.09332v3 [cs.CL] UPDATED)
    We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
    Metrizing Fairness. (arXiv:2205.15049v2 [cs.LG] UPDATED)
    We study supervised learning problems for predicting properties of individuals who belong to one of two demographic groups, and we seek predictors that are fair according to statistical parity. This means that the distributions of the predictions within the two groups should be close with respect to the Kolmogorov distance, and fairness is achieved by penalizing the dissimilarity of these two distributions in the objective function of the learning problem. In this paper, we showcase conceptual and computational benefits of measuring unfairness with integral probability metrics (IPMs) other than the Kolmogorov distance. Conceptually, we show that the generator of any IPM can be interpreted as a family of utility functions and that unfairness with respect to this IPM arises if individuals in the two demographic groups have diverging expected utilities. We also prove that the unfairness-regularized prediction loss admits unbiased gradient estimators if unfairness is measured by the squared $\mathcal L^2$-distance or by a squared maximum mean discrepancy. In this case, the fair learning problem is susceptible to efficient stochastic gradient descent (SGD) algorithms. Numerical experiments on real data show that these SGD algorithms outperform state-of-the-art methods for fair learning in that they achieve superior accuracy-unfairness trade-offs -- sometimes orders of magnitude faster. Finally, we identify conditions under which statistical parity can improve prediction accuracy.
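    The squared maximum mean discrepancy mentioned above admits a well-known unbiased estimator; below is a sketch with a Gaussian kernel over the scalar predictions of the two demographic groups (the bandwidth is illustrative). A fair training loss would add a multiple of this penalty to the prediction loss and run SGD.

        import torch

        def squared_mmd(x, y, bw=1.0):
            # x, y: 1-D tensors of model predictions for the two groups
            def k(a, b):
                return torch.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw ** 2))
            n, m = len(x), len(y)
            kxx = (k(x, x).sum() - n) / (n * (n - 1))  # drop diagonal (unbiased)
            kyy = (k(y, y).sum() - m) / (m * (m - 1))
            return kxx + kyy - 2 * k(x, y).mean()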
    A Hybrid Spatial-temporal Deep Learning Architecture for Lane Detection. (arXiv:2110.04079v4 [cs.CV] UPDATED)
    Accurate and reliable lane detection is vital for the safe performance of lane-keeping assistance and lane departure warning systems. However, under certain challenging circumstances, it is difficult to get satisfactory performance in accurately detecting lanes from a single image, as is mostly done in the current literature. Since lane markings are continuous lines, lanes that are difficult to detect accurately in the current image can potentially be better deduced if information from previous frames is incorporated. This study proposes a novel hybrid spatial-temporal (ST) sequence-to-one deep learning architecture. This architecture makes full use of the ST information in multiple continuous image frames to detect the lane markings in the very last frame. Specifically, the hybrid model integrates the following aspects: (a) a single-image feature extraction module equipped with a spatial convolutional neural network; (b) an ST feature integration module constructed from an ST recurrent neural network; (c) an encoder-decoder structure, which makes this image segmentation problem work in an end-to-end supervised learning format. Extensive experiments reveal that the proposed model architecture can effectively handle challenging driving scenes and outperforms available state-of-the-art methods.
    Pre-Trained Language Models for Interactive Decision-Making. (arXiv:2202.01771v3 [cs.LG] UPDATED)
    Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. We then examine how our framework may be used in environments without pre-collected expert data. To do this, we integrate an active data gathering procedure into pre-trained LMs. The agent iteratively learns by interacting with the environment, relabeling the language goal of past 'failed' experiences, and updating the policy in a self-supervised loop. The active data gathering procedure also enables effective combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and favorable weight initialization are both important for generalization. Surprisingly, however, the format of the policy inputs encoding (e.g. as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans.
    When does return-conditioned supervised learning work for offline reinforcement learning?. (arXiv:2206.01079v1 [cs.LG])
    Several recent works have proposed a class of algorithms for the offline reinforcement learning (RL) problem that we will refer to as return-conditioned supervised learning (RCSL). RCSL algorithms learn the distribution of actions conditioned on both the state and the return of the trajectory. Then they define a policy by conditioning on achieving high return. In this paper, we provide a rigorous study of the capabilities and limitations of RCSL, something which is crucially missing in previous work. We find that RCSL returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms. We provide specific examples of MDPs and datasets that illustrate the necessity of these assumptions and the limits of RCSL. Finally, we present empirical evidence that these limitations will also cause issues in practice by providing illustrative experiments in simple point-mass environments and on datasets from the D4RL benchmark.
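    A schematic of the RCSL recipe being analyzed (a sketch; the architecture and names are illustrative): fit the action distribution conditioned on state and return-to-go by supervised learning on offline trajectories, then act by conditioning on a high target return.

        import torch
        import torch.nn as nn

        class RCSLPolicy(nn.Module):
            def __init__(self, s_dim, n_actions, hidden=64):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(s_dim + 1, hidden), nn.ReLU(),
                    nn.Linear(hidden, n_actions))

            def forward(self, state, ret):
                # condition on both the state and the trajectory return
                return self.net(torch.cat([state, ret[:, None]], dim=-1))

        # Train: cross-entropy on (state, return-to-go) -> action, from offline data.
        # Act:   policy(state, target_return) with a high target return.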
    Sentiment Analysis and Effect of COVID-19 Pandemic using College SubReddit Data. (arXiv:2112.04351v2 [cs.CL] UPDATED)
    Background: The COVID-19 pandemic has affected our society and human well-being in various ways. In this study, we investigate how the pandemic has influenced people's emotions and psychological states compared to a pre-pandemic period using real-world data from social media. Method: We collected Reddit social media data from 2019 (pre-pandemic) and 2020 (pandemic) from the subreddits communities associated with eight universities. We applied the pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) to learn text embedding from the Reddit messages, and leveraged the relational information among posted messages to train a graph attention network (GAT) for sentiment classification. Finally, we applied model stacking to combine the prediction probabilities from RoBERTa and GAT to yield the final classification on sentiment. With the model-predicted sentiment labels on the collected data, we used a generalized linear mixed-effects model to estimate the effects of pandemic and in-person teaching during the pandemic on sentiment. Results: The results suggest that the odds of negative sentiments in 2020 (pandemic) were 25.7% higher than the odds in 2019 (pre-pandemic) with a $p$-value $<0.001$; and the odds of negative sentiments associated with in-person learning were 48.3% higher than with remote learning in 2020 with a $p$-value of 0.029. Conclusions: Our study results are consistent with the findings in the literature on the negative impacts of the pandemic on people's emotions and psychological states. Our study contributes to the growing real-world evidence on the various negative impacts of the pandemic on our society; it also provides a good example of using both ML techniques and statistical modeling and inference to make better use of real-world data.
    AQuaMoHo: Localized Low-Cost Outdoor Air Quality Sensing over a Thermo-Hygrometer. (arXiv:2204.11484v2 [cs.CY] UPDATED)
    Efficient air quality sensing serves as one of the essential services provided in any recent smart city. Such sensing is mostly facilitated by sparsely deployed Air Quality Monitoring Stations (AQMSs), which are difficult to install and maintain, so spatial variation heavily impacts air quality monitoring at locations far from these pre-deployed public infrastructures. To mitigate this, in this paper we propose a framework named AQuaMoHo that can annotate data obtained from a low-cost thermo-hygrometer (as the sole physical sensing device) with AQI labels, with the help of additional publicly crawled Spatio-temporal information of that locality. At its core, AQuaMoHo exploits the temporal patterns from a set of readily available spatial features using an LSTM-based model and further enhances the overall quality of the annotation using temporal attention. From a thorough study of two different cities, we observe that AQuaMoHo can significantly help annotate the air quality data on a personal scale.
    Evaluating Modules in Graph Contrastive Learning. (arXiv:2106.08171v2 [cs.LG] UPDATED)
The recent emergence of contrastive learning has facilitated its application to graph representation learning (GRL), introducing graph contrastive learning (GCL) into the literature. These methods contrast semantically similar and dissimilar sample pairs to encode the semantics into node or graph embeddings. However, most existing works have only performed \textbf{model-level} evaluation and have not explored the combination space of modules for more comprehensive and systematic studies. For effective \textbf{module-level} evaluation, we propose a framework that decomposes GCL models into four modules: (1) a \textbf{sampler} to generate anchor, positive and negative data samples (nodes or graphs); (2) an \textbf{encoder} and a \textbf{readout} function to get sample embeddings; (3) a \textbf{discriminator} to score each sample pair (anchor-positive and anchor-negative); and (4) an \textbf{estimator} to define the loss function. Based on this framework, we conduct controlled experiments over a wide range of architectural designs and hyperparameter settings on node and graph classification tasks. Specifically, we quantify the impact of a single module, investigate the interaction between modules, and compare the overall performance with current model architectures. Our key findings include a set of module-level guidelines for GCL, e.g., simple samplers from LINE and DeepWalk are strong and robust; an MLP encoder associated with Sum readout can achieve competitive performance on graph classification. Finally, we release our implementations and results as OpenGCL, a modularized toolkit that allows convenient reproduction, standard model and module evaluation, and easy extension. OpenGCL is available at \url{https://github.com/thunlp/OpenGCL}.
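The four-module decomposition is easiest to see as code. Below is a schematic skeleton under assumed interfaces; it is not OpenGCL's actual API.

```python
# Skeleton of the four-module GCL decomposition described above; the
# callables are placeholders for concrete implementations.
from dataclasses import dataclass
from typing import Callable

@dataclass
class GCLModel:
    sampler: Callable        # yields (anchor, positive, negative) samples
    encoder: Callable        # maps nodes/graphs to embeddings
    readout: Callable        # pools node embeddings into a graph embedding
    discriminator: Callable  # scores an (anchor, sample) embedding pair
    estimator: Callable      # turns positive/negative scores into a loss

    def loss(self, graph):
        anchor, pos, neg = self.sampler(graph)
        z_a = self.readout(self.encoder(anchor))
        z_p = self.readout(self.encoder(pos))
        z_n = self.readout(self.encoder(neg))
        return self.estimator(self.discriminator(z_a, z_p),
                              self.discriminator(z_a, z_n))
```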
    Efficient $\Phi$-Regret Minimization in Extensive-Form Games via Online Mirror Descent. (arXiv:2205.15294v2 [cs.LG] UPDATED)
A conceptually appealing approach for learning Extensive-Form Games (EFGs) is to convert them to Normal-Form Games (NFGs). This approach enables us to directly translate state-of-the-art techniques and analyses in NFGs to learning EFGs, but typically suffers from computational intractability due to the exponential blow-up of the game size introduced by the conversion. In this paper, we address this problem in natural and important setups for the \emph{$\Phi$-Hedge} algorithm -- a generic algorithm capable of learning a large class of equilibria for NFGs. We show that $\Phi$-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs. We prove that, in those settings, the \emph{$\Phi$-Hedge} algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for EFGs with suitable dilated regularizers, and run in polynomial time. This new connection further allows us to design and analyze a new class of OMD algorithms based on modifying its log-partition function. In particular, we design an improved algorithm with balancing techniques that achieves a sharp $\widetilde{\mathcal{O}}(\sqrt{XAT})$ EFCE-regret under bandit feedback in an EFG with $X$ information sets, $A$ actions, and $T$ episodes. To the best of our knowledge, this is the first such rate, and it matches the information-theoretic lower bound.
    Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning. (arXiv:2206.01162v1 [cs.LG])
In this work, we propose a novel ${\bf K}$ernelized ${\bf S}$tein Discrepancy-based Posterior Sampling for ${\bf RL}$ algorithm (named $\texttt{KSRL}$) which extends model-based RL based upon posterior sampling (PSRL) in several ways: we (i) relax the need for any smoothness or Gaussian assumptions, allowing for complex mixture models; (ii) ensure it is applicable to large-scale training by incorporating a compression step such that the posterior consists of a \emph{Bayesian coreset} of only statistically significant past state-action pairs; and (iii) develop a novel regret analysis of PSRL based upon integral probability metrics, which, under a smoothness condition on the constructed posterior, can be evaluated in closed form as the kernelized Stein discrepancy (KSD). Consequently, we are able to improve the $\mathcal{O}(H^{3/2}d\sqrt{T})$ {regret} of PSRL to $\mathcal{O}(H^{3/2}\sqrt{T})$, where $d$ is the input dimension, $H$ is the episode length, and $T$ is the total number of episodes experienced, alleviating a linear dependence on $d$. Moreover, we theoretically establish a trade-off between the regret rate and the posterior representational complexity by introducing a compression budget parameter $\epsilon$ based on KSD, and establish a lower bound on the required complexity for consistency of the model. Experimentally, we observe that this approach is competitive with several state-of-the-art RL methodologies, with substantive improvements in computation time, achieving up to a $50\%$ reduction in wall-clock time in some continuous control environments.
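To make the KSD concrete: for a kernel $k$ and score function $s(x) = \nabla \log p(x)$, the squared KSD is the expectation of a Stein kernel $u_p(x, y)$ over pairs of samples. The sketch below is the textbook estimator with an RBF kernel, shown only to ground the quantity named in the abstract; it is not KSRL's posterior-coreset construction.

```python
# Generic kernelized Stein discrepancy (KSD) with an RBF kernel,
# k(x, y) = exp(-||x - y||^2 / (2 h^2)); a V-statistic estimator.
import numpy as np

def ksd_rbf(samples, score_fn, h=1.0):
    """samples: (n, d) array; score_fn(x) returns grad log p(x) per row."""
    n, d = samples.shape
    s = score_fn(samples)                               # (n, d) scores
    diff = samples[:, None, :] - samples[None, :, :]    # (n, n, d)
    r2 = np.sum(diff ** 2, axis=-1)                     # squared distances
    k = np.exp(-r2 / (2 * h ** 2))                      # kernel matrix
    # Stein kernel u_p(x, y), assembled term by term:
    term1 = (s @ s.T) * k                               # s(x)^T s(y) k
    term2 = np.einsum('id,ijd->ij', s, diff / h ** 2) * k    # s(x)^T grad_y k
    term3 = np.einsum('jd,ijd->ij', s, -diff / h ** 2) * k   # s(y)^T grad_x k
    term4 = k * (d / h ** 2 - r2 / h ** 4)              # tr(grad_x grad_y k)
    u = term1 + term2 + term3 + term4
    return np.sqrt(np.maximum(u.mean(), 0.0))
```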
    DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis. (arXiv:2206.01062v1 [cs.CV])
Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack layout variability, since they are sourced only from scientific article repositories such as PubMed and arXiv. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied to more challenging and diverse layouts. In this paper, we present \textit{DocLayNet}, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80,863 manually annotated pages from diverse data sources representing a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10\% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
    Graph Signal Restoration Using Nested Deep Algorithm Unrolling. (arXiv:2106.15910v3 [eess.SP] UPDATED)
Graph signal processing is ubiquitous in many applications such as sensor, social, transportation and brain networks, point cloud processing, and graph neural networks. Often, graph signals are corrupted in the sensing process, thus requiring restoration. In this paper, we propose two graph signal restoration methods based on deep algorithm unrolling (DAU). First, we present a graph signal denoiser obtained by unrolling iterations of the alternating direction method of multipliers (ADMM). We then suggest a general restoration method for linear degradation by unrolling iterations of Plug-and-Play ADMM (PnP-ADMM). In the second approach, the unrolled ADMM-based denoiser is incorporated as a submodule, leading to a nested DAU structure. The parameters in the proposed denoising/restoration methods are trainable in an end-to-end manner. Our approach is interpretable and keeps the number of parameters small, since we only tune graph-independent regularization parameters. We overcome two main challenges of existing graph signal restoration methods: 1) the limited performance of convex optimization algorithms, whose parameters are fixed and often determined manually; and 2) the large number of parameters of graph neural networks, which makes training difficult. Several experiments for graph signal denoising and interpolation are performed on synthetic and real-world data. The proposed methods show performance improvements over several existing techniques in terms of root mean squared error in both tasks.
    Locating and Editing Factual Associations in GPT. (arXiv:2202.05262v3 [cs.CL] UPDATED)
We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or the other. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at https://rome.baulab.info/
Masked Bayesian Neural Networks: Computation and Optimality. (arXiv:2206.00853v1 [stat.ML])
As data size and computing power increase, the architectures of deep neural networks (DNNs) have become increasingly large and complex, and thus there is a growing need to simplify such DNNs. In this paper, we propose a novel sparse Bayesian neural network (BNN) which searches for a well-performing DNN of appropriate complexity. We employ masking variables at each node, which can turn off some nodes according to the posterior distribution, to yield a nodewise sparse DNN. We devise a prior distribution such that the posterior distribution has theoretical optimalities (i.e., minimax optimality and adaptiveness), and develop an efficient MCMC algorithm. By analyzing several benchmark datasets, we illustrate that the proposed BNN performs well compared to other existing methods, in the sense that it discovers well-condensed DNN architectures with prediction accuracy and uncertainty quantification similar to those of large DNNs.
    A Barrier Certificate-based Simplex Architecture with Application to Microgrids. (arXiv:2202.09710v2 [eess.SY] UPDATED)
We present Barrier Certificate-based Simplex (BC-Simplex), a new, provably correct design for runtime assurance of continuous dynamical systems. BC-Simplex is centered around the Simplex Control Architecture, which consists of a high-performance advanced controller that is not guaranteed to maintain safety of the plant, a verified-safe baseline controller, and a decision module that switches control of the plant between the two controllers to ensure safety without sacrificing performance. In BC-Simplex, barrier certificates are used to prove that the baseline controller ensures safety. Furthermore, BC-Simplex features a new automated method for deriving, from the barrier certificate, the conditions for switching between the controllers. Our method is based on the Taylor expansion of the barrier certificate and yields computationally inexpensive switching conditions. We consider a significant application of BC-Simplex to a microgrid featuring an advanced controller in the form of a neural network trained using reinforcement learning. The microgrid is modeled in RTDS, an industry-standard high-fidelity, real-time power systems simulator. Our results demonstrate that BC-Simplex can automatically derive switching conditions for complex systems, that the switching conditions are not overly conservative, and that BC-Simplex ensures safety even in the presence of adversarial attacks on the neural controller.
    Batch Normalization Is Blind to the First and Second Derivatives of the Loss. (arXiv:2205.15146v2 [cs.LG] UPDATED)
In this paper, we characterize the effects of the batch normalization (BN) operation on the back-propagation of the first and second derivatives of the loss. Using a Taylor series expansion of the loss function, we prove that the BN operation blocks the influence of the first-order term and most of the influence of the second-order term of the loss. We also find that this problem is caused by the standardization phase of the BN operation. Experimental results verify our theoretical conclusions, and we find that the BN operation significantly affects feature representations in specific tasks, where losses of different samples share similar analytic formulas.
    ZOOpt: Toolbox for Derivative-Free Optimization. (arXiv:1801.00329v3 [cs.LG] UPDATED)
Recent advances in derivative-free optimization allow efficient approximation of globally optimal solutions of sophisticated functions, such as functions with many local optima and non-differentiable or discontinuous functions. This article describes the ZOOpt (Zeroth Order Optimization) toolbox, which provides efficient derivative-free solvers and is designed to be easy to use. ZOOpt provides single-machine parallel optimization on the basis of a Python core, and multi-machine distributed optimization for time-consuming tasks by integrating with the Ray framework, a popular platform for building distributed applications. ZOOpt particularly focuses on optimization problems in machine learning, addressing high-dimensional and noisy problems such as hyper-parameter tuning and direct policy search. The toolbox is maintained as a ready-to-use tool for real-world machine learning tasks.
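A minimal usage sketch, following the toolbox's documented quick-start pattern (minimize a function over a continuous box with a fixed evaluation budget); exact signatures may differ between ZOOpt versions.

```python
# Derivative-free minimization of a simple sphere function with ZOOpt.
# Based on the documented quick-start interface; treat signatures as
# version-dependent.
from zoopt import Dimension, Objective, Parameter, Opt

def sphere(solution):
    # ZOOpt only needs function values, never gradients.
    return sum(v ** 2 for v in solution.get_x())

# 10-dimensional continuous search box [-1, 1]^10 (True = continuous dim).
dim = Dimension(10, [[-1, 1]] * 10, [True] * 10)
sol = Opt.min(Objective(sphere, dim), Parameter(budget=1000))
print(sol.get_x(), sol.get_value())
```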
    Intrinsically-Motivated Reinforcement Learning: A Brief Introduction. (arXiv:2203.02298v2 [cs.LG] UPDATED)
Reinforcement learning (RL) is one of the three basic paradigms of machine learning. It has demonstrated impressive performance in many complex tasks like Go and StarCraft, and is increasingly applied in smart manufacturing and autonomous driving. However, RL consistently suffers from the exploration-exploitation dilemma. In this paper, we investigate the problem of improving exploration in RL and introduce intrinsically-motivated RL. In sharp contrast to classic exploration strategies, intrinsically-motivated RL utilizes intrinsic learning motivation to provide sustainable exploration incentives. We carefully classify the existing intrinsic reward methods and analyze their practical drawbacks. Moreover, we propose a new intrinsic reward method via R\'enyi state entropy maximization, which overcomes the drawbacks of the preceding methods and provides powerful exploration incentives. Finally, extensive simulations demonstrate that the proposed module achieves superior performance with higher efficiency and robustness.
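A simplified nearest-neighbor exploration bonus conveys the flavor of entropy-maximizing intrinsic rewards: states far from previously visited ones earn a larger bonus. The paper's R\'enyi-entropy estimator is more refined, so treat this as an illustrative stand-in.

```python
# Simplified k-nearest-neighbor exploration bonus in the spirit of
# entropy-maximizing intrinsic rewards; an illustrative stand-in for the
# paper's Renyi-entropy estimator.
import numpy as np

def knn_intrinsic_reward(state, buffer, k=5, alpha=0.5):
    """state: (d,) current state; buffer: (n, d) recently visited states.
    A larger distance to the (approximately) k-th neighbor indicates a
    less-visited region and thus a bigger bonus; the exponent alpha
    loosely plays the role of the Renyi order."""
    dists = np.linalg.norm(buffer - state, axis=1)
    kth = np.sort(dists)[min(k, len(dists) - 1)]
    return kth ** alpha
```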
    Learning Disentangled Representations for Counterfactual Regression via Mutual Information Minimization. (arXiv:2206.01022v1 [cs.LG])
Learning individual-level treatment effects is a fundamental problem in causal inference and has received increasing attention in many areas, especially the user growth area, which concerns many internet companies. Recently, disentangled representation learning methods that decompose covariates into three latent factors, including instrumental, confounding and adjustment factors, have witnessed great success in treatment effect estimation. However, it remains an open problem how to learn the underlying disentangled factors precisely. Specifically, previous methods fail to obtain independent disentangled factors, which is a necessary condition for identifying treatment effects. In this paper, we propose Disentangled Representations for Counterfactual Regression via Mutual Information Minimization (MIM-DRCFR), which uses a multi-task learning framework to share information when learning the latent factors and incorporates a mutual information (MI) minimization criterion to ensure the independence of these factors. Extensive experiments on public benchmarks and real-world industrial user growth datasets demonstrate that our method performs much better than state-of-the-art methods.
    Super-resolving 2D stress tensor field conserving equilibrium constraints using physics informed U-Net. (arXiv:2206.01122v1 [cs.LG])
In finite element analysis, using a large number of grid elements is important to obtain accurate results, but is resource-consuming. Aiming at real-time simulation and optimization, it is desirable to obtain fine-grid analysis results with limited resources. This paper proposes a super-resolution method that predicts a high-resolution stress tensor field from low-resolution contour plots, utilizing a U-Net-based neural network called PI-UNet. In addition, the proposed model minimizes the residual of the equilibrium constraints so that it outputs a physically reasonable solution. The proposed network is trained with FEM results of simple shapes, and is validated with a complicated realistic shape to evaluate its generalization capability. Although ESRGAN is a standard model for image super-resolution, the proposed U-Net-based model outperforms the ESRGAN model in the stress tensor prediction task.
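The physics term can be sketched as a finite-difference residual of the 2D equilibrium equations $\partial\sigma_{xx}/\partial x + \partial\sigma_{xy}/\partial y = 0$ and $\partial\sigma_{xy}/\partial x + \partial\sigma_{yy}/\partial y = 0$, added to a standard reconstruction loss. Grid spacing, weighting, and tensor layout below are assumptions; the paper's exact loss may differ.

```python
# Hedged sketch of a physics-informed loss: finite-difference residual of
# the 2D equilibrium equations added to an MSE reconstruction term.
import torch

def equilibrium_residual(sxx, syy, sxy, dx=1.0, dy=1.0):
    """Each stress component is a (B, H, W) tensor on a regular grid."""
    d_sxx_dx = (sxx[:, :, 1:] - sxx[:, :, :-1]) / dx
    d_sxy_dy = (sxy[:, 1:, :] - sxy[:, :-1, :]) / dy
    d_sxy_dx = (sxy[:, :, 1:] - sxy[:, :, :-1]) / dx
    d_syy_dy = (syy[:, 1:, :] - syy[:, :-1, :]) / dy
    # Crop to a common interior region before combining the derivatives.
    r1 = d_sxx_dx[:, :-1, :] + d_sxy_dy[:, :, :-1]   # x-equilibrium
    r2 = d_sxy_dx[:, :-1, :] + d_syy_dy[:, :, :-1]   # y-equilibrium
    return (r1 ** 2).mean() + (r2 ** 2).mean()

def pi_loss(pred, target, lam=0.1):
    # pred/target: (B, 3, H, W) stress fields ordered (sxx, syy, sxy);
    # the ordering and the weight lam are illustrative assumptions.
    sxx, syy, sxy = pred[:, 0], pred[:, 1], pred[:, 2]
    data_term = torch.nn.functional.mse_loss(pred, target)
    return data_term + lam * equilibrium_residual(sxx, syy, sxy)
```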
    Meta-SysId: A Meta-Learning Approach for Simultaneous Identification and Prediction. (arXiv:2206.00694v1 [cs.LG])
    In this paper, we propose Meta-SysId, a meta-learning approach to model sets of systems that have behavior governed by common but unknown laws and that differentiate themselves by their context. Inspired by classical modeling-and-identification approaches, Meta-SysId learns to represent the common law through shared parameters and relies on online optimization to compute system-specific context. Compared to optimization-based meta-learning methods, the separation between class parameters and context variables reduces the computational burden while allowing batch computations and a simple training scheme. We test Meta-SysId on polynomial regression, time-series prediction, model-based control, and real-world traffic prediction domains, empirically finding it outperforms or is competitive with meta-learning baselines.
    Composing a surrogate observation operator for sequential data assimilation. (arXiv:2201.12514v3 [cs.LG] UPDATED)
    In data assimilation, state estimation is not straightforward when the observation operator is unknown. This study proposes a method for composing a surrogate operator when the true operator is unknown. A neural network is used to improve the surrogate model iteratively to decrease the difference between the observations and the results of the surrogate model. A twin experiment suggests that the proposed method outperforms approaches that tentatively use a specific operator throughout the data assimilation process.
    Dynamic MRI using Learned Transform-based Deep Tensor Low-Rank Network (DTLR-Net). (arXiv:2206.00850v1 [eess.IV])
While low-rank matrix priors have been exploited in dynamic MR image reconstruction with satisfying performance, low-rank tensor models have recently emerged as powerful alternative representations for three-dimensional dynamic MR datasets. In this paper, we introduce a model-based deep learning network that learns the tensor low-rank prior of cardiac dynamic MR images. Instead of representing the dynamic dataset as a low-rank tensor directly, we propose a learned transformation operator to exploit the tensor low-rank property in a transform domain. In particular, by generalizing the t-SVD tensor decomposition into a unitary transformed t-SVD, we define a transformed tensor nuclear norm (TTNN) to enforce tensor low-rankness. The dynamic MRI reconstruction problem is thus formulated as a TTNN-regularized optimization problem. An ADMM-based iterative algorithm that minimizes the cost is unrolled into a deep network, where the transform is learned using convolutional neural networks (CNNs) to promote reconstruction quality in the feature domain. Experimental results on cardiac cine MRI reconstruction demonstrate that the proposed framework provides improved recovery results compared with state-of-the-art algorithms.
    Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions. (arXiv:2206.01029v1 [math.OC])
We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular, performing as well as full-batch momentum with only a fraction of the batch size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single-batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance.
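For reference, the algorithm under analysis is ordinary mini-batch heavy-ball momentum on least squares; a minimal implementation follows, with placeholder hyperparameters rather than the paper's spectrum-dependent choices.

```python
# Mini-batch SGD with heavy-ball momentum on least squares min_x ||Ax - b||^2;
# lr, beta, and batch are placeholders, not the paper's tuned values.
import numpy as np

def sgd_momentum_ls(A, b, lr=0.01, beta=0.9, batch=64, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x, v = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch  # mini-batch gradient
        v = beta * v - lr * grad                         # momentum buffer
        x = x + v
    return x
```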
    Practical Adversarial Multivalid Conformal Prediction. (arXiv:2206.01067v1 [cs.LG])
    We give a simple, generic conformal prediction method for sequential prediction that achieves target empirical coverage guarantees against adversarially chosen data. It is computationally lightweight -- comparable to split conformal prediction -- but does not require having a held-out validation set, and so all data can be used for training models from which to derive a conformal score. It gives stronger than marginal coverage guarantees in two ways. First, it gives threshold calibrated prediction sets that have correct empirical coverage even conditional on the threshold used to form the prediction set from the conformal score. Second, the user can specify an arbitrary collection of subsets of the feature space -- possibly intersecting -- and the coverage guarantees also hold conditional on membership in each of these subsets. We call our algorithm MVP, short for MultiValid Prediction. We give both theory and an extensive set of empirical evaluations.
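MVP's internals are not spelled out in the abstract, but the split conformal baseline it is compared against computationally is simple to state: calibrate a score threshold on held-out data so that prediction sets achieve $(1-\alpha)$ marginal coverage. The sketch below is that classical baseline only, for contrast; MVP's threshold-calibrated and group-conditional guarantees go beyond it.

```python
# Classical split conformal prediction (the computational baseline named
# above), not the MVP algorithm itself.
import numpy as np

def split_conformal_threshold(cal_scores, alpha=0.1):
    """cal_scores: nonconformity scores on a held-out calibration set."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n   # finite-sample correction
    return np.quantile(cal_scores, min(q, 1.0))

def prediction_set(candidate_scores, threshold):
    """Include every candidate label whose score is below the threshold."""
    return [i for i, s in enumerate(candidate_scores) if s <= threshold]
```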
    Deep Optimal Transport for Domain Adaptation on SPD Manifolds. (arXiv:2201.05745v2 [cs.LG] UPDATED)
The domain adaptation (DA) problem on symmetric positive definite (SPD) manifolds has raised interest in the machine learning community because of the growing potential of SPD-matrix representations across many cross-domain scenarios. However, due to the different underlying space, previous solutions to the DA problem cannot be applied to this new scenario directly. This study addresses a specific DA problem: the marginal and conditional distributions differ between the source and target domains on SPD manifolds. We formalize this problem from an optimal transport perspective and derive an optimal transport framework on SPD manifolds for supervised learning. In addition, we propose a computational scheme under the optimal transport framework, Deep Optimal Transport (DOT), for general computation, using the generalized joint distribution adaptation approach and existing Riemannian network architectures on SPD manifolds. DOT is applied to a real-world scenario, serving as an EEG-BCI classifier for cross-session motor-imagery classification from the calibration phase to the feedback phase. In the experiments, DOT exhibits a marked improvement in average accuracy in two highly non-stationary cross-session EEG-BCI classification scenarios, indicating the validity of the proposed methodology.
    Weakly Supervised Representation Learning with Sparse Perturbations. (arXiv:2206.01101v1 [cs.LG])
    The theory of representation learning aims to build methods that provably invert the data generating process with minimal domain knowledge or any source of supervision. Most prior approaches require strong distributional assumptions on the latent variables and weak supervision (auxiliary information such as timestamps) to provide provable identification guarantees. In this work, we show that if one has weak supervision from observations generated by sparse perturbations of the latent variables--e.g. images in a reinforcement learning environment where actions move individual sprites--identification is achievable under unknown continuous latent distributions. We show that if the perturbations are applied only on mutually exclusive blocks of latents, we identify the latents up to those blocks. We also show that if these perturbation blocks overlap, we identify latents up to the smallest blocks shared across perturbations. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling. We propose a natural estimation procedure based on this theory and illustrate it on low-dimensional synthetic and image-based experiments.  ( 2 min )
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v3 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    Counterfactual Phenotyping with Censored Time-to-Events. (arXiv:2202.11089v2 [cs.LG] UPDATED)
Estimation of treatment efficacy of real-world clinical interventions involves working with continuous outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Counterfactual reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of several latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and helps determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response on multiple large randomized clinical trials originally conducted to assess appropriate treatments to reduce cardiovascular risk.
    Score-Based Generative Models Detect Manifolds. (arXiv:2206.01018v1 [stat.ML])
    Score-based generative models (SGMs) need to approximate the scores $\nabla \log p_t$ of the intermediate distributions as well as the final distribution $p_T$ of the forward process. The theoretical underpinnings of the effects of these approximations are still lacking. We find precise conditions under which SGMs are able to produce samples from an underlying (low-dimensional) data manifold $\mathcal{M}$. This assures us that SGMs are able to generate the "right kind of samples". For example, taking $\mathcal{M}$ to be the subset of images of faces, we find conditions under which the SGM robustly produces an image of a face, even though the relative frequencies of these images might not accurately represent the true data generating distribution. Moreover, this analysis is a first step towards understanding the generalization properties of SGMs: Taking $\mathcal{M}$ to be the set of all training samples, our results provide a precise description of when the SGM memorizes its training data.
    Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise. (arXiv:2206.01095v1 [math.OC])
Stochastic first-order methods such as Stochastic Extragradient (SEG) or Stochastic Gradient Descent-Ascent (SGDA) for solving smooth minimax problems and, more generally, variational inequality problems (VIP) have been gaining a lot of attention in recent years due to the growing popularity of adversarial formulations in machine learning. However, while high-probability convergence bounds are known to reflect the actual behavior of stochastic methods more accurately, most convergence results are provided in expectation. Moreover, the only known high-probability complexity results have been derived under restrictive sub-Gaussian (light-tailed) noise and bounded-domain assumptions [Juditsky et al., 2011]. In this work, we prove the first high-probability complexity results with logarithmic dependence on the confidence level for stochastic methods for solving monotone and structured non-monotone VIPs with non-sub-Gaussian (heavy-tailed) noise and unbounded domains. In the monotone case, our results match the best-known ones in the light-tailed case [Juditsky et al., 2011], and are novel for structured non-monotone problems such as negative comonotone, quasi-strongly monotone, and/or star-cocoercive ones. We achieve these results by studying SEG and SGDA with clipping. In addition, we numerically validate that the gradient noise of many practical GAN formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
    Auto-Lambda: Disentangling Dynamic Task Relationships. (arXiv:2202.03091v2 [cs.LG] UPDATED)
    Understanding the structure of multiple related tasks allows for multi-task learning to improve the generalisation ability of one or all of them. However, it usually requires training each pairwise combination of tasks together in order to capture task relationships, at an extremely high computational cost. In this work, we learn task relationships via an automated weighting framework, named Auto-Lambda. Unlike previous methods where task relationships are assumed to be fixed, Auto-Lambda is a gradient-based meta learning framework which explores continuous, dynamic task relationships via task-specific weightings, and can optimise any choice of combination of tasks through the formulation of a meta-loss; where the validation loss automatically influences task weightings throughout training. We apply the proposed framework to both multi-task and auxiliary learning problems in computer vision and robotics, and show that Auto-Lambda achieves state-of-the-art performance, even when compared to optimisation strategies designed specifically for each problem and data domain. Finally, we observe that Auto-Lambda can discover interesting learning behaviors, leading to new insights in multi-task learning. Code is available at https://github.com/lorenmt/auto-lambda.
    Predictive Multiplicity in Probabilistic Classification. (arXiv:2206.01131v1 [cs.LG])
    For a prediction task, there may exist multiple models that perform almost equally well. This multiplicity complicates how we typically develop and deploy machine learning models. We study how multiplicity affects predictions -- i.e., predictive multiplicity -- in probabilistic classification. We introduce new measures for this setting and present optimization-based methods to compute these measures for convex empirical risk minimization problems like logistic regression. We apply our methodology to gain insight into why predictive multiplicity arises. We study the incidence and prevalence of predictive multiplicity in real-world risk assessment tasks. Our results emphasize the need to report multiplicity more widely.
    Progressive Purification for Instance-Dependent Partial Label Learning. (arXiv:2206.00830v1 [cs.LG])
Partial label learning (PLL) aims to train multi-class classifiers from instances with partial labels (PLs) -- a PL for an instance is a set of candidate labels, among which a fixed but unknown candidate is the true label. In the last few years, the instance-independent generation process of PLs has been extensively studied, on the basis of which many practical and theoretical advances have been made in PLL, whereas relatively less attention has been paid to the practical setting of instance-dependent PLs, where the PL depends not only on the true label but also on the instance itself. In this paper, we propose a theoretically grounded and practically effective approach called PrOgressive Purification (POP) for instance-dependent PLL: in each epoch, POP updates the learning model while purifying each PL for the next epoch of model training by progressively moving out false candidate labels. Theoretically, we prove that POP enlarges, appropriately quickly, the region where the model is reliable, and eventually approximates the Bayes optimal classifier under mild assumptions; technically, POP is flexible with arbitrary losses and compatible with deep networks, so that previous advanced PLL losses can be embedded in it and performance is often significantly improved.
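An illustrative purification step in the spirit of POP: candidate labels whose (renormalized) model probability falls below a threshold are moved out of the candidate set, with the threshold tightened across epochs. The threshold mechanics below are assumptions, not the paper's exact procedure.

```python
# One purification step for instance-dependent PLL: drop candidate labels
# the current model deems unlikely. Illustrative, not POP's exact rule.
import numpy as np

def purify(candidate_masks, probs, tau):
    """candidate_masks: (n, C) boolean candidate-label sets;
    probs: (n, C) current model probabilities; tau: purification threshold,
    tightened (increased) as training progresses."""
    # Renormalize probabilities over the remaining candidates only.
    p = probs * candidate_masks
    p = p / np.clip(p.sum(axis=1, keepdims=True), 1e-12, None)
    keep = p >= tau
    # Never empty a candidate set: always retain the top candidate.
    top = np.zeros_like(candidate_masks)
    top[np.arange(len(p)), p.argmax(axis=1)] = True
    return candidate_masks & (keep | top)
```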
    Uniqueness and Complexity of Inverse MDP Models. (arXiv:2206.01192v1 [cs.LG])
What is the action sequence $a\,a'\,a''$ that was likely responsible for reaching state $s'''$ (from state $s$) in 3 steps? Addressing such questions is important in causal reasoning and in reinforcement learning. Inverse "MDP" models $p(a a' a'' \mid s s''')$ can be used to answer them. In the traditional "forward" view, the transition "matrix" $p(s' \mid s a)$ and policy $\pi(a \mid s)$ uniquely determine "everything": the whole dynamics $p(a s' a' s'' a'' \ldots \mid s)$, and with it, the action-conditional state process $p(s' s'' \ldots \mid s a a' a'')$, the multi-step inverse models $p(a a' a'' \ldots \mid s s^i)$, etc. If the latter is our primary concern, a natural question, analogous to the forward case, is to what extent the 1-step inverse model $p(a \mid s s')$ plus the policy $\pi(a \mid s)$ determine the multi-step inverse models, or even the whole dynamics. In other words, can forward models be inferred from inverse models, or even be side-stepped? This work addresses this question and variations thereof, and also asks whether there are efficient decision/inference algorithms for this.
    APP: Anytime Progressive Pruning. (arXiv:2204.01640v2 [cs.LG] UPDATED)
With the latest advances in deep learning, there has been a lot of focus on the online learning paradigm due to its relevance in practical settings. Although many methods have been investigated for optimal learning settings in scenarios where the data stream is continuous over time, sparse network training in such settings has often been overlooked. In this paper, we explore the problem of training a neural network with a target sparsity in a particular case of online learning: the anytime learning at macroscale paradigm (ALMA). We propose a novel way of progressive pruning, referred to as \textit{Anytime Progressive Pruning} (APP); the proposed approach significantly outperforms the baseline dense and Anytime OSP models across multiple architectures and datasets under short, moderate, and long-sequence training. Our method, for example, shows an improvement in accuracy of $\approx 7\%$ and a reduction in the generalization gap by $\approx 22\%$, while being $\approx 1/3$ the size of the dense baseline model, in few-shot restricted ImageNet training. We further observe interesting nonmonotonic transitions in the generalization gap in ALMA with a high number of megabatches. The code and experiment dashboards can be accessed at \url{https://github.com/landskape-ai/Progressive-Pruning} and \url{https://wandb.ai/landskape/APP}, respectively.
    Machine Learning-based Lung and Colon Cancer Detection using Deep Feature Extraction and Ensemble Learning. (arXiv:2206.01088v1 [eess.IV])
Cancer is a fatal disease caused by a combination of genetic diseases and a variety of biochemical abnormalities. Lung and colon cancer have emerged as two of the leading causes of death and disability in humans. The histopathological detection of such malignancies is usually the most important component in determining the best course of action. Early detection of the ailment on either front considerably decreases the likelihood of mortality. Machine learning and deep learning techniques can be utilized to speed up such cancer detection, allowing researchers to study a large number of patients in a much shorter amount of time and at a lower cost. In this research work, we introduce a hybrid ensemble feature extraction model to efficiently identify lung and colon cancer. It integrates deep feature extraction and ensemble learning with high-performance filtering for cancer image datasets. The model is evaluated on the histopathological (LC25000) lung and colon datasets. Our hybrid model detects lung, colon, and combined lung-and-colon cancer with accuracy rates of 99.05%, 100%, and 99.30%, respectively, significantly outperforming existing models. Thus, these models could be applied in clinics to support doctors in cancer diagnosis.
    Automated Reinforcement Learning (AutoRL): A Survey and Open Problems. (arXiv:2201.03916v2 [cs.LG] UPDATED)
The combination of Reinforcement Learning (RL) with deep learning has led to a series of impressive feats, with many believing (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems and also limits its full potential. In many other areas of machine learning, AutoML has shown that it is possible to automate such design choices, and it has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also additional challenges unique to RL that naturally produce a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey, we seek to unify the field of AutoRL: we provide a common taxonomy, discuss each area in detail, and pose open problems of interest to researchers going forward.  ( 2 min )
    Sequential Voting with Relational Box Fields for Active Object Detection. (arXiv:2110.11524v4 [cs.CV] UPDATED)
    A key component of understanding hand-object interactions is the ability to identify the active object -- the object that is being manipulated by the human hand. In order to accurately localize the active object, any method must reason using information encoded by each image pixel, such as whether it belongs to the hand, the object, or the background. To leverage each pixel as evidence to determine the bounding box of the active object, we propose a pixel-wise voting function. Our pixel-wise voting function takes an initial bounding box as input and produces an improved bounding box of the active object as output. The voting function is designed so that each pixel inside of the input bounding box votes for an improved bounding box, and the box with the majority vote is selected as the output. We call the collection of bounding boxes generated inside of the voting function, the Relational Box Field, as it characterizes a field of bounding boxes defined in relationship to the current bounding box. While our voting function is able to improve the bounding box of the active object, one round of voting is typically not enough to accurately localize the active object. Therefore, we repeatedly apply the voting function to sequentially improve the location of the bounding box. However, since it is known that repeatedly applying a one-step predictor (i.e., auto-regressive processing with our voting function) can cause a data distribution shift, we mitigate this issue using reinforcement learning (RL). We adopt standard RL to learn the voting function parameters and show that it provides a meaningful improvement over a standard supervised learning approach. We perform experiments on two large-scale datasets: 100DOH and MECCANO, improving AP50 performance by 8% and 30%, respectively, over the state of the art.  ( 3 min )
    Adaptive Local Neighborhood-based Neural Networks for MR Image Reconstruction from Undersampled Data. (arXiv:2206.00775v1 [eess.IV])
Recent medical image reconstruction techniques focus on generating high-quality medical images suitable for clinical use at the lowest possible cost and with the fewest possible adverse effects on patients. Recent works have shown significant promise for reconstructing MR images from sparsely sampled k-space data using deep learning. In this work, we propose a technique that rapidly estimates deep neural networks directly at reconstruction time by fitting them on small adaptively estimated neighborhoods of a training set. In brief, our algorithm alternates between searching for neighbors in a data set that are similar to the test reconstruction, and training a local network on these neighbors followed by updating the test reconstruction. Because our reconstruction model is learned on a dataset that is structurally similar to the image being reconstructed, rather than being fit on a large, diverse training set, it is more adaptive to new scans. It can also handle changes in training sets and flexible scan settings, while being relatively fast. Our approach, dubbed LONDN-MRI, was validated on the FastMRI multi-coil knee data set using deep unrolled reconstruction networks. Reconstructions were performed at fourfold and eightfold undersampling of k-space with 1D variable-density random phase-encode undersampling masks. Our results demonstrate that our proposed locally-trained method produces higher-quality reconstructions compared to models trained globally on larger datasets.  ( 2 min )
    Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features. (arXiv:1911.05350v2 [stat.ML] UPDATED)
Although kernel methods are widely used in many learning problems, they have poor scalability to large datasets. To address this problem, sketching and stochastic gradient methods are the most commonly used techniques to derive efficient large-scale learning algorithms. In this study, we consider solving a binary classification problem using random features and stochastic gradient descent. Recent research has shown an exponential convergence rate of the expected classification error under the strong low-noise condition. We extend these analyses to the random-features setting: for general Lipschitz loss functions, we analyze the error induced by the random-features approximation in terms of the distance between the generated hypothesis (including population risk minimizers and empirical risk minimizers), and show that exponential convergence of the expected classification error is achieved even when the random-features approximation is applied. Additionally, we demonstrate that the convergence rate does not depend on the number of features, and that there is a significant computational benefit to using random features in classification problems because of the strong low-noise condition.
    A Fair Comparison of Two Popular Flat Minima Optimizers: Stochastic Weight Averaging vs. Sharpness-Aware Minimization. (arXiv:2202.00661v3 [cs.LG] UPDATED)
Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve upon stochastic and adaptive gradient-based optimizers for training neural networks. Two methods have received significant attention due to their impressive generalization performance and scalability: (1) Stochastic Weight Averaging (SWA) and (2) Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them. Previous work mainly evaluated SWA and SAM on different architectures and datasets. We fill this gap here by comparing the loss surfaces of the models trained with each method and through a broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover a number of surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.
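Of the two methods, SWA has a particularly compact implementation via PyTorch's built-in utilities; a minimal loop follows, with illustrative hyperparameters. SAM requires a custom two-step optimizer and is omitted here.

```python
# Minimal SWA training loop using PyTorch's stock utilities; swa_start and
# the learning rates are illustrative, not the benchmarked settings.
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, loader, loss_fn, epochs=100, swa_start=75):
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    swa_model = AveragedModel(model)
    swa_sched = SWALR(opt, swa_lr=0.05)
    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)  # running weight average
            swa_sched.step()
    update_bn(loader, swa_model)  # recompute BN statistics for the average
    return swa_model
```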
    Understanding Nesterov's Acceleration via Proximal Point Method. (arXiv:2005.08304v3 [math.OC] UPDATED)
The proximal point method (PPM) is a fundamental method in optimization that is often used as a building block for designing optimization algorithms. In this work, we use PPM to provide conceptually simple derivations, along with convergence analyses, of different versions of Nesterov's accelerated gradient method (AGM). The key observation is that AGM is a simple approximation of PPM, which results in an elementary derivation of the update equations and stepsizes of AGM. This view also leads to a transparent and conceptually simple analysis of AGM's convergence by using the analysis of PPM. The derivations also naturally extend to the strongly convex case. Ultimately, the results presented in this paper are of both didactic and conceptual value; they unify and explain existing variants of AGM while motivating other accelerated methods for practically relevant settings.  ( 2 min )
    Know Your Boundaries: The Necessity of Explicit Behavioral Cloning in Offline RL. (arXiv:2206.00695v1 [cs.LG])
    We introduce an offline reinforcement learning (RL) algorithm that explicitly clones a behavior policy to constrain value learning. In offline RL, it is often important to prevent a policy from selecting unobserved actions, since the consequence of these actions cannot be presumed without additional information about the environment. One straightforward way to implement such a constraint is to explicitly model a given data distribution via behavior cloning and directly force a policy not to select uncertain actions. However, many offline RL methods instantiate the constraint indirectly -- for example, pessimistic value estimation -- due to a concern about errors when modeling a potentially complex behavior policy. In this work, we argue that it is not only viable but beneficial to explicitly model the behavior policy for offline RL because the constraint can be realized in a stable way with the trained model. We first suggest a theoretical framework that allows us to incorporate behavior-cloned models into value-based offline RL methods, enjoying the strength of both explicit behavior cloning and value learning. Then, we propose a practical method utilizing a score-based generative model for behavior cloning. With the proposed method, we show state-of-the-art performance on several datasets within the D4RL and Robomimic benchmarks and achieve competitive performance across all datasets tested.
    Sequential Bayesian Neural Subnetwork Ensembles. (arXiv:2206.00794v1 [stat.ML])
Deep neural network ensembles that appeal to model diversity have been used successfully to improve predictive performance and model robustness in several applications. Meanwhile, it has recently been shown that sparse subnetworks of dense models can match the performance of their dense counterparts and increase their robustness while effectively decreasing the model complexity. However, most ensembling techniques require multiple parallel and costly evaluations and have been proposed primarily for deterministic models, whereas sparsity induction has mostly been done through ad-hoc pruning. We propose sequential ensembling of dynamic Bayesian neural subnetworks that systematically reduces model complexity through sparsity-inducing priors and generates diverse ensembles in a single forward pass of the model. The ensembling strategy consists of an exploration phase that finds high-performing regions of the parameter space and multiple exploitation phases that effectively exploit the compactness of the sparse model to quickly converge to different minima in the energy landscape corresponding to high-performing subnetworks, yielding diverse ensembles. We empirically demonstrate that our proposed approach surpasses the baselines of dense frequentist and Bayesian ensemble models in prediction accuracy, uncertainty estimation, and out-of-distribution (OoD) robustness on the CIFAR10 and CIFAR100 datasets and their corruption-induced out-of-distribution variants, CIFAR10-C and CIFAR100-C. Furthermore, we find that our approach produces the most diverse ensembles compared with approaches using a single forward pass, and in some cases even compared with approaches using multiple forward passes.
    DASO: Distribution-Aware Semantics-Oriented Pseudo-label for Imbalanced Semi-Supervised Learning. (arXiv:2106.05682v2 [cs.CV] UPDATED)
The capability of traditional semi-supervised learning (SSL) methods is far from real-world application due to severely biased pseudo-labels caused by (1) class imbalance and (2) class distribution mismatch between labeled and unlabeled data. This paper addresses this relatively under-explored problem. First, we propose a general pseudo-labeling framework that class-adaptively blends the semantic pseudo-label from a similarity-based classifier with the linear one from the linear classifier, after observing that the two types of pseudo-labels have complementary properties in terms of bias. We further introduce a novel semantic alignment loss to establish balanced feature representations and reduce the biased predictions from the classifier. We term the whole framework Distribution-Aware Semantics-Oriented (DASO) Pseudo-label. We conduct extensive experiments on a wide range of imbalanced benchmarks: CIFAR10/100-LT, STL10-LT, and the large-scale long-tailed Semi-Aves with open-set classes, and demonstrate that the proposed DASO framework reliably improves SSL learners with unlabeled data, especially when both (1) class imbalance and (2) distribution mismatch dominate.  ( 2 min )
    Fictitious play in zero-sum stochastic games. (arXiv:2010.04223v6 [cs.GT] UPDATED)
We present a novel variant of fictitious play dynamics combining classical fictitious play with Q-learning for stochastic games, and analyze its convergence properties in two-player zero-sum stochastic games. Our dynamics involves players forming beliefs on the opponent's strategy and on their own continuation payoff (Q-function), and playing a greedy best response using the estimated continuation payoffs. Players update their beliefs from observations of opponent actions. A key property of the learning dynamics is that the beliefs on Q-functions are updated at a slower timescale than the beliefs on strategies. We show that, in both the model-based and the model-free case (without knowledge of player payoff functions and state transition probabilities), the beliefs on strategies converge to a stationary mixed Nash equilibrium of the zero-sum stochastic game.
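The two-timescale structure is the key mechanism. The sketch below illustrates it for the degenerate single-state case (a zero-sum matrix game), where the continuation payoff reduces to a payoff estimate; step sizes and the deterministic tie-breaking are illustrative assumptions, not the paper's exact dynamics.

```python
# Two-timescale fictitious play with payoff (Q) estimation for a zero-sum
# matrix game: beliefs on strategies update fast, payoff estimates slowly.
import numpy as np

def fictitious_play_q(payoff, n_iters=10000):
    """payoff: (A, B) stage payoff to player 1; player 2 receives -payoff."""
    A, B = payoff.shape
    belief1 = np.ones(B) / B   # player 1's belief over opponent actions
    belief2 = np.ones(A) / A
    q1 = np.zeros((A, B))      # player 1's payoff estimates
    q2 = np.zeros((B, A))      # player 2's payoff estimates
    for t in range(1, n_iters + 1):
        a = int(np.argmax(q1 @ belief1))   # greedy best responses
        b = int(np.argmax(q2 @ belief2))
        fast = 1.0 / t
        slow = 1.0 / (t * np.log(t + 2))   # strictly slower timescale
        # Fast: update strategy beliefs from the observed opponent action.
        belief1 *= (1 - fast); belief1[b] += fast
        belief2 *= (1 - fast); belief2[a] += fast
        # Slow: update payoff estimates toward the realized payoffs.
        q1[a, b] += slow * (payoff[a, b] - q1[a, b])
        q2[b, a] += slow * (-payoff[a, b] - q2[b, a])
    return belief1, belief2
```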
    Incorporating Explicit Uncertainty Estimates into Deep Offline Reinforcement Learning. (arXiv:2206.01085v1 [cs.LG])
    Most theoretically motivated work in the offline reinforcement learning setting requires precise uncertainty estimates. This requirement restricts the algorithms derived in that work to the tabular and linear settings where such estimates exist. In this work, we develop a novel method for incorporating scalable uncertainty estimates into an offline reinforcement learning algorithm called deep-SPIBB that extends the SPIBB family of algorithms to environments with larger state and action spaces. We use recent innovations in uncertainty estimation from the deep learning community to get more scalable uncertainty estimates to plug into deep-SPIBB. While these uncertainty estimates do not allow for the same theoretical guarantees as in the tabular case, we argue that the SPIBB mechanism for incorporating uncertainty is more robust and flexible than pessimistic approaches that incorporate the uncertainty as a value function penalty. We bear this out empirically, showing that deep-SPIBB outperforms pessimism based approaches with access to the same uncertainty estimates and performs at least on par with a variety of other strong baselines across several environments and datasets.
    Temporal Knowledge Graph Forecasting with Neural ODE. (arXiv:2101.05151v3 [cs.LG] UPDATED)
There has been increasing interest in inferring future links on temporal knowledge graphs (KGs). While links on temporal KGs vary continuously over time, existing approaches model temporal KGs in discrete state spaces. To this end, we propose a novel continuum model by extending the idea of neural ordinary differential equations (ODEs) to multi-relational graph convolutional networks. The proposed model preserves the continuous nature of dynamic multi-relational graph data and encodes both temporal and structural information into continuous-time dynamic embeddings. In addition, a novel graph transition layer is applied to capture the transitions on the dynamic graph, i.e., edge formation and dissolution. We perform extensive experiments on five benchmark datasets for temporal KG reasoning, showing our model's superior performance on the future link forecasting task.
    Faster Rates of Convergence to Stationary Points in Differentially Private Optimization. (arXiv:2206.00846v1 [cs.LG])
    We study the problem of approximating stationary points of Lipschitz and smooth functions under $(\varepsilon,\delta)$-differential privacy (DP) in both the finite-sum and stochastic settings. A point $\widehat{w}$ is called an $\alpha$-stationary point of a function $F:\mathbb{R}^d\rightarrow\mathbb{R}$ if $\|\nabla F(\widehat{w})\|\leq \alpha$. We provide a new efficient algorithm that finds an $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{2/3}\big)$-stationary point in the finite-sum setting, where $n$ is the number of samples. This improves on the previous best rate of $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$. We also give a new construction that improves over the existing rates in the stochastic optimization setting, where the goal is to find approximate stationary points of the population risk. Our construction finds a $\tilde{O}\big(\frac{1}{n^{1/3}} + \big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$-stationary point of the population risk in time linear in $n$. Furthermore, under the additional assumption of convexity, we completely characterize the sample complexity of finding stationary points of the population risk (up to polylog factors) and show that the optimal rate on population stationarity is $\tilde \Theta\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\big)$. Finally, we show that our methods can be used to provide dimension-independent rates of $O\big(\frac{1}{\sqrt{n}}+\min\big(\big[\frac{\sqrt{rank}}{n\varepsilon}\big]^{2/3},\frac{1}{(n\varepsilon)^{2/5}}\big)\big)$ on population stationarity for Generalized Linear Models (GLM), where $rank$ is the rank of the design matrix, which improves upon the previous best known rate.
    DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. (arXiv:2206.00927v1 [cs.LG])
    Diffusion probabilistic models (DPMs) are emerging powerful generative models. Despite their high-quality generation performance, DPMs still suffer from slow sampling, as they generally need hundreds or thousands of sequential function evaluations (steps) of large neural networks to draw a sample. Sampling from DPMs can be viewed alternatively as solving the corresponding diffusion ordinary differential equations (ODEs). In this work, we propose an exact formulation of the solution of diffusion ODEs. The formulation analytically computes the linear part of the solution, rather than leaving all terms to black-box ODE solvers as adopted in previous works. By applying a change of variable, the solution can be equivalently simplified to an exponentially weighted integral of the neural network. Based on our formulation, we propose DPM-Solver, a fast dedicated high-order solver for diffusion ODEs with a convergence order guarantee. DPM-Solver is suitable for both discrete-time and continuous-time DPMs without any further training. Experimental results show that DPM-Solver can generate high-quality samples in only 10 to 20 function evaluations on various datasets. We achieve 4.70 FID in 10 function evaluations and 2.87 FID in 20 function evaluations on the CIFAR10 dataset, and a $4\sim 16\times$ speedup compared with previous state-of-the-art training-free samplers on various datasets.
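    To make the exponential-integrator idea concrete, below is a minimal sketch of a first-order step in the spirit of DPM-Solver, assuming an epsilon-prediction model and a noise schedule exposed as callables alpha(t) and sigma(t); the function and argument names are illustrative, not the paper's API. The linear part (alpha(t)/alpha(s)) * x_s is computed analytically, and only the noise-prediction integral is approximated, which is what permits large steps.

        import math

        def dpm_solver1_step(model, x_s, s, t, alpha, sigma):
            # log-SNR lambda = log(alpha / sigma) at the two times
            lam_s = math.log(alpha(s) / sigma(s))
            lam_t = math.log(alpha(t) / sigma(t))
            h = lam_t - lam_s                  # step size in log-SNR
            eps = model(x_s, s)                # noise prediction at time s
            # linear part solved exactly; the integral approximated to first order
            return (alpha(t) / alpha(s)) * x_s - sigma(t) * math.expm1(h) * eps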
    Unveiling The Mask of Position-Information Pattern Through the Mist of Image Features. (arXiv:2206.01202v1 [cs.CV])
    Recent studies show that paddings in convolutional neural networks encode absolute position information, which can negatively affect model performance on certain tasks. However, existing metrics for quantifying the strength of positional information remain unreliable and frequently lead to erroneous results. To address this issue, we propose novel metrics for measuring (and visualizing) the encoded positional information. We formally define the encoded information as PPP (Position-information Pattern from Padding) and conduct a series of experiments to study its properties as well as its formation. The proposed metrics measure the presence of positional information more reliably than the existing metrics based on PosENet and a test in F-Conv. We also demonstrate that for all extant (and proposed) padding schemes, PPP is primarily a learning artifact and is less dependent on the characteristics of the underlying padding schemes.
    Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search. (arXiv:2206.00702v1 [cs.AI])
    Complex reasoning problems contain states that vary in the computational cost required to determine a good action plan. Taking advantage of this property, we propose Adaptive Subgoal Search (AdaSubS), a search method that adaptively adjusts the planning horizon. To this end, AdaSubS generates diverse sets of subgoals at different distances. A verification mechanism is employed to swiftly filter out unreachable subgoals, allowing the search to focus on feasible further subgoals. In this way, AdaSubS benefits from the efficiency of planning with longer subgoals and the fine control offered by shorter ones. We show that AdaSubS significantly surpasses hierarchical planning algorithms on three complex reasoning tasks: Sokoban, the Rubik's Cube, and the inequality-proving benchmark INT, setting a new state of the art on INT.
    On the Global Convergence Rates of Softmax Policy Gradient Methods. (arXiv:2005.06392v3 [cs.LG] UPDATED)
    We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a \L{}ojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-c \cdot t})$ toward softmax optimal policy $(c > 0)$. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new $\Omega(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform \L{}ojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
    Core-periphery Models for Hypergraphs. (arXiv:2206.00783v1 [cs.SI])
    We introduce a random hypergraph model for core-periphery structure. By leveraging our model's sufficient statistics, we develop a novel statistical inference algorithm that scales to large hypergraphs, with a runtime that is practically linear in the number of nodes in the graph after a preprocessing step that is almost linear in the number of hyperedges, as well as a scalable sampling algorithm. Our inference algorithm is capable of learning embeddings that correspond to the reputation (rank) of a node within the hypergraph. We also give theoretical bounds on the size of the core of hypergraphs generated by our model. We experiment with hypergraph data ranging up to $\sim 10^5$ hyperedges mined from the Microsoft Academic Graph, Stack Exchange, and GitHub and show that our model outperforms baselines in producing good fits.
    ORC: Network Group-based Knowledge Distillation using Online Role Change. (arXiv:2206.01186v1 [cs.LG])
    In knowledge distillation, since a single, omnipotent teacher network cannot solve all problems, multiple teacher-based knowledge distillation has been studied recently. However, the improvements are sometimes not as good as expected because immature teachers may transfer false knowledge to the student. In this paper, to overcome this limitation and exploit the efficacy of multiple networks, we divide the multiple networks into teacher and student groups. That is, the student group is a set of immature networks that require learning the teacher's knowledge, while the teacher group consists of the selected networks that have performed well. Furthermore, according to our online role change strategy, the top-ranked networks in the student group can be promoted to the teacher group at every iteration, and vice versa. After training the teacher group using the error images of the student group to refine the teacher group's knowledge, we successfully transfer the collective knowledge from the teacher group to the student group. We verify the superiority of the proposed method on CIFAR-10 and CIFAR-100, achieving high performance. We further show the generality of our method with various backbone architectures such as ResNet, WRN, VGG, MobileNet, and ShuffleNet.
    Graph Autoencoders for Embedding Learning in Brain Networks and Major Depressive Disorder Identification. (arXiv:2107.12838v2 [q-bio.NC] UPDATED)
    Brain functional connectivity (FC) reveals biomarkers for identification of various neuropsychiatric disorders. Recent application of deep neural networks (DNNs) to connectome-based classification mostly relies on traditional convolutional neural networks using input connectivity matrices on a regular Euclidean grid. We propose a graph deep learning framework to incorporate the non-Euclidean information about graph structure for classifying functional magnetic resonance imaging (fMRI)-derived brain networks in major depressive disorder (MDD). We design a novel graph autoencoder (GAE) architecture based on the graph convolutional networks (GCNs) to embed the topological structure and node content of large-sized fMRI networks into low-dimensional latent representations. In network construction, we employ the Ledoit-Wolf (LDW) shrinkage method to estimate the high-dimensional FC metrics efficiently from fMRI data. We consider both supervised and unsupervised approaches for the graph embedding learning. The learned embeddings are then used as feature inputs for a deep fully-connected neural network (FCNN) to discriminate MDD from healthy controls. Evaluated on two resting-state fMRI (rs-fMRI) MDD datasets, results show that the proposed GAE-FCNN model significantly outperforms several state-of-the-art methods for brain connectome classification, achieving the best accuracy using the LDW-FC edges as node features. The graph embeddings of fMRI FC networks learned by the GAE also reveal apparent group differences between MDD and HC. Our new framework demonstrates feasibility of learning graph embeddings on brain networks to provide discriminative information for diagnosis of brain disorders.
    DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation. (arXiv:2110.02711v5 [cs.CV] UPDATED)
    Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) enable zero-shot image manipulation guided by text prompts. However, their application to diverse real images is still difficult due to the limited GAN inversion capability. Specifically, these approaches often have difficulty reconstructing images with novel poses, views, and highly variable contents compared to the training data, altering object identity, or producing unwanted image artifacts. To mitigate these problems and enable faithful manipulation of real images, we propose a novel method, dubbed DiffusionCLIP, that performs text-driven image manipulation using diffusion models. Building on the full inversion capability and high-quality image generation power of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains and takes another step towards general application by manipulating images from the widely varying ImageNet dataset. Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation. Extensive experiments and human evaluation confirmed the robust and superior manipulation performance of our method compared to the existing baselines. Code is available at https://github.com/gwang-kim/DiffusionCLIP.git.
    A Log-Linear Time Sequential Optimal Calibration Algorithm for Quantized Isotonic L2 Regression. (arXiv:2206.00744v1 [cs.LG])
    We study the sequential calibration of estimations in a quantized isotonic L2 regression setting. We start by showing that the optimal calibrated quantized estimations can be acquired from the traditional isotonic L2 regression solution. We modify the traditional PAVA algorithm to create calibrators for both batch and sequential optimization of the quantized isotonic regression problem. Our algorithm can update the optimal quantized monotone mapping for the samples observed so far in linear space and logarithmic time per new unordered sample.
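    For reference, the classical batch PAVA that the paper builds on can be sketched in a few lines; this is the textbook L2 version, not the authors' sequential quantized calibrator:

        def pava_l2(y, w=None):
            # isotonic L2 regression: non-decreasing fit minimizing sum_i w_i * (y_i - f_i)^2
            w = [1.0] * len(y) if w is None else list(w)
            blocks = []  # each block: [mean, weight, length]
            for yi, wi in zip(y, w):
                blocks.append([yi, wi, 1])
                # pool adjacent blocks while monotonicity is violated
                while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
                    m2, w2, n2 = blocks.pop()
                    m1, w1, n1 = blocks.pop()
                    wm = w1 + w2
                    blocks.append([(w1 * m1 + w2 * m2) / wm, wm, n1 + n2])
            out = []
            for m, _, n in blocks:
                out.extend([m] * n)
            return out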
    Deep Transformer Q-Networks for Partially Observable Reinforcement Learning. (arXiv:2206.01078v1 [cs.LG])
    Real-world reinforcement learning tasks often involve some form of partial observability where the observations only give a partial or noisy view of the true state of the world. Such tasks typically require some form of memory, where the agent has access to multiple past observations, in order to perform well. One popular way to incorporate memory is by using a recurrent neural network to access the agent's history. However, recurrent neural networks in reinforcement learning are often fragile and difficult to train, susceptible to catastrophic forgetting and sometimes fail completely as a result. In this work, we propose Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history. DTQN is designed modularly, and we compare results against several modifications to our base model. Our experiments demonstrate the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches.  ( 2 min )
    Query Processing on Tensor Computation Runtimes. (arXiv:2203.01877v2 [cs.DB] UPDATED)
    The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientists to efficiently exploit the exciting capabilities offered by the new hardware. In this paper, we explore how databases can ride the wave of innovation happening in the AI space. We design, build, and evaluate Tensor Query Processor (TQP): TQP transforms SQL queries into tensor programs and executes them on TCRs. TQP is able to run the full TPC-H benchmark by implementing novel algorithms for relational operators on the tensor routines. At the same time, TQP can support various hardware while only requiring a fraction of the usual development effort. Experiments show that TQP can improve query execution time by up to 10$\times$ over specialized CPU- and GPU-only systems. Finally, TQP can accelerate queries mixing ML predictions and SQL end-to-end, and deliver up to 9$\times$ speedup over CPU baselines.
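    As a toy illustration of the underlying idea (our own example, not TQP's compiler output), a SQL filter plus grouped aggregate maps naturally onto tensor primitives:

        import torch

        # SELECT grp, SUM(price) FROM t WHERE price > 15 GROUP BY grp
        price = torch.tensor([10., 20., 30., 40.])
        grp = torch.tensor([0, 1, 0, 1])         # group key per row

        mask = price > 15                         # WHERE as a boolean mask
        sums = torch.zeros(2).scatter_add_(0, grp[mask], price[mask])
        print(sums)                               # tensor([30., 60.])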
    Improving Diffusion Models for Inverse Problems using Manifold Constraints. (arXiv:2206.00941v1 [cs.LG])
    Recently, diffusion models have been used to solve various inverse problems in an unsupervised manner with appropriate modifications to the sampling process. However, the current solvers, which recursively apply a reverse diffusion step followed by a measurement consistency step, often produce sub-optimal results. By studying the generative sampling path, here we show that current solvers throw the sample path off the data manifold, and hence the error accumulates. To address this, we propose an additional correction term inspired by the manifold constraint, which can be used synergistically with the previous solvers to keep the iterations close to the manifold. The proposed manifold constraint is straightforward to implement within a few lines of code, yet boosts the performance by a surprisingly large margin. With extensive experiments, we show that our method is superior to the previous methods both theoretically and empirically, producing promising results in many applications such as image inpainting, colorization, and sparse-view computed tomography.  ( 2 min )
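    A hedged sketch of what such a correction might look like for a linear measurement operator A, with the data-fidelity gradient taken through the denoised (Tweedie) estimate so the update respects the learned manifold; the names and exact placement within the sampling loop are our assumptions, not the paper's code:

        import torch

        def manifold_correction(x_t, x_prev, eps_model, A, y, alpha_bar_t, step):
            # x_prev: output of the usual reverse diffusion + consistency step
            x_t = x_t.detach().requires_grad_(True)
            eps = eps_model(x_t)
            # Tweedie estimate of the clean sample x0 from x_t
            x0_hat = (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
            residual = ((y - A(x0_hat)) ** 2).sum()       # measurement fidelity
            grad = torch.autograd.grad(residual, x_t)[0]
            return x_prev - step * grad                   # pull back toward the manifold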
    Watch Out for the Safety-Threatening Actors: Proactively Mitigating Safety Hazards. (arXiv:2206.00886v1 [cs.RO])
    Despite the successful demonstration of autonomous vehicles (AVs), such as self-driving cars, ensuring AV safety remains a challenging task. Although some actors influence an AV's driving decisions more than others, current approaches pay equal attention to each actor on the road. An actor's influence on the AV's decision can be characterized in terms of its ability to decrease the number of safe navigational choices for the AV. In this work, we propose a safety threat indicator (STI) using counterfactual reasoning to estimate the importance of each actor on the road with respect to its influence on the AV's safety. We use this indicator to (i) characterize the existing real-world datasets to identify rare hazardous scenarios as well as the poor performance of existing controllers in such scenarios; and (ii) design an RL based safety mitigation controller to proactively mitigate the safety hazards those actors pose to the AV. Our approach reduces the accident rate for the state-of-the-art AV agent(s) in rare hazardous scenarios by more than 70%.
    Self-Consistency of the Fokker-Planck Equation. (arXiv:2206.00860v1 [cs.LG])
    The Fokker-Planck equation (FPE) is the partial differential equation that governs the density evolution of the It\^o process and is of great importance in statistical physics and machine learning. The FPE can be regarded as a continuity equation where the change of the density is completely determined by a time-varying velocity field. Importantly, this velocity field also depends on the current density function. As a result, the ground-truth velocity field can be shown to be the solution of a fixed-point equation, a property that we call self-consistency. In this paper, we exploit this concept to design a potential function of the hypothesis velocity fields, and prove that, if such a function diminishes to zero during the training procedure, the trajectory of the densities generated by the hypothesis velocity fields converges to the solution of the FPE in the Wasserstein-2 sense. The proposed potential function is amenable to neural-network-based parameterization, as the stochastic gradient with respect to the parameters can be efficiently computed. Once a parameterized model, such as a Neural Ordinary Differential Equation, is trained, we can generate the entire trajectory to the FPE.  ( 2 min )
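    For concreteness, under the simplifying assumption of unit diffusion, i.e. $dX_t = f(X_t,t)\,dt + \sqrt{2}\,dW_t$, the continuity-equation form referred to above reads

        $$\partial_t p_t(x) = -\nabla\cdot\big(f(x,t)\,p_t(x)\big) + \Delta p_t(x) = -\nabla\cdot\big(p_t(x)\,v_t(x)\big), \qquad v_t(x) = f(x,t) - \nabla\log p_t(x),$$

    so the velocity field $v_t$ depends on $p_t$ itself; this is exactly the fixed-point structure that the self-consistency potential exploits.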
    $\alpha$NAS: Neural Architecture Search using Property Guided Synthesis. (arXiv:2205.03960v2 [cs.LG] UPDATED)
    In the past few years, neural architecture search (NAS) has become an increasingly important tool within the deep learning community. Despite the many recent successes of NAS, however, most existing approaches operate within highly structured design spaces, and hence explore only a small fraction of the full search space of neural architectures while also requiring significant manual effort from domain experts. In this work, we develop techniques that enable efficient NAS in a significantly larger design space. To accomplish this, we propose to perform NAS in an abstract search space of program properties. Our key insights are as follows: (1) the abstract search space is significantly smaller than the original search space, and (2) architectures with similar program properties also have similar performance; thus, we can search more efficiently in the abstract search space. To enable this approach, we also propose a novel efficient synthesis procedure, which accepts a set of promising program properties, and returns a satisfying neural architecture. We implement our approach, $\alpha$NAS, within an evolutionary framework, where the mutations are guided by the program properties. Starting with a ResNet-34 model, $\alpha$NAS produces a model with slightly improved accuracy on CIFAR-10 but 96% fewer parameters. On ImageNet, $\alpha$NAS is able to improve over Vision Transformer (30% fewer FLOPS and parameters), ResNet-50 (23% fewer FLOPS, 14% fewer parameters), and EfficientNet (7% fewer FLOPS and parameters) without any degradation in accuracy.  ( 2 min )
    NIPQ: Noise Injection Pseudo Quantization for Automated DNN Optimization. (arXiv:2206.00820v1 [cs.LG])
    The optimization of neural networks in terms of computation cost and memory footprint is crucial for their practical deployment on edge devices. In this work, we propose a novel quantization-aware training (QAT) scheme called noise injection pseudo quantization (NIPQ). NIPQ is implemented based on pseudo quantization noise (PQN) and has several advantages. First, both activation and weight can be quantized based on a unified framework. Second, the hyper-parameters of quantization (e.g., layer-wise bit-width and quantization interval) are automatically tuned. Third, after QAT, the network has robustness against quantization, thereby making it easier to deploy in practice. To validate the superiority of the proposed algorithm, we provide extensive analysis and conduct diverse experiments for various vision applications. Our comprehensive experiments validate the outstanding performance of the proposed algorithm in several aspects.
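    A minimal sketch of the pseudo-quantization-noise idea as we read it (how the step size is parameterized and tuned per layer is the paper's contribution and not shown here):

        import torch

        def pseudo_quantize(w, delta, training=True):
            # training: replace rounding by additive uniform noise with the same
            # support, keeping the operation differentiable in w and delta
            if training:
                noise = (torch.rand_like(w) - 0.5) * delta   # U(-delta/2, delta/2)
                return w + noise
            # inference: actual quantization to the grid of step delta
            return torch.round(w / delta) * delta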
    Hyperspherical Consistency Regularization. (arXiv:2206.00845v1 [cs.LG])
    Recent advances in contrastive learning have enlightened diverse applications across various semi-supervised fields. Jointly training supervised learning and unsupervised learning with a shared feature encoder has become a common scheme. Though it benefits from taking advantage of both feature-dependent information from self-supervised learning and label-dependent information from supervised learning, this scheme still suffers from classifier bias. In this work, we systematically explore the relationship between self-supervised learning and supervised learning, and study how self-supervised learning helps robust data-efficient deep learning. We propose hyperspherical consistency regularization (HCR), a simple yet effective plug-and-play method, to regularize the classifier using feature-dependent information and thus avoid bias from labels. Specifically, HCR first projects logits from the classifier and feature projections from the projection head onto their respective hyperspheres, then enforces data points on the two hyperspheres to have similar structures by minimizing the binary cross entropy of pairwise distance similarity metrics. Extensive experiments on semi-supervised and weakly-supervised learning demonstrate the effectiveness of our method, showing superior performance with HCR.
    Defense Against Gradient Leakage Attacks via Learning to Obscure Data. (arXiv:2206.00769v1 [cs.LG])
    Federated learning is considered an effective privacy-preserving learning mechanism that separates the client's data from the model training process. However, federated learning is still at risk of privacy leakage because attackers can deliberately conduct gradient leakage attacks to reconstruct the client data. Recently, popular strategies such as gradient perturbation and input encryption have been proposed to defend against gradient leakage attacks. Nevertheless, these defenses can either greatly sacrifice model performance, or be evaded by more advanced attacks. In this paper, we propose a new defense method to protect the privacy of clients' data by learning to obscure data. Our defense method generates synthetic samples that are totally distinct from the original samples, yet maximally preserve their predictive features and guarantee model performance. Furthermore, our defense strategy makes it extremely difficult for the gradient leakage attack and its variants to reconstruct the client data. Through extensive experiments, we show that our proposed defense method obtains better privacy protection while preserving high accuracy compared with state-of-the-art methods.  ( 2 min )
    Cascaded Video Generation for Videos In-the-Wild. (arXiv:2206.00735v1 [cs.CV])
    Videos can be created by first outlining a global view of the scene and then adding local details. Inspired by this idea we propose a cascaded model for video generation which follows a coarse to fine approach. First our model generates a low resolution video, establishing the global scene structure, which is then refined by subsequent cascade levels operating at larger resolutions. We train each cascade level sequentially on partial views of the videos, which reduces the computational complexity of our model and makes it scalable to high-resolution videos with many frames. We empirically validate our approach on UCF101 and Kinetics-600, for which our model is competitive with the state-of-the-art. We further demonstrate the scaling capabilities of our model and train a three-level model on the BDD100K dataset which generates 256x256 pixel videos with 48 frames.
    Revisiting the General Identifiability Problem. (arXiv:2206.01081v1 [cs.LG])
    We revisit the problem of general identifiability originally introduced in [Lee et al., 2019] for causal inference and note that it is necessary to add a positivity assumption on the observational distribution to the original definition of the problem. We show that without such an assumption the rules of do-calculus, and consequently the proposed algorithm in [Lee et al., 2019], are not sound. Moreover, adding the assumption causes the completeness proof in [Lee et al., 2019] to fail. Under the positivity assumption, we present a new algorithm that is provably both sound and complete. A nice property of this new algorithm is that it establishes a connection between general identifiability and classical identifiability by Pearl [1995] by decomposing the general identifiability problem into a series of classical identifiability sub-problems.  ( 2 min )
    Federated Learning with a Sampling Algorithm under Isoperimetry. (arXiv:2206.00920v1 [cs.LG])
    Federated learning uses a set of techniques to efficiently distribute the training of a machine learning algorithm across several devices that own the training data. These techniques critically rely on reducing the communication cost -- the main bottleneck -- between the devices and a central server. Federated learning algorithms usually take an optimization approach: they are algorithms for minimizing the training loss subject to communication (and other) constraints. In this work, we instead take a Bayesian approach to the training task, and propose a communication-efficient variant of the Langevin algorithm to sample from the posterior. The latter approach is more robust and provides more knowledge of the \textit{a posteriori} distribution than its optimization counterpart. We analyze our algorithm without assuming that the target distribution is strongly log-concave. Instead, we assume the weaker log Sobolev inequality, which allows for nonconvexity.  ( 2 min )
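    The building block here is the unadjusted Langevin step, which samples from $p(\theta) \propto e^{-U(\theta)}$,

        $$\theta_{k+1} = \theta_k - \eta\,\nabla U(\theta_k) + \sqrt{2\eta}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I);$$

    the paper's contribution is a communication-efficient federated variant of this iteration, which the display above does not capture.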
    On the Effectiveness of Knowledge Graph Embeddings: a Rule Mining Approach. (arXiv:2206.00983v1 [cs.LG])
    We study the effectiveness of Knowledge Graph Embeddings (KGE) for knowledge graph (KG) completion with rule mining. More specifically, we mine rules from KGs before and after they have been completed by a KGE to compare possible differences in the rules extracted. We apply this method to classical KGEs approaches, in particular, TransE, DistMult and ComplEx. Our experiments indicate that there can be huge differences between the extracted rules, depending on the KGE approach for KG completion. In particular, after the TransE completion, several spurious rules were extracted.  ( 2 min )
    Causal Structure Learning: a Combinatorial Perspective. (arXiv:2206.01152v1 [stat.ME])
    In this review, we discuss approaches for learning causal structure from data, also called causal discovery. In particular, we focus on approaches for learning directed acyclic graphs (DAGs) and various generalizations which allow for some variables to be unobserved in the available data. We devote special attention to two fundamental combinatorial aspects of causal structure learning. First, we discuss the structure of the search space over causal graphs. Second, we discuss the structure of equivalence classes over causal graphs, i.e., sets of graphs which represent what can be learned from observational data alone, and how these equivalence classes can be refined by adding interventional data.
    Generating Sparse Counterfactual Explanations For Multivariate Time Series. (arXiv:2206.00931v1 [cs.LG])
    Since neural networks play an increasingly important role in critical sectors, explaining network predictions has become a key research topic. Counterfactual explanations can help to understand why classifier models decide for particular class assignments and, moreover, how the respective input samples would have to be modified such that the class prediction changes. Previous approaches mainly focus on image and tabular data. In this work we propose SPARCE, a generative adversarial network (GAN) architecture that generates SPARse Counterfactual Explanations for multivariate time series. Our approach provides a custom sparsity layer and regularizes the counterfactual loss function in terms of similarity, sparsity, and smoothness of trajectories. We evaluate our approach on real-world human motion datasets as well as a synthetic time series interpretability benchmark. Although we make significantly sparser modifications than other approaches, we achieve comparable or better performance on all metrics. Moreover, we demonstrate that our approach predominantly modifies salient time steps and features, leaving non-salient inputs untouched.
    DPar2: Fast and Scalable PARAFAC2 Decomposition for Irregular Dense Tensors. (arXiv:2203.12798v2 [cs.LG] UPDATED)
    Given an irregular dense tensor, how can we efficiently analyze it? An irregular tensor is a collection of matrices whose columns have the same size and rows have different sizes from each other. PARAFAC2 decomposition is a fundamental tool to deal with an irregular tensor in applications including phenotype discovery and trend analysis. Although several PARAFAC2 decomposition methods exist, their efficiency is limited for irregular dense tensors due to the expensive computations involved with the tensor. In this paper, we propose DPar2, a fast and scalable PARAFAC2 decomposition method for irregular dense tensors. DPar2 achieves high efficiency by effectively compressing each slice matrix of a given irregular tensor, careful reordering of computations with the compression results, and exploiting the irregularity of the tensor. Extensive experiments show that DPar2 is up to 6.0x faster than competitors on real-world irregular tensors while achieving comparable accuracy. In addition, DPar2 is scalable with respect to the tensor size and target rank.
    Dataset Distillation using Neural Feature Regression. (arXiv:2206.00719v1 [cs.LG])
    Dataset distillation aims to learn a small synthetic dataset that preserves most of the information from the original dataset. Dataset distillation can be formulated as a bi-level meta-learning problem where the outer loop optimizes the meta-dataset and the inner loop trains a model on the distilled data. Meta-gradient computation is one of the key challenges in this formulation, as differentiating through the inner loop learning procedure introduces significant computation and memory costs. In this paper, we address these challenges using neural Feature Regression with Pooling (FRePo), achieving the state-of-the-art performance with an order of magnitude less memory requirement and two orders of magnitude faster training than previous methods. The proposed algorithm is analogous to truncated backpropagation through time with a pool of models to alleviate various types of overfitting in dataset distillation. FRePo significantly outperforms the previous methods on CIFAR100, Tiny ImageNet, and ImageNet-1K. Furthermore, we show that high-quality distilled data can greatly improve various downstream applications, such as continual learning and membership inference defense.
    Federated Learning under Distributed Concept Drift. (arXiv:2206.00799v1 [cs.LG])
    Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). Our work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solutions to drift adaptation, with their single global model, are ill-suited to staggered drifts, necessitating multi-model solutions. We identify the problem of drift adaptation as a time-varying clustering problem, and we propose two new clustering algorithms for reacting to drifts based on local drift detection and hierarchical clustering. Empirical evaluation shows that our solutions achieve significantly higher accuracy than existing baselines, and are comparable to an idealized algorithm with oracle knowledge of the ground-truth clustering of clients to concepts at each time step.  ( 2 min )
    Learning code summarization from a small and local dataset. (arXiv:2206.00804v1 [cs.SE])
    Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary, and other phenomena vary substantially with each project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting) and a maximalist hybrid approach, fine-tuning first on many projects in many languages and then training on the same-project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art, on many different projects in both Java and Python.  ( 2 min )
    Collaboration Equilibrium in Federated Learning. (arXiv:2108.07926v3 [cs.LG] UPDATED)
    Federated learning (FL) refers to the paradigm of learning models over a collaborative research network involving multiple clients without sacrificing privacy. Recently, there have been rising concerns about the distributional discrepancies across different clients, which could even cause counterproductive consequences when collaborating with others. Since collaborating with all clients does not necessarily achieve the best performance, in this paper we study a rational collaboration called ``collaboration equilibrium'' (CE), where smaller collaboration coalitions are formed. Each client collaborates with certain members who maximally improve the model learning and isolates the others who make little contribution. We propose the concept of a benefit graph, which describes how each client can benefit from collaborating with other clients, and advance a Pareto optimization approach to identify the optimal collaborators. We then theoretically prove that a CE can be reached from the benefit graph through an iterative graph operation. Our framework provides a new way of setting up collaborations in a research network. Experiments on both synthetic and real-world datasets are provided to demonstrate the effectiveness of our method.
    Residual Multiplicative Filter Networks for Multiscale Reconstruction. (arXiv:2206.00746v1 [cs.CV])
    Coordinate networks like Multiplicative Filter Networks (MFNs) and BACON offer some control over the frequency spectrum used to represent continuous signals such as images or 3D volumes. Yet, they are not readily applicable to problems for which coarse-to-fine estimation is required, including various inverse problems in which coarse-to-fine optimization plays a key role in avoiding poor local minima. We introduce a new coordinate network architecture and training scheme that enables coarse-to-fine optimization with fine-grained control over the frequency support of learned reconstructions. This is achieved with two key innovations. First, we incorporate skip connections so that structure at one scale is preserved when fitting finer-scale structure. Second, we propose a novel initialization scheme to provide control over the model frequency spectrum at each stage of optimization. We demonstrate how these modifications enable multiscale optimization for coarse-to-fine fitting to natural images. We then evaluate our model on synthetically generated datasets for the problem of single-particle cryo-EM reconstruction, learning high-resolution multiscale structures on par with the state of the art.
    Availability Attacks Create Shortcuts. (arXiv:2111.00898v2 [cs.LG] UPDATED)
    Availability attacks, which poison the training data with imperceptible perturbations, can make the data \emph{not exploitable} by machine learning algorithms so as to prevent unauthorized use of data. In this work, we investigate why these perturbations work in principle. We are the first to unveil an important population property of the perturbations of these attacks: they are almost \textbf{linearly separable} when assigned with the target labels of the corresponding samples, which hence can work as \emph{shortcuts} for the learning objective. We further verify that linear separability is indeed the workhorse for availability attacks. We synthesize linearly-separable perturbations as attacks and show that they are as powerful as the deliberately crafted attacks. Moreover, such synthetic perturbations are much easier to generate. For example, previous attacks need dozens of hours to generate perturbations for ImageNet while our algorithm only needs several seconds. Our finding also suggests that the \emph{shortcut learning} is more widely present than previously believed as deep models would rely on shortcuts even if they are of an imperceptible scale and mixed together with the normal features. Our source code is published at \url{https://github.com/dayu11/Availability-Attacks-Create-Shortcuts}.
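    A hedged sketch of the observation (the generation details differ from the paper's algorithm): assign each class one random, imperceptibly small direction and add it to every sample of that class, yielding label-aligned, linearly separable perturbations that act as shortcuts:

        import numpy as np

        def class_wise_perturbations(x, labels, n_classes, eps=8 / 255, seed=0):
            # one random +/- eps pattern per class, broadcast to its samples
            rng = np.random.default_rng(seed)
            d = rng.choice([-1.0, 1.0], size=(n_classes,) + x.shape[1:]) * eps
            return np.clip(x + d[labels], 0.0, 1.0)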
    Shortest Path Networks for Graph Property Prediction. (arXiv:2206.01003v1 [cs.LG])
    Most graph neural network models rely on a particular message passing paradigm, where the idea is to iteratively propagate node representations of a graph to each node in the direct neighborhood. While very prominent, this paradigm leads to information propagation bottlenecks, as information is repeatedly compressed at intermediary node representations, which causes loss of information, making it practically impossible to gather meaningful signals from distant nodes. To address this issue, we propose shortest path message passing neural networks, where the node representations of a graph are propagated to each node in the shortest path neighborhoods. In this setting, nodes can directly communicate between each other even if they are not neighbors, breaking the information bottleneck and hence leading to more adequately learned representations. Theoretically, our framework generalizes message passing neural networks, resulting in provably more expressive models. Empirically, we verify the capacity of a basic model of this framework on dedicated synthetic experiments, and on real-world graph classification and regression benchmarks, obtaining several state-of-the-art results.
    The Phenomenon of Policy Churn. (arXiv:2206.00730v1 [cs.LG])
    We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts $\epsilon$-greedy exploration in a fresh light, namely that $\epsilon$-noise plays a much smaller role than expected.
    Towards real-world navigation with deep differentiable planners. (arXiv:2108.05713v2 [cs.RO] UPDATED)
    We train embodied neural networks to plan and navigate unseen complex 3D environments, emphasising real-world deployment. Rather than requiring prior knowledge of the agent or environment, the planner learns to model the state transitions and rewards. To avoid the potentially hazardous trial-and-error of reinforcement learning, we focus on differentiable planners such as Value Iteration Networks (VIN), which are trained offline from safe expert demonstrations. Although they work well in small simulations, we address two major limitations that hinder their deployment. First, we observed that current differentiable planners struggle to plan long-term in environments with a high branching complexity. While they should ideally learn to assign low rewards to obstacles to avoid collisions, we posit that the constraints imposed on the network are not strong enough to guarantee that the network learns sufficiently large penalties for every possible collision. We thus impose a structural constraint on the value iteration, which explicitly learns to model any impossible actions. Secondly, we extend the model to work with a limited perspective camera under translation and rotation, which is crucial for real robot deployment. Many VIN-like planners assume a 360-degree or overhead view without rotation. In contrast, our method uses a memory-efficient lattice map to aggregate CNN embeddings of partial observations, and models the rotational dynamics explicitly using a 3D state-space grid (translation and rotation). Our proposals significantly improve semantic navigation and exploration on several 2D and 3D environments, succeeding in settings that are otherwise challenging for this class of methods. As far as we know, we are the first to successfully perform differentiable planning on the difficult Active Vision Dataset, consisting of real images captured from a robot.
    Offline Reinforcement Learning with Differential Privacy. (arXiv:2206.00810v1 [cs.LG])
    The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), and is thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility compared with the non-private counterpart for a medium-size dataset.  ( 2 min )
    Approximate Network Motif Mining Via Graph Learning. (arXiv:2206.01008v1 [cs.LG])
    Frequent and structurally related subgraphs, also known as network motifs, are valuable features of many graph datasets. However, the high computational complexity of identifying motif sets in arbitrary datasets (motif mining) has limited their use in many real-world datasets. By automatically leveraging statistical properties of datasets, machine learning approaches have shown promise in several tasks with combinatorial complexity and are therefore a promising candidate for network motif mining. In this work we seek to facilitate the development of machine learning approaches aimed at motif mining. We propose a formulation of the motif mining problem as a node labelling task. In addition, we build benchmark datasets and evaluation metrics which test the ability of models to capture different aspects of motif discovery such as motif number, size, topology, and scarcity. Next, we propose MotiFiesta, a first attempt at solving this problem in a fully differentiable manner with promising results on challenging baselines. Finally, we demonstrate through MotiFiesta that this learning setting can be applied simultaneously to general-purpose data mining and interpretable feature extraction for graph classification tasks.
    Training privacy-preserving video analytics pipelines by suppressing features that reveal information about private attributes. (arXiv:2203.02635v2 [cs.CV] UPDATED)
    Deep neural networks are increasingly deployed for scene analytics, including to evaluate the attention and reaction of people exposed to out-of-home advertisements. However, the features extracted by a deep neural network that was trained to predict a specific, consensual attribute (e.g. emotion) may also encode and thus reveal information about private, protected attributes (e.g. age or gender). In this work, we focus on such leakage of private information at inference time. We consider an adversary with access to the features extracted by the layers of a deployed neural network and use these features to predict private attributes. To prevent the success of such an attack, we modify the training of the network using a confusion loss that encourages the extraction of features that make it difficult for the adversary to accurately predict private attributes. We validate this training approach on image-based tasks using a publicly available dataset. Results show that, compared to the original network, the proposed PrivateNet can reduce the leakage of private information of a state-of-the-art emotion recognition classifier by 2.88% for gender and by 13.06% for age group, with a minimal effect on task accuracy.
    SolarGAN: Synthetic Annual Solar Irradiance Time Series on Urban Building Facades via Deep Generative Networks. (arXiv:2206.00747v1 [cs.LG])
    Building Integrated Photovoltaics (BIPV) is a promising technology to decarbonize urban energy systems via harnessing solar energy available on building envelopes. While methods to assess solar irradiation, especially on rooftops, are well established, the assessment on building facades usually involves a higher effort due to more complex urban features and obstructions. The drawback of existing physics-based simulation programs is that they require significant manual modelling effort and computing time for generating time resolved deterministic results. Yet, solar irradiation is highly intermittent and representing its inherent uncertainty may be required for designing robust BIPV energy systems. Targeting these drawbacks, this paper proposes a data-driven model based on Deep Generative Networks (DGN) to efficiently generate high-fidelity stochastic ensembles of annual hourly solar irradiance time series on building facades with uncompromised spatiotemporal resolution at the urban scale. The only inputs required are easily obtainable, simple fisheye images serving as categorical shading masks, captured from 3D models. In principle, even actual photographs of urban contexts can be utilized, given they are semantically segmented. Our validations exemplify the high fidelity of the generated time series when compared to the physics-based simulator. To demonstrate the model's relevance for urban energy planning, we showcase its potential for generative design by parametrically altering characteristic features of the urban environment and producing corresponding time series on building facades under different climatic contexts in real-time.
    Multi-source Domain Adaptation via Weighted Joint Distributions Optimal Transport. (arXiv:2006.12938v2 [cs.LG] UPDATED)
    The problem of domain adaptation on an unlabeled target dataset using knowledge from multiple labelled source datasets is becoming increasingly important. A key challenge is to design an approach that overcomes the covariate and target shift both among the sources, and between the source and target domains. In this paper, we address this problem from a new perspective: instead of looking for a latent representation invariant between source and target domains, we exploit the diversity of source distributions by tuning their weights to the target task at hand. Our method, named Weighted Joint Distribution Optimal Transport (WJDOT), aims at simultaneously finding an Optimal Transport-based alignment between the source and target distributions and a re-weighting of the source distributions. We discuss the theoretical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves state-of-the-art performance on simulated and real-life datasets.
    RoCourseNet: Distributionally Robust Training of a Prediction Aware Recourse Model. (arXiv:2206.00700v1 [cs.LG])
    Counterfactual (CF) explanations for machine learning (ML) models are preferred by end-users, as they explain the predictions of ML models by providing a recourse case to individuals who are adversely impacted by predicted outcomes. Existing CF explanation methods generate recourses under the assumption that the underlying target ML model remains stationary over time. However, due to commonly occurring distributional shifts in training data, ML models constantly get updated in practice, which might render previously generated recourses invalid and diminish end-users' trust in our algorithmic framework. To address this problem, we propose RoCourseNet, a training framework that jointly optimizes for predictions and robust recourses to future data shifts. We have three main contributions: (i) We propose a novel virtual data shift (VDS) algorithm to find worst-case shifted ML models by explicitly considering the worst-case data shift in the training dataset. (ii) We leverage adversarial training to solve a novel tri-level optimization problem inside RoCourseNet, which simultaneously generates predictions and corresponding robust recourses. (iii) Finally, we evaluate RoCourseNet's performance on three real-world datasets and show that RoCourseNet outperforms state-of-the-art baselines by 10% in generating robust CF explanations.
    Invertible Neural Networks for Graph Prediction. (arXiv:2206.01163v1 [stat.ML])
    In this work, we address conditional generation using deep invertible neural networks. This is a type of problem where one aims to infer the most probable inputs $X$ given outcomes $Y$. We call our method \textit{invertible graph neural network} (iGNN) due to the primary focus on generating node features on graph data. A notable feature of our proposed method is that during network training, we revise the typically used loss objective in normalizing flows and consider Wasserstein-2 regularization to facilitate the training process. Algorithmically, we adopt an end-to-end training approach since our objective is to address prediction and generation in the forward and backward processes at once through a single model. Theoretically, we characterize the conditions for identifiability of a true mapping, the existence and invertibility of the mapping, and the expressiveness of iGNN in learning the mapping. Experimentally, we verify the performance of iGNN on both simulated and real-world datasets. We demonstrate through extensive numerical experiments that iGNN shows clear improvement over competing conditional generation benchmarks on high-dimensional and/or non-convex data.
    Regularized Nonlinear Regression for Simultaneously Selecting and Estimating Key Model Parameters. (arXiv:2104.11426v2 [stat.ME] UPDATED)
    In system identification, estimating parameters of a model using limited observations results in poor identifiability. To cope with this issue, we propose a new method to simultaneously select and estimate sensitive parameters as key model parameters and fix the remaining parameters to a set of typical values. Our method is formulated as a nonlinear least squares estimator with L1-regularization on the deviation of parameters from a set of typical values. First, we provide consistency and oracle properties of the proposed estimator as a theoretical foundation. Second, we provide a novel approach based on Levenberg-Marquardt optimization to numerically find the solution to the formulated problem. Third, to show the effectiveness, we present an application identifying a biomechanical parametric model of a head position tracking task for 10 human subjects from limited data. In a simulation study, the variances of estimated parameters are decreased by 96.1% as compared to that of the estimated parameters without L1-regularization. In an experimental study, our method improves the model interpretation by reducing the number of parameters to be estimated while maintaining variance accounted for (VAF) at above 82.5%. Moreover, the variances of estimated parameters are reduced by 71.1% as compared to that of the estimated parameters without L1-regularization. Our method is 54 times faster than the standard simplex-based optimization to solve the regularized nonlinear regression.
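    In symbols (our notation), the estimator described is

        $$\hat{\theta} = \arg\min_{\theta}\ \big\|y - f(x;\theta)\big\|_2^2 + \lambda\,\big\|\theta - \theta^{\mathrm{typ}}\big\|_1,$$

    where $\theta^{\mathrm{typ}}$ collects the typical values; the L1 penalty drives insensitive coordinates exactly to their typical values, so only the key parameters are effectively estimated.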
    Why Did This Model Forecast This Future? Closed-Form Temporal Saliency Towards Causal Explanations of Probabilistic Forecasts. (arXiv:2206.00679v1 [cs.LG])
    Forecasting tasks surrounding the dynamics of low-level human behavior are of significance to multiple research domains. In such settings, methods for explaining specific forecasts can enable domain experts to gain insights into the predictive relationships between behaviors. In this work, we introduce and address the following question: given a probabilistic forecasting model, how can we identify observed windows that the model considers salient when making its forecasts? We build upon a general definition of information-theoretic saliency grounded in human perception and extend it to forecasting settings by leveraging a crucial attribute of the domain: a single observation can result in multiple valid futures. We propose to express the saliency of an observed window in terms of the differential entropy of the resulting predicted future distribution. In contrast to existing methods that either require explicit training of the saliency mechanism or access to the internal states of the forecasting model, we obtain a closed-form solution for the saliency map for commonly used density functions in probabilistic forecasting. We empirically demonstrate how our framework can recover salient observed windows from head pose features for the sample task of speaking-turn forecasting using a synthesized conversation dataset.
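    As an illustration of why closed forms are available (our example for the common Gaussian case): if the predicted future distribution is $\mathcal{N}(\mu, \Sigma)$ over a $d$-dimensional future, the saliency of an observed window $W$ reduces to

        $$s(W) = h\big(p(Y \mid W)\big), \qquad h\big(\mathcal{N}(\mu,\Sigma)\big) = \tfrac{1}{2}\log\big((2\pi e)^d \det\Sigma\big),$$

    so the saliency map can be computed directly from the predictive covariance, without training a separate saliency mechanism.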
    Walk for Learning: A Random Walk Approach for Federated Learning from Heterogeneous Data. (arXiv:2206.00737v1 [cs.LG])
    We consider the problem of a Parameter Server (PS) that wishes to learn a model that fits data distributed on the nodes of a graph. We focus on Federated Learning (FL) as a canonical application. One of the main challenges of FL is the communication bottleneck between the nodes and the parameter server. A popular solution in the literature is to allow each node to do several local updates on the model in each iteration before sending it back to the PS. While this mitigates the communication bottleneck, the statistical heterogeneity of the data owned by the different nodes has proven to delay convergence and bias the model. In this work, we study random walk (RW) learning algorithms for tackling the communication and data heterogeneity problems. The main idea is to leverage available direct connections among the nodes themselves, which are typically "cheaper" than the communication to the PS. In a random walk, the model is thought of as a "baton" that is passed from a node to one of its neighbors after being updated in each iteration. The challenge in designing the RW is the data heterogeneity and the uncertainty about the data distributions. Ideally, we would want to visit more often nodes that hold more informative data. We cast this problem as a sleeping multi-armed bandit (MAB) to design a near-optimal node sampling strategy that achieves variance-reduced gradient estimates and approaches sub-linearly the optimal sampling strategy. Based on this framework, we present an adaptive random walk learning algorithm. We provide theoretical guarantees on its convergence. Our numerical results validate our theoretical findings and show that our algorithm outperforms existing random walk algorithms.
    A Communication-efficient Algorithm with Linear Convergence for Federated Minimax Learning. (arXiv:2206.01132v1 [cs.LG])
    In this paper, we study a large-scale multi-agent minimax optimization problem, which models many interesting applications in statistical learning and game theory, including Generative Adversarial Networks (GANs). The overall objective is a sum of agents' private local objective functions. We first analyze an important special case, empirical minimax problem, where the overall objective approximates a true population minimax risk by statistical samples. We provide generalization bounds for learning with this objective through Rademacher complexity analysis. Then, we focus on the federated setting, where agents can perform local computation and communicate with a central server. Most existing federated minimax algorithms either require communication per iteration or lack performance guarantees with the exception of Local Stochastic Gradient Descent Ascent (SGDA), a multiple-local-update descent ascent algorithm which guarantees convergence under a diminishing stepsize. By analyzing Local SGDA under the ideal condition of no gradient noise, we show that generally it cannot guarantee exact convergence with constant stepsizes and thus suffers from slow rates of convergence. To tackle this issue, we propose FedGDA-GT, an improved Federated (Fed) Gradient Descent Ascent (GDA) method based on Gradient Tracking (GT). When local objectives are Lipschitz smooth and strongly-convex-strongly-concave, we prove that FedGDA-GT converges linearly with a constant stepsize to global $\epsilon$-approximation solution with $\mathcal{O}(\log (1/\epsilon))$ rounds of communication, which matches the time complexity of centralized GDA method. Finally, we numerically show that FedGDA-GT outperforms Local SGDA.
    Split-kl and PAC-Bayes-split-kl Inequalities. (arXiv:2206.00706v1 [stat.ML])
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality combines the combinatorial power of the kl inequality with the ability to exploit low variance. While for Bernoulli random variables the kl inequality is tighter than the Empirical Bernstein inequality, for random variables taking values inside a bounded interval and having low variance the Empirical Bernstein inequality is tighter than the kl. The proposed split-kl inequality yields the best of both worlds. We discuss an application of the split-kl inequality to bounding excess losses. We also derive a PAC-Bayes-split-kl inequality and use a synthetic example and several UCI datasets to compare it with the PAC-Bayes-kl, PAC-Bayes Empirical Bernstein, PAC-Bayes Unexpected Bernstein, and PAC-Bayes Empirical Bennett inequalities.
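    For context, the kl inequality referenced above is the standard bound for the mean of $n$ i.i.d. random variables in $[0,1]$: with probability at least $1-\delta$,
    $$\mathrm{kl}(\hat{p}\,\|\,p) \le \frac{\ln(2\sqrt{n}/\delta)}{n}, \qquad \mathrm{kl}(p\,\|\,q) = p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q},$$
    where $\hat{p}$ is the empirical mean and $p$ the true mean. Roughly, the split-kl inequality decomposes a bounded variable into binary exceedance components and applies a kl bound to each, so that low variance translates into tighter control.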
    Leveraging Systematic Knowledge of 2D Transformations. (arXiv:2206.00893v1 [cs.CV])
    Existing deep learning models suffer from out-of-distribution (o.o.d.) performance drops in computer vision tasks. In comparison, humans have a remarkable ability to interpret images, even if the scenes in the images are rare, thanks to the systematicity of acquired knowledge. This work focuses on 1) the acquisition of systematic knowledge of 2D transformations, and 2) architectural components that can leverage the learned knowledge in image classification tasks in an o.o.d. setting. With a new training methodology based on synthetic datasets that are constructed under the causal framework, the deep neural networks acquire knowledge from semantically different domains (e.g. even from noise), and exhibit a certain level of systematicity in parameter estimation experiments. Based on this, a novel architecture is devised consisting of a classifier, an estimator and an identifier (abbreviated as "CED"). By emulating the "hypothesis-verification" process in human visual perception, CED improves the classification accuracy significantly on test sets under covariate shift.
    Dataset Condensation via Efficient Synthetic-Data Parameterization. (arXiv:2205.14959v2 [cs.LG] UPDATED)
    The great success of machine learning with massive amounts of data comes at a price of huge computation costs and storage for training and tuning. Recent studies on dataset condensation attempt to reduce the dependence on such massive data by synthesizing a compact training dataset. However, the existing approaches have fundamental limitations in optimization due to the limited representability of synthetic datasets without considering any data regularity characteristics. To this end, we propose a novel condensation framework that generates multiple synthetic data with a limited storage budget via efficient parameterization considering data regularity. We further analyze the shortcomings of the existing gradient matching-based condensation methods and develop an effective optimization technique for improving the condensation of training data information. We propose a unified algorithm that drastically improves the quality of condensed data against the current state-of-the-art on CIFAR-10, ImageNet, and Speech Commands.
    Neural Decoding with Optimization of Node Activations. (arXiv:2206.00786v1 [cs.IT])
    The problem of maximum likelihood decoding with a neural decoder for error-correcting codes is considered. It is shown that the neural decoder can be improved with two novel loss terms on the node's activations. The first loss term imposes a sparse constraint on the node's activations, whereas the second loss term tries to mimic the node's activations of a teacher decoder that has better performance. The proposed method has the same run time complexity and model size as the neural Belief Propagation decoder, while improving the decoding performance by up to $1.1$ dB on BCH codes.
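    A minimal PyTorch sketch of the two activation-level loss terms; the weighting coefficients, the L1 form of the sparsity penalty, and the MSE form of the teacher-mimicking term are assumptions for illustration.

        import torch
        import torch.nn.functional as F

        def decoder_loss(logits, targets, student_act, teacher_act,
                         lam_sparse=1e-3, lam_teach=1e-2):
            """Decoding loss plus two activation terms (illustrative weights)."""
            base = F.binary_cross_entropy_with_logits(logits, targets)
            sparse = student_act.abs().mean()              # sparsity constraint on activations
            mimic = F.mse_loss(student_act, teacher_act)   # mimic a stronger teacher decoder
            return base + lam_sparse * sparse + lam_teach * mimic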
    Feature Space Particle Inference for Neural Network Ensembles. (arXiv:2206.00944v1 [cs.LG])
    Ensembles of deep neural networks demonstrate improved performance over single models. For enhancing the diversity of ensemble members while keeping their performance, particle-based inference methods offer a promising approach from a Bayesian perspective. However, the best way to apply these methods to neural networks is still unclear: seeking samples from the weight-space posterior suffers from inefficiency due to the over-parameterization issues, while seeking samples directly from the function-space posterior often results in serious underfitting. In this study, we propose optimizing particles in the feature space where the activation of a specific intermediate layer lies to address the above-mentioned difficulties. Our method encourages each member to capture distinct features, which is expected to improve ensemble prediction robustness. Extensive evaluation on real-world datasets shows that our model significantly outperforms the gold-standard Deep Ensembles on various metrics, including accuracy, calibration, and robustness. Code is available at https://github.com/DensoITLab/featurePI .
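    Particle-based inference of this kind is commonly instantiated with Stein variational gradient descent (SVGD); a generic SVGD step over a set of particles (which the paper applies to feature-space representations rather than weights) looks roughly as follows, with the RBF kernel and median-heuristic bandwidth as assumptions.

        import numpy as np

        def svgd_step(particles, grad_log_p, step=0.1):
            """One SVGD update: particles (n, d); grad_log_p maps (n, d) -> (n, d)."""
            n = len(particles)
            diffs = particles[:, None, :] - particles[None, :, :]   # (n, n, d)
            sq = (diffs ** 2).sum(-1)                               # squared distances
            h = np.median(sq) / np.log(n + 1) + 1e-8                # median heuristic
            K = np.exp(-sq / h)                                     # RBF kernel matrix
            grads = grad_log_p(particles)
            # Attraction toward high density plus a repulsive term for diversity.
            phi = (K @ grads + (K[:, :, None] * (2.0 / h) * diffs).sum(1)) / n
            return particles + step * phi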
    DNN-assisted Particle-based Bayesian Joint Synchronization and Localization. (arXiv:2110.02771v2 [cs.IT] UPDATED)
    In this work, we propose a Deep neural network-assisted Particle Filter-based (DePF) approach to address the Mobile User (MU) joint synchronization and localization (sync\&loc) problem in ultra dense networks. In particular, DePF deploys an asymmetric time-stamp exchange mechanism between the MUs and the Access Points (APs), which, traditionally, provides information about the MUs' clock offset and skew. However, information about the distance between an AP and an MU is also intrinsic to the propagation delay experienced by the exchanged time-stamps. In addition, to estimate the angle of arrival of the received synchronization packet, DePF draws on the multiple signal classification algorithm, which is fed by the Channel Impulse Response (CIR) experienced by the sync packets. The CIR is also leveraged to determine the link condition, i.e. Line-of-Sight (LoS) or Non-LoS. Finally, to perform joint sync\&loc, DePF capitalizes on particle Gaussian mixtures that allow for a hybrid particle-based and parametric Bayesian Recursive Filtering (BRF) fusion of the aforementioned pieces of information and thus jointly estimate the position and clock parameters of the MUs. The simulation results verify the superiority of the proposed algorithm over the state-of-the-art schemes, in particular Extended Kalman filter- and linearized BRF-based joint sync\&loc. Specifically, drawing only on the synchronization time-stamp exchange and CIRs, for 90$\%$ of the cases the absolute position and clock offset estimation errors remain below 1 meter and 2 nanoseconds, respectively.
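    The information carried by a two-way time-stamp exchange can be illustrated with the classic symmetric-delay estimates below; the actual protocol in the paper is asymmetric and additionally tracks clock skew, so this is a simplified sketch.

        C = 299_792_458.0  # speed of light, m/s

        def offset_and_distance(t1, t2, t3, t4):
            """t1: request sent (MU clock); t2: request received (AP clock);
            t3: reply sent (AP clock); t4: reply received (MU clock).
            Assumes a symmetric propagation delay."""
            offset = ((t2 - t1) - (t4 - t3)) / 2.0   # MU clock offset relative to the AP
            delay = ((t2 - t1) + (t4 - t3)) / 2.0    # one-way propagation delay
            return offset, C * delay                 # offset [s], distance [m]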
    On the reversibility of adversarial attacks. (arXiv:2206.00772v1 [cs.LG])
    Adversarial attacks modify images with perturbations that change the prediction of classifiers. These modified images, known as adversarial examples, expose the vulnerabilities of deep neural network classifiers. In this paper, we investigate the predictability of the mapping between the classes predicted for original images and for their corresponding adversarial examples. This predictability relates to the possibility of retrieving the original predictions and hence reversing the induced misclassification. We refer to this property as the reversibility of an adversarial attack, and quantify reversibility as the accuracy in retrieving the original class or the true class of an adversarial example. We present an approach that reverses the effect of an adversarial attack on a classifier using a prior set of classification results. We analyse the reversibility of state-of-the-art adversarial attacks on benchmark classifiers and discuss the factors that affect the reversibility.
    Anarchic Federated Learning. (arXiv:2108.09875v2 [cs.LG] UPDATED)
    Present-day federated learning (FL) systems deployed over edge networks consist of a large number of workers with high degrees of heterogeneity in data and/or computing capabilities, which calls for flexible worker participation in terms of timing, effort, data heterogeneity, etc. To satisfy the need for flexible worker participation, we consider a new FL paradigm called "Anarchic Federated Learning" (AFL) in this paper. In stark contrast to conventional FL models, each worker in AFL has the freedom to choose i) when to participate in FL, and ii) the number of local steps to perform in each round based on its current situation (e.g., battery level, communication channels, privacy concerns). However, such chaotic worker behaviors in AFL impose many new open questions in algorithm design. In particular, it remains unclear whether one could develop convergent AFL training algorithms, and if so, under what conditions and at what convergence speed. Toward this end, we propose two Anarchic Federated Averaging (AFA) algorithms with two-sided learning rates for both cross-device and cross-silo settings, which are named AFA-CD and AFA-CS, respectively. Somewhat surprisingly, we show that, under mild anarchic assumptions, both AFL algorithms achieve the same best-known convergence rate as the state-of-the-art algorithms for conventional FL. Moreover, they retain the highly desirable {\em linear speedup effect} with respect to both the number of workers and local steps in the new AFL paradigm. We validate the proposed algorithms with extensive experiments on real-world datasets.
    Leveraging Non-uniformity in First-order Non-convex Optimization. (arXiv:2105.06072v3 [cs.LG] UPDATED)
    Classical global convergence results for first-order methods rely on uniform smoothness and the \L{}ojasiewicz inequality. Motivated by properties of objective functions that arise in machine learning, we propose a non-uniform refinement of these notions, leading to \emph{Non-uniform Smoothness} (NS) and \emph{Non-uniform \L{}ojasiewicz inequality} (N\L{}). The new definitions inspire new geometry-aware first-order methods that are able to converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds. To illustrate the power of these geometry-aware methods and their corresponding non-uniform analysis, we consider two important problems in machine learning: policy gradient optimization in reinforcement learning (PG), and generalized linear model training in supervised learning (GLM). For PG, we find that normalizing the gradient ascent method can accelerate convergence to $O(e^{-t})$ while incurring less overhead than existing algorithms. For GLM, we show that geometry-aware normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. We additionally show that the proposed geometry-aware descent methods escape landscape plateaus faster than standard gradient descent. Experimental results are used to illustrate and complement the theoretical findings.
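    The geometry-aware methods above rely on normalizing the update direction; a bare-bones normalized gradient descent step, with an epsilon guard added as an assumption for numerical safety:

        import numpy as np

        def ngd_step(x, grad, lr=0.1, eps=1e-12):
            """Normalized gradient descent: the step length is lr regardless of |grad|."""
            return x - lr * grad / (np.linalg.norm(grad) + eps)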
    Combining Machine Learning and Agent-Based Modeling to Study Biomedical Systems. (arXiv:2206.01092v1 [q-bio.QM])
    Agent-based modeling (ABM) is a well-established paradigm for simulating complex systems via interactions between constituent entities. Machine learning (ML) refers to approaches whereby statistical algorithms 'learn' from data on their own, without imposing a priori theories of system behavior. Biological systems -- from molecules, to cells, to entire organisms -- consist of vast numbers of entities, governed by complex webs of interactions that span many spatiotemporal scales and exhibit nonlinearity, stochasticity and intricate coupling between entities. The macroscopic properties and collective dynamics of such systems are difficult to capture via continuum modelling and mean-field formalisms. ABM takes a 'bottom-up' approach that obviates these difficulties by enabling one to easily propose and test a set of well-defined 'rules' to be applied to the individual entities (agents) in a system. Evaluating a system and propagating its state over discrete time-steps effectively simulates the system, allowing observables to be computed and system properties to be analyzed. Because the rules that govern an ABM can be difficult to abstract and formulate from experimental data, there is an opportunity to use ML to help infer optimal, system-specific ABM rules. Once such rule-sets are devised, ABM calculations can generate a wealth of data, and ML can be applied there too -- e.g., to probe statistical measures that meaningfully describe a system's stochastic properties. As an example of synergy in the other direction (from ABM to ML), ABM simulations can generate realistic datasets for training ML algorithms (e.g., for regularization, to mitigate overfitting). In these ways, one can envision various synergistic ABM$\rightleftharpoons$ML loops. This review summarizes how ABM and ML have been integrated in contexts that span spatial scales from the cellular level to population-level epidemiology.
    Learning to Untangle Genome Assembly with Graph Convolutional Networks. (arXiv:2206.00668v1 [q-bio.GN])
    A quest to determine the complete sequence of a human DNA from telomere to telomere started three decades ago and was finally completed in 2021. This accomplishment was a result of a tremendous effort of numerous experts who engineered various tools and performed laborious manual inspection to achieve the first gapless genome sequence. However, such a method can hardly be used as a general approach to assemble different genomes, especially when the assembly speed is critical given the large amount of data. In this work, we explore a different approach to the central part of the genome assembly task that consists of untangling a large assembly graph from which a genomic sequence needs to be reconstructed. Our main motivation is to reduce human-engineered heuristics and use deep learning to develop more generalizable reconstruction techniques. Precisely, we introduce a new learning framework to train a graph convolutional network to resolve assembly graphs by finding a correct path through them. The training is supervised with a dataset generated from the resolved CHM13 human sequence and tested on assembly graphs built using real human PacBio HiFi reads. Experimental results show that a model trained on simulated graphs generated solely from a single chromosome is able to resolve all other chromosomes remarkably well. Moreover, the model outperforms hand-crafted heuristics from a state-of-the-art \textit{de novo} assembler on the same graphs. Chromosomes reconstructed with graph networks are more accurate at the nucleotide level, produce fewer contigs, and achieve a higher reconstructed genome fraction and better NG50/NGA50 assessment metrics.
    Graph Kernels Based on Multi-scale Graph Embeddings. (arXiv:2206.00979v1 [cs.LG])
    Graph kernels are conventional methods for computing graph similarities. However, most of the R-convolution graph kernels face two challenges: 1) They cannot compare graphs at multiple different scales, and 2) they do not consider the distributions of substructures when computing the kernel matrix. These two challenges limit their performance. To mitigate the two challenges, we propose a novel graph kernel called the Multi-scale Path-pattern Graph kernel (MPG), at the heart of which is the multi-scale path-pattern node feature map. Each element of the path-pattern node feature map is the number of occurrences of a path-pattern around a node. A path-pattern is constructed by the concatenation of all the node labels in a path of a truncated BFS tree rooted at each node. Since the path-pattern node feature map can only compare graphs at local scales, we incorporate into it the multiple different scales of the graph structure, which are captured by the truncated BFS trees of different depths. We use the Wasserstein distance to compute the similarity between the multi-scale path-pattern node feature maps of two graphs, considering the distributions of substructures. We empirically validate MPG on various benchmark graph datasets and demonstrate that it achieves state-of-the-art performance.
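    A rough sketch of the path-pattern node feature map: for each node, enumerate root-to-node label sequences in a BFS tree truncated at a given depth and count their occurrences. The depth cap and the plain BFS-tree construction are simplifying assumptions; the kernel would then compare these counts across graphs via a Wasserstein distance.

        from collections import Counter, deque

        def path_pattern_map(adj, labels, depth=2):
            """adj: {node: [neighbors]}; labels: {node: label}; returns {node: Counter}."""
            feats = {}
            for root in adj:
                counts, visited = Counter(), {root}
                queue = deque([(root, (labels[root],))])
                while queue:
                    node, pattern = queue.popleft()
                    counts[pattern] += 1               # one occurrence of this path-pattern
                    if len(pattern) <= depth:
                        for nb in adj[node]:
                            if nb not in visited:      # BFS tree: visit each node once
                                visited.add(nb)
                                queue.append((nb, pattern + (labels[nb],)))
                feats[root] = counts
            return feats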
    Introducing One Sided Margin Loss for Solving Classification Problems in Deep Networks. (arXiv:2206.01002v1 [cs.LG])
    This paper introduces a new loss function, OSM (One-Sided Margin), to solve maximum-margin classification problems effectively. Unlike the hinge loss, in OSM the margin is explicitly determined with corresponding hyperparameters and then the classification problem is solved. In experiments, we observe that using the OSM loss leads to faster training speeds and better accuracies than binary and categorical cross-entropy in several commonly used deep models for classification and optical character recognition problems. OSM has consistently shown better classification accuracies than cross-entropy and hinge losses for small to large neural networks. It has also led to a more efficient training procedure. We achieved state-of-the-art accuracies for small networks on several benchmark datasets: CIFAR10 (98.82\%), CIFAR100 (91.56\%), Flowers (98.04\%), and Stanford Cars (93.91\%), with considerable improvements over other loss functions. Moreover, for large networks, the accuracies are also better than those of cross-entropy and hinge loss. Therefore, we strongly believe that OSM is a powerful alternative to hinge and cross-entropy losses for training deep neural networks on classification tasks.
    Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers. (arXiv:2009.01367v3 [cs.LG] UPDATED)
    While neural network binary classifiers are often evaluated on metrics such as Accuracy and $F_1$-Score, they are commonly trained with a cross-entropy objective. How can this training-evaluation gap be addressed? While specific techniques have been adopted to optimize certain confusion matrix based metrics, it is challenging or impossible in some cases to generalize the techniques to other metrics. Adversarial learning approaches have also been proposed to optimize networks via confusion matrix based metrics, but they tend to be much slower than common training methods. In this work, we propose a unifying approach to training neural network binary classifiers that combines a differentiable approximation of the Heaviside function with a probabilistic view of the typical confusion matrix values using soft sets. Our theoretical analysis shows the benefit of using our method to optimize for a given evaluation metric, such as $F_1$-Score, with soft sets, and our extensive experiments show the effectiveness of our approach in several domains.
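    The core construction can be sketched by replacing the Heaviside step with a sigmoid and accumulating soft confusion-matrix counts; the temperature and this particular surrogate are assumptions for illustration, shown here for an $F_1$ objective.

        import torch

        def soft_f1_loss(probs, labels, tau=0.5, temp=0.1):
            """probs, labels: (n,) tensors; differentiable surrogate for 1 - F1."""
            h = torch.sigmoid((probs - tau) / temp)   # smooth stand-in for 1[probs > tau]
            tp = (h * labels).sum()                   # soft true positives
            fp = (h * (1 - labels)).sum()             # soft false positives
            fn = ((1 - h) * labels).sum()             # soft false negatives
            f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
            return 1 - f1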
    Learning to Solve PDE-constrained Inverse Problems with Graph Networks. (arXiv:2206.00711v1 [cs.LG])
    Learned graph neural networks (GNNs) have recently been established as fast and accurate alternatives for principled solvers in simulating the dynamics of physical systems. In many application domains across science and engineering, however, we are not only interested in a forward simulation but also in solving inverse problems with constraints defined by a partial differential equation (PDE). Here we explore GNNs to solve such PDE-constrained inverse problems. Given a sparse set of measurements, we are interested in recovering the initial condition or parameters of the PDE. We demonstrate that GNNs combined with autodecoder-style priors are well-suited for these tasks, achieving more accurate estimates of initial conditions or physical parameters than other learned approaches when applied to the wave equation or Navier-Stokes equations. We also demonstrate computational speedups of up to 90x using GNNs compared to principled solvers. Project page: https://cyanzhao42.github.io/LearnInverseProblem
    Vygotskian Autotelic Artificial Intelligence: Language and Culture Internalization for Human-Like AI. (arXiv:2206.01134v1 [cs.AI])
    Building autonomous artificial agents able to grow open-ended repertoires of skills is one of the fundamental goals of AI. To that end, a promising developmental approach recommends the design of intrinsically motivated agents that learn new skills by generating and pursuing their own goals - autotelic agents. However, existing algorithms still show serious limitations in terms of goal diversity, exploration, generalization or skill composition. This perspective calls for the immersion of autotelic agents into rich socio-cultural worlds. We focus on language especially, and how its structure and content may support the development of new cognitive functions in artificial agents, just like it does in humans. Indeed, most of our skills could not be learned in isolation. Formal education teaches us to reason systematically, books teach us history, and YouTube might teach us how to cook. Crucially, our values, traditions, norms and most of our goals are cultural in essence. This knowledge and, some argue, some of our cognitive functions such as abstraction, compositional imagination or relational thinking are formed through linguistic and cultural interactions. Inspired by the work of Vygotsky, we suggest the design of Vygotskian autotelic agents able to interact with others and, more importantly, able to internalize these interactions to transform them into cognitive tools supporting the development of new cognitive functions. This perspective paper proposes a new AI paradigm in the quest for artificial lifelong skill discovery. It justifies the approach by uncovering examples of new artificial cognitive functions emerging from interactions between language and embodiment in recent works at the intersection of deep reinforcement learning and natural language processing. Looking forward, it highlights future opportunities and challenges for Vygotskian Autotelic AI research.
    From Cities to Series: Complex Networks and Deep Learning for Improved Spatial and Temporal Analytics*. (arXiv:2206.01176v1 [cs.LG])
    Graphs have often been used to answer questions about the interaction between real-world entities by taking advantage of their capacity to represent complex topologies. Complex networks are known to be graphs that capture such non-trivial topologies; they are able to represent human phenomena such as epidemic processes, the dynamics of populations, and the urbanization of cities. The investigation of complex networks has been extrapolated to many fields of science, with particular emphasis on computing techniques, including artificial intelligence. In such a case, the analysis of the interaction between entities of interest is transposed to the internal learning of algorithms, a paradigm whose investigation is able to expand the state of the art in Computer Science. By exploring this paradigm, this thesis puts together complex networks and machine learning techniques to improve the understanding of the human phenomena observed in pandemics, pendular migration, and street networks. Accordingly, our contributions are: (i) a new neural network architecture capable of modeling dynamic processes observed in spatial and temporal data with applications in epidemics propagation, weather forecasting, and patient monitoring in intensive care units; (ii) a machine-learning methodology for analyzing and predicting links in the scope of human mobility between all the cities of Brazil; and, (iii) techniques for identifying inconsistencies in the urban planning of cities while tracking the most influential vertices, with applications over Brazilian and worldwide cities. We obtained results sustained by sound evidence of advances to the state of the art in artificial intelligence, rigorous formalisms, and ample experimentation. Our findings rely upon real-world applications in a range of domains, demonstrating the applicability of our methodologies.
    Finite-Time Analysis of Entropy-Regularized Neural Natural Actor-Critic Algorithm. (arXiv:2206.00833v1 [cs.LG])
    Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization and optimization techniques (e.g., gradient clipping and averaging) to achieve provably good performance in terms of sample complexity, iteration complexity and overparametrization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and averaging ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies and (ii) regularization leads to sharp sample complexity and network width bounds in the regularized MDPs, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of uniform approximation power of the actor neural network to achieve global optimality in policy optimization due to distributional shift.
    Primal-dual extrapolation methods for monotone inclusions under local Lipschitz continuity with applications to variational inequality, conic constrained saddle point, and convex conic optimization problems. (arXiv:2206.00973v1 [math.OC])
    In this paper we consider a class of structured monotone inclusion (MI) problems that consist of finding a zero in the sum of two monotone operators, in which one is maximal monotone while another is locally Lipschitz continuous. In particular, we first propose a primal-dual extrapolation (PDE) method for solving a structured strongly MI problem by modifying the classical forward-backward splitting method by using a point and operator extrapolation technique, in which the parameters are adaptively updated by a backtracking line search scheme. The proposed PDE method is almost parameter-free, equipped with a verifiable termination criterion, and enjoys an operation complexity of ${\cal O}(\log \epsilon^{-1})$, measured by the amount of fundamental operations consisting only of evaluations of one operator and resolvent of another operator, for finding an $\epsilon$-residual solution of the structured strongly MI problem. We then propose another PDE method for solving a structured non-strongly MI problem by applying the above PDE method to approximately solve a sequence of structured strongly MI problems. The resulting PDE method is parameter-free, equipped with a verifiable termination criterion, and enjoys an operation complexity of ${\cal O}(\epsilon^{-1}\log \epsilon^{-1})$ for finding an $\epsilon$-residual solution of the structured non-strongly MI problem. As a consequence, we apply the latter PDE method to convex conic optimization, conic constrained saddle point, and variational inequality problems, and obtain complexity results for finding an $\epsilon$-KKT or $\epsilon$-residual solution of them under local Lipschitz continuity. To the best of our knowledge, no prior studies were conducted to investigate methods with complexity guarantees for solving the aforementioned problems under local Lipschitz continuity. All the complexity results obtained in this paper are entirely new.
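    For reference, the classical forward-backward splitting that the PDE method modifies iterates, for a step size $\lambda > 0$,
    $$x_{k+1} = (I + \lambda B)^{-1}\left(x_k - \lambda A x_k\right),$$
    where $A$ is the locally Lipschitz single-valued operator handled in the forward step and $(I + \lambda B)^{-1}$ is the resolvent of the maximal monotone operator $B$; the proposed method augments this iteration with point and operator extrapolation and a backtracking choice of $\lambda$.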
    Mask-Guided Divergence Loss Improves the Generalization and Robustness of Deep Neural Network. (arXiv:2206.00913v1 [cs.LG])
    A deep neural network (DNN) with dropout can be regarded as an ensemble model consisting of many sub-DNNs (i.e., an ensemble of sub-DNNs, where a sub-DNN is the remaining part of the DNN after dropout), and increasing the diversity of these sub-DNNs can effectively improve the generalization and robustness of the DNN. In this paper, a mask-guided divergence loss function (MDL), which consists of a cross-entropy loss term and an orthogonal term, is proposed to increase the diversity of the ensemble of sub-DNNs through the added orthogonal term. In particular, the mask technique is introduced to assist in generating the orthogonal term, avoiding overfitting of the diversity learning. The theoretical analysis and extensive experiments on 4 datasets (i.e., MNIST, FashionMNIST, CIFAR10, and CIFAR100) demonstrate that MDL can improve the generalization and robustness of standard training and adversarial training. For CIFAR10 and CIFAR100, in standard training, the maximum improvement of accuracy is $1.38\%$ on natural data, $30.97\%$ under the FGSM (i.e., Fast Gradient Sign Method) attack, and $38.18\%$ under the PGD (i.e., Projected Gradient Descent) attack; in adversarial training, the maximum improvement is $1.68\%$ on natural data, $4.03\%$ under the FGSM attack, and $2.65\%$ under the PGD attack.
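    One plausible instantiation of a cross-entropy-plus-orthogonality objective over dropout sub-networks, in PyTorch; the two-pass construction, the masking of the true class, and the inner-product diversity term are assumptions, not the exact MDL.

        import torch
        import torch.nn.functional as F

        def diversity_loss(model, x, y, lam=0.5):
            """Cross-entropy plus an orthogonality-style term between two dropout
            sub-DNNs (a plausible instantiation, not the exact MDL)."""
            logits1, logits2 = model(x), model(x)   # two passes, two dropout masks
            ce = F.cross_entropy(logits1, y)
            p1, p2 = F.softmax(logits1, dim=1), F.softmax(logits2, dim=1)
            mask = F.one_hot(y, p1.size(1)).bool()
            q1 = p1.masked_fill(mask, 0.0)          # keep only non-target probabilities,
            q2 = p2.masked_fill(mask, 0.0)          # where diversity is encouraged
            ortho = (q1 * q2).sum(dim=1).mean()     # near-orthogonal residual outputs
            return ce + lam * ortho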
    Boosting Independent Component Analysis. (arXiv:2112.06920v3 [stat.ML] UPDATED)
    Independent component analysis is intended to recover the mutually independent components from their linear mixtures. This technique has been widely used in many fields, such as data analysis, signal processing, and machine learning. To alleviate the dependency on prior knowledge concerning unknown sources, many nonparametric methods have been proposed. In this paper, we present a novel boosting-based algorithm for independent component analysis. Our algorithm consists of maximum likelihood estimation via boosting and seeking the unmixing matrix by the fixed-point method. A variety of experiments validate its performance in comparison with many of the presently known algorithms.
    Fast Benchmarking of Accuracy vs. Training Time with Cyclic Learning Rates. (arXiv:2206.00832v1 [cs.LG])
    Benchmarking the tradeoff between neural network accuracy and training time is computationally expensive. Here we show how a multiplicative cyclic learning rate schedule can be used to construct a tradeoff curve in a single training run. We generate cyclic tradeoff curves for combinations of training methods such as Blurpool, Channels Last, Label Smoothing and MixUp, and highlight how these cyclic tradeoff curves can be used to evaluate the effects of algorithmic choices on network training efficiency.
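    A multiplicative cyclic schedule of the kind described can be generated in a few lines; the cycle length, decay factor, and peak learning rate are illustrative assumptions. Probing accuracy at the end of each cycle then traces the accuracy-vs-training-time curve within a single run.

        def cyclic_lr(step, lr_max=0.1, decay=0.95, cycle_len=100):
            """Within each cycle the learning rate decays multiplicatively
            from lr_max, then resets at the next cycle boundary."""
            return lr_max * decay ** (step % cycle_len)

        schedule = [cyclic_lr(t) for t in range(300)]   # three cycles of 100 steps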
    Dynamic Structure Estimation from Bandit Feedback. (arXiv:2206.00861v1 [cs.DM])
    This work presents a novel method for estimating the structure of an underlying dynamical system. We tackle the problem of estimating dynamic structure from bandit feedback contaminated by sub-Gaussian noise. In particular, we focus on periodically behaved discrete dynamical systems in the Euclidean space, and carefully identify a certain obtainable subset of the full information of the periodic structure. We then derive a sample complexity bound for periodic structure estimation. Technically, asymptotic results for exponential sums are adopted to effectively average out the noise effects while preventing the information to be estimated from vanishing. For linear systems, the use of the Weyl sum further allows us to extract eigenstructures. Our theoretical claims are experimentally validated on simulations of toy examples, including Cellular Automata.
    Sampling Trade-Offs in Duty-Cycled Systems for Air Quality Low-Cost Sensors. (arXiv:2112.09072v2 [eess.SP] UPDATED)
    The use of low-cost sensors in conjunction with high-precision instrumentation for air pollution monitoring has shown promising results in recent years. One of the main challenges for these sensors has been the quality of their data, which is why the main efforts have focused on calibrating the sensors using machine learning techniques to improve the data quality. However, there is one aspect that has been overlooked, that is, these sensors are mounted on nodes that may have energy consumption restrictions if they are battery-powered. In this paper, we show the usual sensor data gathering process and we study the existing trade-offs between the sampling of such sensors, the quality of the sensor calibration, and the power consumption involved. To this end, we conduct experiments on prototype nodes measuring tropospheric ozone, nitrogen dioxide, and nitrogen monoxide at high frequency. The results show that the sensor sampling strategy directly affects the quality of the air pollution estimation and that each type of sensor may require different sampling strategies. In addition, duty cycles of 0.1 can be achieved when the sensors have response times in the order of two minutes, and duty cycles between 0.01 and 0.02 can be achieved when the sensor response times are negligible, calibrating with hourly reference values and maintaining a quality of calibrated data similar to when the node is connected to an uninterruptible power supply.
    The effective noise of Stochastic Gradient Descent. (arXiv:2112.10852v3 [cond-mat.dis-nn] UPDATED)
    Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini-batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces a stochastic dynamics to the gradient descent, with a non-trivial state-dependent noise. We characterize the stochasticity of SGD and a recently-introduced variant, \emph{persistent} SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as a function of the problem parameters. In the over-parametrized regime, where the training error vanishes, we measure the noise magnitude of SGD by computing the average distance between two replicas of the system with the same initialization and two different realizations of SGD noise. We find that the two noise measures behave similarly as a function of the problem parameters. Moreover, we observe that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
    Assessing the trade-off between prediction accuracy and interpretability for topic modeling on energetic materials corpora. (arXiv:2206.00773v1 [cs.CL])
    As the amount and variety of energetics research increases, machine-aware topic identification is necessary to streamline future research pipelines. The makeup of an automatic topic identification process consists of creating document representations and performing classification. However, the implementation of these processes on energetics research imposes new challenges. Energetics datasets contain many scientific terms that are necessary to understand the context of a document but may require more complex document representations. Secondly, the predictions from classification must be understandable and trusted by the chemists within the pipeline. In this work, we study the trade-off between prediction accuracy and interpretability by implementing three document embedding methods that vary in computational complexity. Alongside our accuracy results, we also provide local interpretable model-agnostic explanations (LIME) to give a localized understanding of each prediction and to validate classifier decisions with our team of energetics experts. This study was carried out on a novel labeled energetics dataset created and validated by our team of energetics experts.
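    With the lime package, a per-prediction explanation of a text classifier in such a pipeline looks roughly like this; document_text, the class names, and predict_proba are placeholders for the dataset-specific pieces.

        from lime.lime_text import LimeTextExplainer

        explainer = LimeTextExplainer(class_names=["topic_a", "topic_b"])  # hypothetical topics

        # predict_proba: any callable mapping a list of raw strings to an (n, k)
        # array of class probabilities, e.g. a scikit-learn Pipeline's predict_proba.
        exp = explainer.explain_instance(document_text, predict_proba, num_features=6)
        print(exp.as_list())   # (term, weight) pairs behind this single prediction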
    Hard Negative Sampling Strategies for Contrastive Representation Learning. (arXiv:2206.01197v1 [cs.LG])
    One of the challenges in contrastive learning is the selection of appropriate \textit{hard negative} examples, in the absence of label information. Random sampling or importance sampling methods based on feature similarity often lead to sub-optimal performance. In this work, we introduce UnReMix, a hard negative sampling strategy that takes into account anchor similarity, model uncertainty and representativeness. Experimental results on several benchmarks show that UnReMix improves negative sample selection, and subsequently downstream performance when compared to state-of-the-art contrastive learning methods.
    How Infinitely Wide Neural Networks Benefit from Multi-task Learning -- an Exact Macroscopic Characterization. (arXiv:2112.15577v3 [cs.LG] UPDATED)
    In practice, multi-task learning (through learning features shared among tasks) is an essential property of deep neural networks (NNs). While infinite-width limits of NNs can provide a good intuition for their generalization behavior, the well-known infinite-width limits of NNs in the literature (e.g., neural tangent kernels) assume specific settings in which wide ReLU-NNs behave like shallow Gaussian Processes with a fixed kernel. Consequently, in such settings, these NNs lose their ability to benefit from multi-task learning in the infinite-width limit. In contrast, we prove that optimizing wide ReLU neural networks with at least one hidden layer using L2-regularization on the parameters enforces multi-task learning due to representation-learning - also in the limiting regime where the network width tends to infinity. We present an exact quantitative characterization of this infinite width limit in an appropriate function space that neatly describes multi-task learning.
    Dynamic Cardiac MRI Reconstruction Using Combined Tensor Nuclear Norm and Casorati Matrix Nuclear Norm Regularizations. (arXiv:2206.00831v1 [eess.IV])
    Low-rank tensor models have been applied in accelerating dynamic magnetic resonance imaging (dMRI). Recently, a new tensor nuclear norm based on t-SVD has been proposed and applied to tensor completion. Inspired by the different properties of the tensor nuclear norm (TNN) and the Casorati matrix nuclear norm (MNN), we introduce a combined TNN and Casorati MNN regularization framework to reconstruct dMRI, which we term TMNN. The proposed method simultaneously exploits the spatial structure and the temporal correlation of the dynamic MR data. The optimization problem can be efficiently solved by the alternating direction method of multipliers (ADMM). In order to further improve the computational efficiency, we develop a fast algorithm under the Cartesian sampling scenario. Numerical experiments based on cardiac cine MRI and perfusion MRI data demonstrate the performance improvement over the traditional Casorati nuclear norm regularization method.
    Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. (arXiv:2206.00939v1 [stat.ML])
    The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.
    New Riemannian preconditioned algorithms for tensor completion via polyadic decomposition. (arXiv:2101.11108v2 [math.OC] UPDATED)
    We propose new Riemannian preconditioned algorithms for low-rank tensor completion via the polyadic decomposition of a tensor. These algorithms exploit a non-Euclidean metric on the product space of the factor matrices of the low-rank tensor in the polyadic decomposition form. This new metric is designed using an approximation of the diagonal blocks of the Hessian of the tensor completion cost function, and thus has a preconditioning effect on these algorithms. We prove that the proposed Riemannian gradient descent algorithm globally converges to a stationary point of the tensor completion problem, with convergence rate estimates using the \L{}ojasiewicz property. Numerical results on synthetic and real-world data suggest that the proposed algorithms are more efficient in memory and time compared to state-of-the-art algorithms. Moreover, the proposed algorithms display a greater tolerance for overestimated rank parameters in terms of the tensor recovery performance, thus enabling a flexible choice of the rank parameter.
    Nearly Optimal Best-of-Both-Worlds Algorithms for Online Learning with Feedback Graphs. (arXiv:2206.00873v1 [cs.LG])
    This study considers online learning with general directed feedback graphs. For this problem, we present best-of-both-worlds algorithms that achieve nearly tight regret bounds for adversarial environments as well as poly-logarithmic regret bounds for stochastic environments. As Alon et al. [2015] have shown, tight regret bounds depend on the structure of the feedback graph: \textit{strongly observable} graphs yield minimax regret of $\tilde{\Theta}( \alpha^{1/2} T^{1/2} )$, while \textit{weakly observable} graphs induce minimax regret of $\tilde{\Theta}( \delta^{1/3} T^{2/3} )$, where $\alpha$ and $\delta$, respectively, represent the independence number of the graph and the domination number of a certain portion of the graph. Our proposed algorithm for strongly observable graphs has a regret bound of $\tilde{O}( \alpha^{1/2} T^{1/2} ) $ for adversarial environments, as well as of $ {O} ( \frac{\alpha (\ln T)^3 }{\Delta_{\min}} ) $ for stochastic environments, where $\Delta_{\min}$ expresses the minimum suboptimality gap. This result resolves an open question raised by Erez and Koren [2021]. We also provide an algorithm for weakly observable graphs that achieves a regret bound of $\tilde{O}( \delta^{1/3}T^{2/3} )$ for adversarial environments and poly-logarithmic regret for stochastic environments. The proposed algorithms are based on the follow-the-perturbed-leader approach combined with newly designed update rules for learning rates.
    Applied Federated Learning: Architectural Design for Robust and Efficient Learning in Privacy Aware Settings. (arXiv:2206.00807v1 [cs.LG])
    The classical machine learning paradigm requires the aggregation of user data in a central location where machine learning practitioners can preprocess data, calculate features, tune models and evaluate performance. The advantage of this approach includes leveraging high performance hardware (such as GPUs) and the ability of machine learning practitioners to do in depth data analysis to improve model performance. However, these advantages may come at a cost to data privacy. User data is collected, aggregated, and stored on centralized servers for model development. Centralization of data poses risks, including a heightened risk of internal and external security incidents as well as accidental data misuse. Federated learning with differential privacy is designed to avoid the server-side centralization pitfall by bringing the ML learning step to users' devices. Learning is done in a federated manner where each mobile device runs a training loop on a local copy of a model. Updates from on-device models are sent to the server via encrypted communication and through differential privacy to improve the global model. In this paradigm, users' personal data remains on their devices. Surprisingly, model training in this manner comes at a fairly minimal degradation in model performance. However, federated learning comes with many other challenges due to its distributed nature, heterogeneous compute environments and lack of data visibility. This paper explores those challenges and outlines an architectural design solution we are exploring and testing to productionize federated learning at Meta scale.
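    The server-side core of federated averaging with differential privacy can be sketched as clipping each client update and adding calibrated Gaussian noise to the average; the clip norm and noise multiplier below are illustrative assumptions.

        import numpy as np

        def dp_fedavg_round(updates, clip=1.0, noise_mult=0.5, rng=np.random.default_rng()):
            """updates: list of per-client model deltas (1-D arrays of equal size)."""
            clipped = []
            for u in updates:
                norm = np.linalg.norm(u)
                clipped.append(u * min(1.0, clip / (norm + 1e-12)))  # bound each client's influence
            mean = np.mean(clipped, axis=0)
            noise = rng.normal(0.0, noise_mult * clip / len(updates), size=mean.shape)
            return mean + noise   # noisy average applied to the global model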
    Robustness to Label Noise Depends on the Shape of the Noise Distribution in Feature Space. (arXiv:2206.01106v1 [cs.LG])
    Machine learning classifiers have been demonstrated, both empirically and theoretically, to be robust to label noise under certain conditions -- notably the typical assumption is that label noise is independent of the features given the class label. We provide a theoretical framework that generalizes beyond this typical assumption by modeling label noise as a distribution over feature space. We show that both the scale and the shape of the noise distribution influence the posterior likelihood; and the shape of the noise distribution has a stronger impact on classification performance if the noise is concentrated in feature space where the decision boundary can be moved. For the special case of uniform label noise (independent of features and the class label), we show that the Bayes optimal classifier for $c$ classes is robust to label noise until the ratio of noisy samples goes above $\frac{c-1}{c}$ (e.g. 90% for 10 classes), which we call the tipping point. However, for the special case of class-dependent label noise (independent of features given the class label), the tipping point can be as low as 50%. Most importantly, we show that when the noise distribution targets decision boundaries (label noise is directly dependent on feature space), classification robustness can drop off even at a small scale of noise. Even when evaluating recent label-noise mitigation methods we see reduced accuracy when label noise is dependent on features. These findings explain why machine learning often handles label noise well if the noise distribution is uniform in feature-space; yet it also points to the difficulty of overcoming label noise when it is concentrated in a region of feature space where a decision boundary can move.
    Bayesian Model Selection, the Marginal Likelihood, and Generalization. (arXiv:2202.11678v2 [cs.LG] UPDATED)
    How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.
    Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective. (arXiv:2205.07320v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis (LTH) has attracted attention because it can explain why over-parameterized models often show high generalization ability. It is known that when we use iterative magnitude pruning (IMP), an algorithm for finding sparse subnetworks with high generalization ability that can be trained independently from the initial weights, called winning tickets, an initially large learning rate does not work well in deep neural networks such as ResNet. However, since an initially large learning rate generally helps the optimizer to converge to flatter minima, we hypothesize that the winning tickets have relatively sharp minima, which is considered a disadvantage in terms of generalization ability. In this paper, we confirm this hypothesis and show that the PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior. On the basis of our experimental findings that flatness is useful for improving accuracy and robustness to label noise and that the distance from the initial weights is deeply involved in winning tickets, we offer a PAC-Bayes bound using a spike-and-slab distribution to analyze winning tickets. Finally, we revisit existing algorithms for finding winning tickets from a PAC-Bayesian perspective and provide new insights into these methods.
    Robust Longitudinal Control for Vehicular Autonomous Platoons Using Deep Reinforcement Learning. (arXiv:2206.01175v1 [eess.SY])
    In the last few years, researchers have applied machine learning strategies in the context of vehicular platoons to increase the safety and efficiency of cooperative transportation. Reinforcement Learning methods have been employed in the longitudinal spacing control of Cooperative Adaptive Cruise Control systems, but to date, none of those studies have addressed problems of disturbance rejection in such scenarios. Characteristics such as uncertain parameters in the model and external interferences may prevent agents from reaching null-spacing errors when traveling at cruising speed. On the other hand, complex communication topologies lead to specific training processes that can not be generalized to other contexts, demanding re-training every time the configuration changes. Therefore, in this paper, we propose an approach to generalize the training process of a vehicular platoon, such that the acceleration command of each agent becomes independent of the network topology. Also, we have modeled the acceleration input as a term with integral action, such that the Convolutional Neural Network is capable of learning corrective actions when the states are disturbed by unknown effects. We illustrate the effectiveness of our proposal with experiments using different network topologies, uncertain parameters, and external forces. Comparative analyses, in terms of the steady-state error and overshoot response, were conducted against the state-of-the-art literature. The findings offer new insights concerning generalization and robustness of using Reinforcement Learning in the control of autonomous platoons.
    Policy Gradient Algorithms with Monte-Carlo Tree Search for Non-Markov Decision Processes. (arXiv:2206.01011v1 [cs.LG])
    Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent. Given a well-parameterized policy model, such as a neural network model, with appropriate initial parameters, PG algorithms work well even when the environment does not have the Markov property. Otherwise, they can be trapped on a plateau or suffer from peakiness effects. As another successful RL approach, algorithms based on Monte-Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking results, especially in the board game playing domain. They are also suitable for application to non-Markov decision processes. However, since standard MCTS does not have the ability to learn state representations, the size of the tree-search space can be too large to search. In this work, we examine a mixture policy of PG and MCTS so that each compensates for the other's weaknesses. We derive conditions for asymptotic convergence with results of a two-timescale stochastic approximation and propose an algorithm that satisfies these conditions. The effectiveness of the proposed methods is verified through numerical experiments on non-Markov decision processes.
    Deep Learning on Implicit Neural Datasets. (arXiv:2206.01178v1 [cs.LG])
    Implicit neural representations (INRs) have become fast, lightweight tools for storing continuous data, but to date there is no general method for learning directly with INRs as a data representation. We introduce a principled deep learning framework for learning and inference directly with INRs of any type without reverting to grid-based features or operations. Our INR-Nets evaluate INRs on a low discrepancy sequence, enabling quasi-Monte Carlo (QMC) integration throughout the network. We prove INR-Nets are universal approximators on a large class of maps between $L^2$ functions. Additionally, INR-Nets have convergent gradients under the empirical measure, enabling backpropagation. We design INR-Nets as a continuous generalization of discrete networks, enabling them to be initialized with pre-trained models. We demonstrate learning of INR-Nets on classification (INR$\to$label) and segmentation (INR$\to$INR) tasks.
    Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks. (arXiv:2202.00293v2 [stat.ML] UPDATED)
    Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
    Rare Gems: Finding Lottery Tickets at Initialization. (arXiv:2202.12002v2 [cs.LG] UPDATED)
    Large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by following a time-consuming "train, prune, re-train" approach. Frankle & Carbin conjecture that we can avoid this by training "lottery tickets", i.e., special sparse subnetworks found at initialization, that can be trained to high accuracy. However, a subsequent line of work by Frankle et al. and Su et al. presents concrete evidence that current algorithms for finding trainable networks at initialization, fail simple baseline comparisons, e.g., against training random sparse subnetworks. Finding lottery tickets that train to better accuracy compared to simple baselines remains an open problem. In this work, we resolve this open problem by proposing Gem-Miner which finds lottery tickets at initialization that beat current baselines. Gem-Miner finds lottery tickets trainable to accuracy competitive or better than Iterative Magnitude Pruning (IMP), and does so up to $19\times$ faster.
    Merlin-Arthur Classifiers: Formal Interpretability with Interactive Black Boxes. (arXiv:2206.00759v1 [cs.LG])
    We present a new theoretical framework for making black box classifiers such as Neural Networks interpretable, basing our work on clear assumptions and guarantees. In our setting, which is inspired by the Merlin-Arthur protocol from Interactive Proof Systems, two functions cooperate to achieve a classification together: the \emph{prover} selects a small set of features as a certificate and presents it to the \emph{classifier}. Including a second, adversarial prover allows us to connect a game-theoretic equilibrium to information-theoretic guarantees on the exchanged features. We define notions of completeness and soundness that enable us to lower bound the mutual information between features and class. To demonstrate good agreement between theory and practice, we support our framework by providing numerical experiments for Neural Network classifiers, explicitly calculating the mutual information of features with respect to the class.
    On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting. (arXiv:2206.00761v1 [cs.LG])
    The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM prescribes to first make explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability and sample efficiency.
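    The baseline being imported is the familiar variance-reduction device from policy gradients; in REINFORCE form it reads as follows, with the choice of a running-mean baseline as an assumption.

        import torch

        def pg_loss_with_baseline(log_probs, rewards, baseline):
            """log_probs: (n,) log-probabilities of sampled texts; rewards: (n,).
            Subtracting a baseline keeps the gradient unbiased but cuts variance."""
            advantage = (rewards - baseline).detach()   # e.g. a running mean of rewards
            return -(log_probs * advantage).mean()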
    BayesFormer: Transformer with Uncertainty Estimation. (arXiv:2206.00826v1 [cs.CL])
Transformer has become ubiquitous due to its dominant performance in various NLP and image processing tasks. However, there is limited understanding of how to generate mathematically grounded uncertainty estimates for transformer architectures. Models equipped with such uncertainty estimates can typically improve predictive performance, make networks robust, avoid over-fitting, and serve as acquisition functions in active learning. In this paper, we introduce BayesFormer, a Transformer model with dropouts designed by Bayesian theory. We propose a new theoretical framework to extend approximate variational inference-based dropout to Transformer-based architectures. Through extensive experiments, we validate the proposed architecture in four paradigms and show improvements across the board: language modeling and classification, long-sequence understanding, machine translation, and acquisition functions for active learning.
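As background, the dropout-as-variational-inference view that BayesFormer extends can be exercised in a few lines of PyTorch; the toy model and sample count below are illustrative assumptions, not the paper's architecture:

    import torch
    import torch.nn as nn

    # Toy network with dropout; any model containing dropout layers works the same way.
    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 1))

    def mc_dropout_predict(model, x, n_samples=50):
        """Keep dropout active at inference and aggregate stochastic forward passes."""
        model.train()  # enables dropout (in a real model, enable only dropout, not batch-norm updates)
        with torch.no_grad():
            preds = torch.stack([model(x) for _ in range(n_samples)])
        return preds.mean(0), preds.std(0)  # predictive mean and a per-input uncertainty estimate

    mean, std = mc_dropout_predict(model, torch.randn(8, 16))

The per-input spread is what gets reused downstream, e.g., as an acquisition score in active learning.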
    Comparing Conventional and Deep Feature Models for Classifying Fundus Photography of Hemorrhages. (arXiv:2206.01118v1 [eess.IV])
Diabetic retinopathy is an eye pathology that creates retinal abnormalities and causes visual impairment; proper treatment requires identifying these irregularities. This research uses a hemorrhage detection method and compares classification with conventional versus deep features. In particular, the method identifies hemorrhages that are connected to blood vessels or reside at the retinal border, cases that are reported to be challenging. Initially, adaptive brightness adjustment and contrast enhancement rectify degraded images. Prospective locations of hemorrhages are estimated by a Gaussian matched filter, entropy thresholding, and morphological operations. Hemorrhages are then segmented by a novel technique based on the regional variance of intensities. Finally, features are extracted by conventional methods and deep models to train support vector machines, and the results are evaluated. Evaluation metrics for each model are promising, but the findings suggest that deep models are comparatively more effective than conventional features.
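A rough sketch of two of the pipeline stages above (contrast enhancement and the Gaussian matched filter); CLAHE and all parameter values here are illustrative assumptions, not the paper's exact settings:

    import cv2
    import numpy as np

    img = cv2.imread("fundus.jpg")   # hypothetical input image
    green = img[:, :, 1]             # the green channel usually gives the best retinal contrast

    # Contrast enhancement (CLAHE is one common choice).
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(green)

    # Gaussian matched filter: correlate the image with a zero-mean 2D Gaussian template.
    k = cv2.getGaussianKernel(ksize=15, sigma=3.0)
    template = (k @ k.T).astype(np.float32)
    template -= template.mean()
    response = cv2.filter2D(enhanced.astype(np.float32), -1, template)

    # Prospective hemorrhage locations = strong filter responses (threshold illustrative).
    candidates = response > np.percentile(response, 99)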
    RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent Neural Networks. (arXiv:2106.08928v4 [cs.LG] UPDATED)
Recurrent neural networks (RNNs) are widely used throughout neuroscience as models of local neural activity. Many properties of single RNNs are well characterized theoretically, but experimental neuroscience has moved in the direction of studying multiple interacting areas, and RNN theory needs to be likewise extended. We take a constructive approach towards this problem, leveraging tools from nonlinear control theory and machine learning to characterize when combinations of stable RNNs will themselves be stable. Importantly, we derive conditions which allow for massive feedback connections between interacting RNNs. We parameterize these conditions for easy optimization using gradient-based techniques, and show that a stability-constrained `network of networks' can perform well on challenging sequential-processing benchmark tasks. Altogether, our results provide a principled approach towards understanding distributed, modular function in the brain.
    Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization. (arXiv:2206.00743v1 [math.OC])
Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability -- requiring no a priori knowledge about problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide such an example, proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such a mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balances the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda can achieve the near-optimal $\tilde{O}(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-4})$ gradient complexities respectively in the deterministic and stochastic settings, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
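The two-loop structure is easy to sketch on a toy saddle problem; the AdaGrad-style stepsizes and stopping tolerance below are illustrative, not the paper's exact NeAda algorithm:

    import numpy as np

    # Toy minimax: f(x, y) = 0.5*x**2 + x*y - 0.5*y**2 (strongly concave in y; saddle at (0, 0)).
    grad_x = lambda x, y: x + y
    grad_y = lambda x, y: x - y

    x, y = 3.0, -2.0
    Gx = Gy = 1e-8  # AdaGrad accumulators

    for outer in range(200):
        # Inner loop: adaptively maximize over y until its gradient is small.
        for _ in range(100):
            gy = grad_y(x, y)
            if abs(gy) < 1e-3:      # controllable stopping criterion
                break
            Gy += gy ** 2
            y += gy / np.sqrt(Gy)   # ascent step with adaptive stepsize
        # Outer step: adaptive descent on x at the (approximately) maximized y.
        gx = grad_x(x, y)
        Gx += gx ** 2
        x -= gx / np.sqrt(Gx)

    print(x, y)  # both approach 0

Note how neither loop needs the smoothness or strong-concavity constants; the gradient accumulators set the stepsizes.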
    Finding the Right Recipe for Low Resource Domain Adaptation in Neural Machine Translation. (arXiv:2206.01137v1 [cs.CL])
General translation models often still struggle to generate accurate translations in specialized domains. To guide machine translation practitioners and characterize the effectiveness of domain adaptation methods under different data availability scenarios, we conduct an in-depth empirical exploration of monolingual and parallel data approaches to domain adaptation of pre-trained, third-party NMT models in settings where architecture change is impractical. We compare data-centric adaptation methods in isolation and in combination. We study method effectiveness in very low resource (8k parallel examples) and moderately low resource (46k parallel examples) conditions and propose an ensemble approach to alleviate reductions in original domain translation quality. Our work includes three domains: consumer electronics, clinical, and biomedical, and spans four language pairs - Zh-En, Ja-En, Es-En, and Ru-En. We also make concrete recommendations for achieving high in-domain performance, release our consumer electronics and medical domain datasets for all languages, and make our code publicly available.
    Compositional Coding Capsule Network with K-Means Routing for Text Classification. (arXiv:1810.09177v5 [cs.LG] UPDATED)
Text classification is a challenging problem that aims to identify the category of a text. During training, word embeddings occupy a large share of the parameters, which, under limited computing resources, indirectly constrains subsequent network designs. To reduce the number of parameters, the compositional coding mechanism was recently proposed. Building on it, this paper further explores compositional coding and proposes a compositional weighted coding method. We also apply a capsule network to model the relationships between word embeddings, and propose a new routing algorithm based on k-means clustering theory to fully mine these relationships. Combining our compositional weighted coding method and the routing algorithm, we design a neural network for text classification. Experiments conducted on eight challenging text classification datasets show that the proposed method achieves accuracy competitive with the state-of-the-art approach while using significantly fewer parameters.
    On the Difficulty of Defending Self-Supervised Learning against Model Extraction. (arXiv:2205.07890v2 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) is an increasingly popular ML paradigm that trains models to transform complex inputs into representations without relying on explicit labels. These representations encode similarity structures that enable efficient learning of multiple downstream tasks. Recently, ML-as-a-Service providers have commenced offering trained SSL models over inference APIs, which transform user inputs into useful representations for a fee. However, the high cost involved to train these models and their exposure over APIs both make black-box extraction a realistic security threat. We thus explore model stealing attacks against SSL. Unlike traditional model extraction on classifiers that output labels, the victim models here output representations; these representations are of significantly higher dimensionality compared to the low-dimensional prediction scores output by classifiers. We construct several novel attacks and find that approaches that train directly on a victim's stolen representations are query efficient and enable high accuracy for downstream models. We then show that existing defenses against model extraction are inadequate and not easily retrofitted to the specificities of SSL.
    Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches. (arXiv:2202.13884v2 [q-bio.GN] UPDATED)
Feature embedding methods have been proposed in the literature to represent sequences as numeric vectors for use in bioinformatics investigations, such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlapping strings. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in preserving sequence similarities, suggesting it as a basis for defining novel representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide a theoretical and experimental framework to estimate the behaviour of fingerprints and of the $k$-mers extracted from them, called $k$-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads and to assign them to the most likely gene from which they originated as fragments of the gene's transcripts. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.
    Uncalibrated Models Can Improve Human-AI Collaboration. (arXiv:2202.05983v2 [cs.AI] UPDATED)
    In many practical applications of AI, an AI model is used as a decision aid for human users. The AI provides advice that a human (sometimes) incorporates into their decision-making process. The AI advice is often presented with some measure of "confidence" that the human can use to calibrate how much they depend on or trust the advice. In this paper, we demonstrate that human-AI performance can be improved by calibrating this confidence to the humans using the advice. In practice, this means presenting calibrated AI models as more or less confident than they actually are. We show empirically that this can improve human-AI performance (measured as the accuracy and confidence of the human's final prediction after seeing the AI advice). We first train a model to predict human incorporation of AI advice using data from thousands of human interactions. This enables us to explicitly estimate how to transform the AI's prediction confidence, making the AI uncalibrated, in order to improve the final human prediction. We empirically validate our results across four different tasks--dealing with images, text and tabular data--involving hundreds of human participants. We further support our findings with simulation analysis. Our findings suggest the importance of and a framework for jointly optimizing the human-AI system in contrast to the standard paradigm of optimizing the AI model alone.
    Sparse Mixed Linear Regression with Guarantees: Taming an Intractable Problem with Invex Relaxation. (arXiv:2206.01167v1 [cs.LG])
In this paper, we study the problem of sparse mixed linear regression on an unlabeled dataset that is generated from linear measurements from two different regression parameter vectors. Since the data is unlabeled, our task is not only to figure out a good approximation of the regression parameter vectors but also to label the dataset correctly. In its original form, this problem is NP-hard. The most popular algorithms to solve this problem (such as Expectation-Maximization) tend to get stuck at local minima. We provide a novel invex relaxation for this intractable problem which leads to a solution with provable theoretical guarantees. This relaxation enables exact recovery of data labels. Furthermore, we recover a close approximation of the regression parameter vectors which match the true parameter vectors in support and sign. Our formulation uses a carefully constructed primal-dual witness framework for the invex problem. Furthermore, we show that the sample complexity of our method is only logarithmic in terms of the dimension of the regression parameter vectors.
    Composition of Relational Features with an Application to Explaining Black-Box Predictors. (arXiv:2206.00738v1 [cs.LG])
    Relational machine learning programs like those developed in Inductive Logic Programming (ILP) offer several advantages: (1) The ability to model complex relationships amongst data instances; (2) The use of domain-specific relations during model construction; and (3) The models constructed are human-readable, which is often one step closer to being human-understandable. However, these ILP-like methods have not been able to capitalise fully on the rapid hardware, software and algorithmic developments fuelling current developments in deep neural networks. In this paper, we treat relational features as functions and use the notion of generalised composition of functions to derive complex functions from simpler ones. We formulate the notion of a set of $\text{M}$-simple features in a mode language $\text{M}$ and identify two composition operators ($\rho_1$ and $\rho_2$) from which all possible complex features can be derived. We use these results to implement a form of "explainable neural network" called Compositional Relational Machines, or CRMs, which are labelled directed-acyclic graphs. The vertex-label for any vertex $j$ in the CRM contains a feature-function $f_j$ and a continuous activation function $g_j$. If $j$ is a "non-input" vertex, then $f_j$ is the composition of features associated with vertices in the direct predecessors of $j$. Our focus is on CRMs in which input vertices (those without any direct predecessors) all have $\text{M}$-simple features in their vertex-labels. We provide a randomised procedure for constructing and learning such CRMs. Using a notion of explanations based on the compositional structure of features in a CRM, we provide empirical evidence on synthetic data of the ability to identify appropriate explanations; and demonstrate the use of CRMs as 'explanation machines' for black-box models that do not provide explanations for their predictions.
    Combining machine learning with physics: A framework for tracking and sorting multiple dark solitons. (arXiv:2111.04881v2 [cond-mat.quant-gas] UPDATED)
    In ultracold-atom experiments, data often comes in the form of images which suffer information loss inherent in the techniques used to prepare and measure the system. This is particularly problematic when the processes of interest are complicated, such as interactions among excitations in Bose-Einstein condensates (BECs). In this paper, we describe a framework combining machine learning (ML) models with physics-based traditional analyses to identify and track multiple solitonic excitations in images of BECs. We use an ML-based object detector to locate the solitonic excitations and develop a physics-informed classifier to sort solitonic excitations into physically motivated subcategories. Lastly, we introduce a quality metric quantifying the likelihood that a specific feature is a longitudinal soliton. Our trained implementation of this framework, SolDet, is publicly available as an open-source python package. SolDet is broadly applicable to feature identification in cold-atom images when trained on a suitable user-provided dataset.
    Dynamic Privacy Budget Allocation Improves Data Efficiency of Differentially Private Gradient Descent. (arXiv:2101.07413v2 [cs.LG] UPDATED)
Protecting privacy in learning while maintaining the model performance has become increasingly critical in many applications that involve sensitive data. A popular private learning framework is differentially private learning composed of many privatized gradient iterations by noising and clipping. Under the privacy constraint, it has been shown that dynamic policies can improve the final iterate loss, namely the quality of published models. In this talk, we introduce these dynamic techniques for the learning rate, batch size, noise magnitude and gradient clipping. We also discuss how a dynamic policy can change the convergence bounds, which further provides insight into the impact of dynamic methods.
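A minimal sketch of one privatized iteration with dynamic clipping, noise, and learning-rate policies; the schedules below are illustrative placeholders, and any real deployment needs a privacy accountant to track the resulting guarantee:

    import numpy as np

    def dp_sgd_update(per_example_grads, t, lr0=0.1, clip0=1.0, sigma0=1.0):
        """One noisy, clipped gradient step with simple (made-up) dynamic policies."""
        clip = clip0 / np.sqrt(1 + 0.01 * t)    # decaying clipping norm
        sigma = sigma0 * np.sqrt(1 + 0.01 * t)  # growing noise multiplier
        lr = lr0 / np.sqrt(1 + t)               # decaying learning rate

        # Clip each example's gradient to norm <= clip, then average.
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip / norms)
        mean_grad = clipped.mean(axis=0)

        # Gaussian mechanism: noise scaled to the per-example sensitivity clip / batch size.
        noise = np.random.normal(0.0, sigma * clip / len(per_example_grads), mean_grad.shape)
        return -lr * (mean_grad + noise)  # parameter update to add to the weights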
    SanitAIs: Unsupervised Data Augmentation to Sanitize Trojaned Neural Networks. (arXiv:2109.04566v3 [cs.LG] UPDATED)
    Self-supervised learning (SSL) methods have resulted in broad improvements to neural network performance by leveraging large, untapped collections of unlabeled data to learn generalized underlying structure. In this work, we harness unsupervised data augmentation (UDA), an SSL technique, to mitigate backdoor or Trojan attacks on deep neural networks. We show that UDA is more effective at removing trojans than current state-of-the-art methods for both feature space and point triggers, over a range of model architectures, trojans, and data quantities provided for trojan removal. These results demonstrate that UDA is both an effective and practical approach to mitigating the effects of backdoors on neural networks.
    On the Generalization of Neural Combinatorial Optimization Heuristics. (arXiv:2206.00787v1 [cs.LG])
Neural Combinatorial Optimization approaches have recently leveraged the expressiveness and flexibility of deep neural networks to learn efficient heuristics for hard Combinatorial Optimization (CO) problems. However, most of the current methods lack generalization: for a given CO problem, heuristics which are trained on instances with certain characteristics underperform when tested on instances with different characteristics. While some previous works have focused on varying the properties of training instances, we postulate that a one-size-fits-all model is out of reach. Instead, we formalize solving a CO problem over a given instance distribution as a separate learning task and investigate meta-learning techniques to learn a model on a variety of tasks, in order to optimize its capacity to adapt to new tasks. Through extensive experiments on two CO problems, using both synthetic and realistic instances, we show that our proposed meta-learning approach significantly improves the generalization of two state-of-the-art models.
    DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks. (arXiv:2206.00843v1 [cs.LG])
    Efficient deep neural network (DNN) models equipped with compact operators (e.g., depthwise convolutions) have shown great potential in reducing DNNs' theoretical complexity (e.g., the total number of weights/operations) while maintaining a decent model accuracy. However, existing efficient DNNs are still limited in fulfilling their promise in boosting real-hardware efficiency, due to their commonly adopted compact operators' low hardware utilization. In this work, we open up a new compression paradigm for developing real-hardware efficient DNNs, leading to boosted hardware efficiency while maintaining model accuracy. Interestingly, we observe that while some DNN layers' activation functions help DNNs' training optimization and achievable accuracy, they can be properly removed after training without compromising the model accuracy. Inspired by this observation, we propose a framework dubbed DepthShrinker, which develops hardware-friendly compact networks via shrinking the basic building blocks of existing efficient DNNs that feature irregular computation patterns into dense ones with much improved hardware utilization and thus real-hardware efficiency. Excitingly, our DepthShrinker framework delivers hardware-friendly compact networks that outperform both state-of-the-art efficient DNNs and compression techniques, e.g., a 3.06\% higher accuracy and 1.53$\times$ throughput on Tesla V100 over SOTA channel-wise pruning method MetaPruning. Our codes are available at: https://github.com/RICE-EIC/DepthShrinker.
    Collaborative Learning of Distributions under Heterogeneity and Communication Constraints. (arXiv:2206.00707v1 [stat.ML])
In modern machine learning, users often have to collaborate to learn distributions that generate the data. Communication can be a significant bottleneck. Prior work has studied homogeneous users -- i.e., whose data follow the same discrete distribution -- and has provided optimal communication-efficient methods. However, these methods rely heavily on homogeneity, and are less applicable in the common case when users' discrete distributions are heterogeneous. Here we consider a natural and tractable model of heterogeneity, where users' discrete distributions only vary sparsely, on a small number of entries. We propose a novel two-stage method named SHIFT: First, the users collaborate by communicating with the server to learn a central distribution, relying on methods from robust statistics. Then, the learned central distribution is fine-tuned to estimate the individual distributions of users. We show that SHIFT is minimax optimal in our model of heterogeneity and under communication constraints. Further, we provide experimental results using both synthetic data and $n$-gram frequency estimation in the text domain, which corroborate its efficiency.  ( 2 min )
    Indeterminacy in Latent Variable Models: Characterization and Strong Identifiability. (arXiv:2206.00801v1 [stat.ML])
    Most modern latent variable and probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Recent applications of such models have indicated the need for \textit{strongly} identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, most notably by the iVAE (arXiv:1907.04809 [stat.ML]), which excludes many -- but not all -- indeterminacies. We construct a full theoretical framework for analyzing the indeterminacies of latent variable models, and characterize them precisely in terms of properties of the generator functions and the latent variable prior distributions. To illustrate, we apply the framework to better understand the structure of recent identifiability results. We then investigate how we might specify strongly identifiable latent variable models, and construct two such classes of models. One is a straightforward modification of iVAE; the other uses ideas from optimal transport and leads to novel models and connections to recent work.  ( 2 min )
    Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning. (arXiv:2206.00796v1 [cs.LG])
The $Q$-learning algorithm is a simple and widely-used stochastic approximation scheme for reinforcement learning, but the basic protocol can exhibit instability in conjunction with function approximation. Such instability can be observed even with linear function approximation. In practice, tools such as target networks and experience replay appear to be essential, but the individual contribution of each of these mechanisms is not well understood theoretically. This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation. Our modular analysis illustrates the role played by each algorithmic tool that we adopt: a second order update rule, a set of target networks, and a mechanism akin to experience replay. Together, they enable state-of-the-art regret bounds on linear MDPs while preserving the most prominent feature of the algorithm, namely a space complexity independent of the number of steps elapsed. We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error. The algorithm also exhibits a form of instance-dependence, in that its performance depends on the "effective" feature dimension.  ( 2 min )
    Bayesian Inference of Stochastic Dynamical Networks. (arXiv:2206.00858v1 [stat.ML])
Network inference has been extensively studied in several fields, such as systems biology and the social sciences. Learning network topology and internal dynamics is essential to understanding the mechanisms of complex systems. In particular, sparse topologies and stable dynamics are fundamental features of many real-world continuous-time networks. Given that usually only a partial set of nodes can be observed, in this paper we consider linear continuous-time systems to depict networks, since they can model unmeasured nodes via transfer functions. Additionally, measurements tend to be noisy and collected at low and varying sampling frequencies. For this reason, we consider continuous-time (CT) models, since discrete-time approximations often require fine-grained measurements and uniform sampling steps. The developed method applies dynamical structure functions (DSFs) derived from linear stochastic differential equations (SDEs) to describe networks of measured nodes. Further, a numerical sampling method, preconditioned Crank-Nicolson (pCN), is used to refine coarse-grained trajectories to improve inference accuracy. Simulations conducted on random and ring networks, and on a synthetic biological network, illustrate that our method achieves state-of-the-art performance compared with group sparse Bayesian learning (GSBL), BINGO, kernel-based methods, dynGENIE3, GENIE3 and ARNI. In particular, these are challenging networks, suggesting that the developed method can be applied under a wide range of contexts.  ( 2 min )
    Coordinated Double Machine Learning. (arXiv:2206.00885v1 [stat.ML])
    Double machine learning is a statistical method for leveraging complex black-box models to construct approximately unbiased treatment effect estimates given observational data with high-dimensional covariates, under the assumption of a partially linear model. The idea is to first fit on a subset of the samples two non-linear predictive models, one for the continuous outcome of interest and one for the observed treatment, and then to estimate a linear coefficient for the treatment using the remaining samples through a simple orthogonalized regression. While this methodology is flexible and can accommodate arbitrary predictive models, typically trained independently of one another, this paper argues that a carefully coordinated learning algorithm for deep neural networks may reduce the estimation bias. The improved empirical performance of the proposed method is demonstrated through numerical experiments on both simulated and real data.  ( 2 min )
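For context, the standard uncoordinated, cross-fitted double ML estimate that the paper starts from fits in a few lines; the nuisance models below are illustrative choices:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    def double_ml(X, T, Y):
        """Cross-fitted double ML for the partially linear model Y = theta*T + g(X) + noise."""
        idx1, idx2 = train_test_split(np.arange(len(Y)), test_size=0.5, random_state=0)
        estimates = []
        for fit_idx, est_idx in [(idx1, idx2), (idx2, idx1)]:
            # Nuisance models: outcome given covariates, and treatment given covariates.
            m_y = RandomForestRegressor().fit(X[fit_idx], Y[fit_idx])
            m_t = RandomForestRegressor().fit(X[fit_idx], T[fit_idx])
            # Orthogonalized (residual-on-residual) regression on the held-out half.
            ry = Y[est_idx] - m_y.predict(X[est_idx])
            rt = T[est_idx] - m_t.predict(X[est_idx])
            estimates.append((rt @ ry) / (rt @ rt))
        return np.mean(estimates)  # cross-fitted estimate of the treatment effect theta

The coordination the paper proposes replaces the two independently trained nuisance models with jointly trained deep networks.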
    (Machine) Learning What Policies Value. (arXiv:2206.00727v1 [econ.GN])
    When a policy prioritizes one person over another, is it because they benefit more, or because they are preferred? This paper develops a method to uncover the values consistent with observed allocation decisions. We use machine learning methods to estimate how much each individual benefits from an intervention, and then reconcile its allocation with (i) the welfare weights assigned to different people; (ii) heterogeneous treatment effects of the intervention; and (iii) weights on different outcomes. We demonstrate this approach by analyzing Mexico's PROGRESA anti-poverty program. The analysis reveals that while the program prioritized certain subgroups -- such as indigenous households -- the fact that those groups benefited more implies that they were in fact assigned a lower welfare weight. The PROGRESA case illustrates how the method makes it possible to audit existing policies, and to design future policies that better align with values.  ( 2 min )
    Federated Learning in Non-IID Settings Aided by Differentially Private Synthetic Data. (arXiv:2206.00686v1 [cs.LG])
Federated learning (FL) is a privacy-promoting framework that enables a potentially large number of clients to collaboratively train machine learning models. In an FL system, a server coordinates the collaboration by collecting and aggregating clients' model updates while the clients' data remains local and private. A major challenge in federated learning arises when the local data is heterogeneous -- the setting in which performance of the learned global model may deteriorate significantly compared to the scenario where the data is identically distributed across the clients. In this paper we propose FedDPMS (Federated Differentially Private Means Sharing), an FL algorithm in which clients deploy variational auto-encoders to augment local datasets with data synthesized using differentially private means of latent data representations communicated by a trusted server. Such augmentation ameliorates effects of data heterogeneity across the clients without compromising privacy. Our experiments on deep image classification tasks demonstrate that FedDPMS outperforms competing state-of-the-art FL methods specifically designed for heterogeneous data settings.  ( 2 min )
    How Biased is Your Feature?: Computing Fairness Influence Functions with Global Sensitivity Analysis. (arXiv:2206.00667v1 [cs.LG])
    Fairness in machine learning has attained significant focus due to the widespread application of machine learning in high-stake decision-making tasks. Unless regulated with a fairness objective, machine learning classifiers might demonstrate unfairness/bias towards certain demographic populations in the data. Thus, the quantification and mitigation of the bias induced by classifiers have become a central concern. In this paper, we aim to quantify the influence of different features on the bias of a classifier. To this end, we propose a framework of Fairness Influence Function (FIF), and compute it as a scaled difference of conditional variances in the prediction of the classifier. We also instantiate an algorithm, FairXplainer, that uses variance decomposition among the subset of features and a local regressor to compute FIFs accurately, while also capturing the intersectional effects of the features. Our experimental analysis validates that FairXplainer captures the influences of both individual features and higher-order feature interactions, estimates the bias more accurately than existing local explanation methods, and detects the increase/decrease in bias due to affirmative/punitive actions in the classifier.  ( 2 min )
    Bayesian Learning to Discover Mathematical Operations in Governing Equations of Dynamic Systems. (arXiv:2206.00669v1 [cs.LG])
Discovering governing equations from data is critical for diverse scientific disciplines as they can provide insights into the underlying phenomena of dynamic systems. This work presents a new representation for governing equations by designing the Mathematical Operation Network (MathONet) with a deep neural network-like hierarchical structure. Specifically, the MathONet is stacked by several layers of unary operations (e.g., sin, cos, log) and binary operations (e.g., +,-), respectively. An initialized MathONet is typically regarded as a super-graph with a redundant structure, a sub-graph of which can yield the governing equation. We develop a sparse group Bayesian learning algorithm to extract the sub-graph by employing structurally constructed priors over the redundant mathematical operations. Demonstrated on the chaotic Lorenz system, the Lotka-Volterra system, and the Kolmogorov-Petrovsky-Piskunov system, the proposed method can discover the ordinary differential equations (ODEs) and partial differential equations (PDEs) from observations given limited mathematical operations, without any prior knowledge of possible expressions of the ODEs and PDEs.  ( 2 min )

  • Open

    Can anyone help me with this brainstormed idea? I'm very new to RL.
Hello all, As the title states, I am really new to RL. I have been working on one project and I'll need to create a custom environment. The environment will not be made from an image with pixels. Instead, the environment will be constructed out of nodes and edges--a network. They will be defined by their relationship to each other (i.e., edge 1 connects nodes A and B). The agent will travel from node to node along edges. How might I get that to work? I already have the data. submitted by /u/professorDissociate [link] [comments]  ( 1 min )
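One common pattern for the question above is to wrap the graph in a custom gym.Env whose action space indexes the current node's neighbors; everything below (reward, start node, action-space padding, the classic pre-0.26 gym API) is a placeholder sketch rather than a full solution:

    import gym
    import networkx as nx
    from gym import spaces

    class GraphEnv(gym.Env):
        """Agent walks from node to node along the edges of a fixed graph."""
        def __init__(self, graph, goal):
            self.g, self.goal = graph, goal
            self.nodes = list(graph.nodes)
            self.observation_space = spaces.Discrete(len(self.nodes))
            # Action = index into the current node's neighbor list, padded to the max degree.
            self.action_space = spaces.Discrete(max(dict(graph.degree).values()))

        def reset(self):
            self.current = self.nodes[0]          # fixed start node (placeholder)
            return self.nodes.index(self.current)

        def step(self, action):
            nbrs = list(self.g.neighbors(self.current))
            if action < len(nbrs):                # invalid actions leave the agent in place
                self.current = nbrs[action]
            done = self.current == self.goal
            reward = 1.0 if done else -0.01       # placeholder reward shaping
            return self.nodes.index(self.current), reward, done, {}

    env = GraphEnv(nx.cycle_graph(6), goal=3)
    obs = env.reset()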
    Does env.reset openai gym randomly reinitialize the environment?
    submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    Anyone know any accessible guides to using TF-agents for bandit based problems?
    I know TensorFlow has some well documented tutorials but I am getting confused and stuck with developing for my side project. Ideally I would like to chat with someone on Discord to help me out. Thanks! submitted by /u/WirrryWoo [link] [comments]  ( 1 min )
    PyBullet objects act abnormally and I don't know why
Hello I am fairly new with pybullet and I am having some issues with my simulation. Basically, there is a cube on a table and as soon as a robot tries to touch it the cube acts abnormally and "enters" the table. I will attach a small movie so you can see it. https://reddit.com/link/v2wtcz/video/hixholvlx3391/player Link for .urdf and .obj of TABLE Link for .urdf and .obj of CUBE I am using pybullet in python, let me know if you would want me to share more information. Thanks! submitted by /u/gabrigoo [link] [comments]  ( 1 min )
    Where do you intern?
I am an RL guy, and I've found it hard to get an RL internship. Only a few really big companies offer them, like Microsoft, NVIDIA, Google, Tesla, etc. Are there any other opportunities at not-so-big companies where I could find an RL internship? submitted by /u/Blasphemer666 [link] [comments]  ( 1 min )
  • Open

    Amazon SageMaker Notebook Instances now support configuring and restricting IMDS versions
    Today, we’re excited to announce that Amazon SageMaker now supports the ability to configure Instance Metadata Service Version 2 (IMDSv2) for Notebook Instances, and for administrators to control the minimum version with which end-users create new Notebook Instances. You can now choose IMDSv2 only for your new and existing SageMaker Notebook Instances to take advantage […]  ( 6 min )
    Reimagine search on GitHub repositories with the power of the Amazon Kendra GitHub connector
    Amazon Kendra offers highly accurate semantic and natural language search powered by machine learning (ML). Many organizations use GitHub as a code hosting platform for version control and to redefine collaboration of open-source software projects. A GitHub account repository might include many content types, such as files, issues, issue comments, issue comment attachments, pull requests, […]  ( 8 min )
    Merge cells and column headers in Amazon Textract tables
    Financial documents such as bank, loan, or mortgage statements are often formatted to be visually appealing and easy to read for the human eye. These same features can also make automated processing challenging at times. For instance, in the following sample statement, merging rows or columns in a table helps reduce information redundancy, but it […]  ( 5 min )
    Detect financial transaction fraud using a Graph Neural Network with Amazon SageMaker
    Fraud plagues many online businesses and costs them billions of dollars each year. Financial fraud, counterfeit reviews, bot attacks, account takeovers, and spam are all examples of online fraud and malicious behaviors. Although many businesses take approaches to combat online fraud, these existing approaches can have severe limitations. First, many existing methods aren’t sophisticated or […]  ( 10 min )
  • Open

    [D] Looking for recommended papers on document Key-Value extraction models
    Looking for recommended papers on general document Key-value extraction. My searches are coming up mostly with papers that rely on highly domain-specific heuristics. Any good places to start my search would be appreciated. submitted by /u/piccalillihighlands [link] [comments]  ( 1 min )
    [P] What if AB testing is impossible to setup? I wrote a blog to measure impact using backdoor adjustment, a type of causal analysis
To ensure that every feature has a measurable impact on the broader platform, my team will set up and run A/B testing on each new feature or product change, but what happens when a new feature needs to be released quickly and there is not enough time for a traditional testing approach? To make sure that these quick changes could still be measured, I found a way to perform accurate pre-post analysis using a backdoor adjustment, a type of causal analysis. I wanted to share my findings with the community, as this approach helped my team at DoorDash make quick bug fixes and still measure the impact. Please check out the article to get the technical details and provide any feedback on my approach. https://doordash.engineering/2022/06/02/using-back-door-adjustment-causal-analysis-to-measure-pre-post-effects/ submitted by /u/tripleespresso7 [link] [comments]  ( 1 min )
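For readers unfamiliar with the technique, the adjustment itself is just $E[Y \mid do(T=t)] = \sum_z P(Z=z)\,E[Y \mid T=t, Z=z]$ for a sufficient adjustment set $Z$; a minimal pandas sketch with made-up column names, assuming a binary treatment, a single discrete confounder, and every (T, Z) cell observed:

    import pandas as pd

    def backdoor_effect(df, treatment="T", outcome="Y", confounder="Z"):
        """E[Y | do(T=1)] - E[Y | do(T=0)] by adjusting for the confounder Z."""
        pz = df[confounder].value_counts(normalize=True)          # P(Z = z)
        ey = df.groupby([treatment, confounder])[outcome].mean()  # E[Y | T = t, Z = z]
        do = lambda t: sum(pz[z] * ey.loc[(t, z)] for z in pz.index)
        return do(1) - do(0)

Unlike a naive pre-post difference, this weights the stratified means by the confounder's marginal distribution, which is what removes the back-door bias.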
    [D] Building a AI training cluster
Hey there. We are about to buy hardware for training models. I'm eyeing an HP DL580 G8 because of a good deal on it. We also plan to scale up and add more servers to create a compute cluster. The server comes with 4 CPUs and 512 GB of RAM; this will certainly not all be used to train one model. The question here becomes what the best solution is: install VMware, Proxmox, or TrueNAS for virtualisation and pass the GPU through to the VM (or does this limit the training in any way?), or rather just install Ubuntu Server on it? Any recommendations on what hypervisor/OS to use, or resources about how other people set up their servers for machine learning? Thanks in advance. submitted by /u/Joytimmermans [link] [comments]  ( 1 min )
    [R] Towards artificial general intelligence via a multimodal foundation model (Nature)
This is published in Nature, so supposedly more notable than yet another multimodal experiment. But the way the article presents the results leaves me confused about how this compares and contrasts to e.g. DeepMind Gato? https://www.nature.com/articles/s41467-022-30761-2 submitted by /u/valdanylchuk [link] [comments]  ( 2 min )
    [D] Recommendation System based on DNN (Softmax Model)
So, I was going through Recommendation system google's colab and couldn't understand why, for training the matrix factorization model, we give both user embeddings and movie embeddings to the CFModel class, but in the case of the Softmax model we only give movie embeddings. Why not user embeddings too? I am putting only the relevant part of the code; the full code can be seen in the link.

Matrix Factorization Model:

    embeddings = {
        "user_id": U,
        "movie_id": V
    }
    return CFModel(embeddings, train_loss, [metrics])

Softmax Model:

    metrics = (
        {"train_loss": train_loss, "test_loss": test_loss},
        {"test_precision_at_10": test_precision_at_10}
    )
    embeddings = {"movie_id": movie_embeddings}
    return CFModel(embeddings, train_loss, metrics)

submitted by /u/Expensive_build [link] [comments]  ( 1 min )
[N] FAIR gets "decentralized"
    Announcement: https://ai.facebook.com/blog/building-with-ai-across-all-of-meta/ In the new model we will distribute the ownership of these AI systems back to Meta’s product groups. But we do so with the caveat that they must invest in a balanced portfolio that supports existing systems while also advancing the state of the art in AI. submitted by /u/MassivePellfish [link] [comments]  ( 1 min )
    [D] Pretrained SOTA Medical Classification Models for Download
    I'm doing some research that requires a state of the art medical image classifier for comparison. Can anyone point me in the direction of a performant classifier that has been trained on medical images (any type of medical image e.g. CT scan, x-ray, histopathology, etc.)? submitted by /u/i_wasserman [link] [comments]
    [P] Live Video Inference
Hi everyone, I have a dataset of annotated videos for binary classification, i.e. each video just has one label: positive or negative. Right now I have a trained model in place for taking a complete video as input and outputting a classification. So far so good. Now the clients I'm working with want a feature for inference on live videos as well. I'm sort of at a loss on how to achieve this with the dataset I'm given, since live video inference would entail inferring frame by frame and I don't actually have any labelled frames (only the entire video is given a label). So what I'm thinking is that I have to somehow figure out which frames in the positively-labelled videos are the ones that actually make the video positive. Is there a theory for this kind of problem? Here's what I've considered so far: 1) Use an anomaly detection algorithm to figure out which frames are anomalous compared to the others, and use those frames as positive frames for training the frame-by-frame model. 2) This is a long shot, but take out one frame at a time from the positive videos and infer on the new video with that frame missing, then see which frame(s) make the confidence score for positivity drop the most after being taken out. I'm sure there are better ideas out there. Anyone have any suggestions or papers I could look into about this kind of problem? submitted by /u/diningeachox [link] [comments]  ( 1 min )
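A rough sketch of idea (2) from the post; `model` here is a hypothetical callable returning the positive-class probability for a variable-length clip, not a specific library API:

    import numpy as np

    def frame_saliency(model, frames):
        """Score each frame by how much removing it lowers the positive confidence."""
        base = model(frames)                        # P(positive) on the full clip
        drops = []
        for i in range(len(frames)):
            ablated = np.delete(frames, i, axis=0)  # clip with frame i removed
            drops.append(base - model(ablated))     # large drop => frame i drives positivity
        return np.array(drops)

The highest-scoring frames could then serve as pseudo-labels for training a frame-level classifier.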
    [N] [P] Machine Learning in dbt DAG
Today, we are super excited to announce our open-source Layer DBT Bigquery Adapter which runs ML pipelines inside dbt DAG with BigQuery (more DWH support coming soon...) as the backing data warehouse. It's in a very early stage but we wanted to share it with you to get your feedback.

    SELECT id,
           layer.predict("layer/clothing/models/objectdetection", ARRAY[image])
    FROM {{ ref("products") }}

With the Layer dbt Adapter you can: score your data with a machine learning model from Layer with SQL; train an AutoML model with your data [coming soon...]; train a custom machine learning model with your data [coming soon...]. Dive into the dbt examples: Predicting survivals of Titanic - an end-to-end ML pipeline (feature extraction + scoring) which predicts the survivals of the Titanic disaster; Sentiment analysis of product reviews - an example that shows how to do multi-language sentiment analysis; Object detection in product images - detects clothes from product images using a pretrained computer vision model. We would love your feedback on our new open-source tool! It will highly influence our roadmap. Thank you! submitted by /u/mehmetecevit [link] [comments]  ( 1 min )
    [P] A domain adaptation library that I wrote: PyTorch Adapt
I wrote a library for domain adaptation, which is a type of machine learning that repurposes existing models to work in new domains. A toy example is adapting a model trained on MNIST for use on colored digits. The library is modular, so you can drop an algorithm into your training for-loop like this:

    hook = DANNHook(optimizers)
    for data in tqdm(dataloader):
        data = batch_to_device(data, device)
        # Optimization is done inside the hook.
        # The returned loss is for logging.
        _, loss = hook({**models, **data})

One challenge is that domain adaptation algorithms come in many different forms, making it difficult to write hooks that can be re-used across multiple algorithms. Suppose you want to add a loss function, "Foo", to the DANN algorithm. You want to apply Foo to logits from both source an…  ( 1 min )
    [Project] BFLOAT16 on ALL hardware (>= 2009), up to 2000x faster ML algos, 50% less RAM usage for all old/new hardware - Hyperlearn Reborn.
    Hello everyone!! It's been a while!! Years back I released Hyperlearn https://github.com/danielhanchen/hyperlearn. It has 1.2K Github stars, where I made tonnes of algos faster. I was a bit busy back at NVIDIA and my startup, and I've been casually developing some algos. The question is are people still interested in fast algorithms? Does anyone want to collaborate on reviving Hyperlearn? (Or making a NEW package?) Note the current package is ahhh A MESSS... I'm fixing it - sit tight!! NEW algos for release: PCA with 50% less memory usage with ZERO data corruption!! (Maths tricks :)) (ie no need to do X - X.mean()!!!)) How you may ask???! Randomized PCA with 50% less memory usage (ie no need to do X - X.mean()). Linear Regression is EVEN faster with now Pivoted Cholesky making algo …  ( 4 min )
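The centering-free PCA trick mentioned above presumably rests on the identity $\mathrm{Cov}(X) = \frac{1}{n}X^\top X - \mu\mu^\top$, which never materializes a centered copy of $X$; a minimal numpy sketch of that identity (my reading of the post, not Hyperlearn's actual code):

    import numpy as np

    X = np.random.randn(100_000, 50)

    # Standard PCA would materialize a centered copy: Xc = X - X.mean(0)  (2x the memory).
    # Instead, build the covariance straight from the Gram matrix and the mean:
    mu = X.mean(axis=0)
    cov = (X.T @ X) / len(X) - np.outer(mu, mu)  # Cov(X), no centered copy of X

    eigvals, eigvecs = np.linalg.eigh(cov)       # principal axes
    components = eigvecs[:, ::-1][:, :10]        # top-10 principal components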
    [P] TinyML: Slope control for Robots with Arduino and Neuton AI
Today smart household appliances can be found in almost every home as they greatly simplify our daily routine. A vivid example of such a gadget is a vacuum cleaner robot, representing a concentration of technology: a complex embedded system composed of some microcontrollers, many sensors, and a lot of… software! But how many times does your little helper behave stupidly and block itself over obstacles? I found a solution to this problem and implemented an inclination estimator system based on an accelerometer using a TinyML model on the Arduino's Nicla Sense ME. In my tutorial, I'll provide step-by-step guidelines on how to set up Nicla and approach the task in two ways (by building regression and multiclassification models) using a free TinyML framework that allows you to automatically build neural networks and deploy them on small computing devices, such as Nicla. Check out the full version of my experiment here: https://www.hackster.io/leonardocavagnis/tinyml-slope-control-for-robots-with-arduino-pro-485061 submitted by /u/Leo_Cav [link] [comments]  ( 1 min )
    [P] How do I do preprocessing on a flutter app
    So me and my team are making an audio classification app for android. We used a python backend connected to the flutter app for the actual classification part, but now we want to get rid of that and do it all in flutter with tflite. Problem is, we relied on librosa for our data preprocessing (getting a mel spectrogram) and we can't find any libraries to get mel spectrograms in flutter. Does anyone here know of one? Or can recommend another way to preprocess for our tflite model? submitted by /u/initiald-ejavu [link] [comments]  ( 2 min )
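For reference, the librosa side that would need replicating in Dart is roughly the following (parameters illustrative); matching librosa's exact defaults (STFT window, mel filterbank, dB scaling) on the Flutter side is usually the hard part:

    import librosa
    import numpy as np

    y, sr = librosa.load("clip.wav", sr=22050)  # hypothetical input file
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-scaled, as most models expect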
    [D] Inputs on scalable cost effective pipeline
Hi, all. I have multiple deep learning/machine learning/naive-based tasks that I want to deploy online through an API. I have been trying to figure out the best way to do it for some time, but I am overwhelmed by the number of different frameworks and packages available on AWS and GCP. Multiple tools on both platforms seem to have overlapping responsibilities with unclear limitations, making it hard to choose. I want to obtain a scalable pipeline that saves as much money as possible (using, for example, spot pricing) and is easily expandable with new components. My idea was to use Celery and create a task for each different data processing method I have. The APIs would simply add an entry into Celery's queue, and the workers would take care of the rest. Scaling the pipeline up or down would then be just a matter of adding or removing Celery workers. How would you approach the problem? Do you know of any resources worth reading to build an architecture like this? Is there any particular service on AWS or GCP that would allow me to easily take care of this task? submitted by /u/assassin_canederlo [link] [comments]  ( 2 min )
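A minimal version of the Celery layout described in the post; the broker URL and task body are placeholders:

    from celery import Celery

    # Broker/backend URLs are placeholders; any Redis or RabbitMQ endpoint works.
    app = Celery("pipeline",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/0")

    @app.task
    def run_inference(model_name, payload):
        """One task per processing method; scaling = adding/removing workers."""
        # load the model and run the prediction here (placeholder)
        return {"model": model_name, "status": "done"}

    # The API endpoint just enqueues and returns immediately:
    # run_inference.delay("sentiment", {"text": "hello"})

On the cost side, workers can run on spot/preemptible instances, since with acks_late enabled a killed task is simply re-queued for another worker.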
    [D] Can LLMs be updated, e.g. to follow the daily news? Are there any such regularly updated models publicly known yet?
Some context can be provided in the prompt, but for the bigger picture it is insufficient. I understand companies will not release anything like it until they solve the bias/censorship issues somehow, but did anyone mention an internal demo, or a project in progress? Or is there a lower-scale open source experimental project? It would be so much fun to get some summaries or Q&A on the current events, latest science/tech developments, etc. Edit (thanks u/adt): There are projects trying to connect a language model to the Internet and/or some add-on memory for facts. For example, WebGPT (which might be on its way to a product launch), BlenderBot 2.0 by Meta, and Jurassic-X by AI21. submitted by /u/valdanylchuk [link] [comments]  ( 2 min )
    [D] Uncertainty quantification
Hi all, here is a link to an article I wrote on uncertainty in machine learning, in which I explain how to quantify uncertainty using uq360 metamodels. https://medium.com/total-digital-factory/how-to-add-confidence-to-your-machine-learning-models-b1228217858e Feel free to reply and suggest any changes submitted by /u/islem75ds [link] [comments]
  • Open

    Subreddits for text to image discussion? (Bees playing volleyball at the beach)
    submitted by /u/meromachin [link] [comments]
In this article we show you how to use the BERT transformer with spaCy 3 to train a joint entity and relation extraction classifier
    submitted by /u/UBIAI [link] [comments]
    Is the Shazam AI an example of supervised or unsupervised learning?
I think supervised, but I'm not sure. submitted by /u/JustinFieldsBurner [link] [comments]
    Neuralink Update – May 2022 (Part 2)
    submitted by /u/1024cities [link] [comments]
    Best books for CV/ CV applications in Retail
Experts here, please help me with the best books written on Computer Vision applications in retail. Planning to get the hang of the capabilities of the current state of CV technology. Thanks folks! submitted by /u/blazingkrishna [link] [comments]
    Have an .EDU address? Free Deep Learning Cert from Nvidia!
Hey fellow Redditors, Just thought we would share a cool invite from Nvidia! During the month of June, Nvidia is offering all students & educators a free Deep Learning workshop with a certificate of completion. NVIDIA DEEP LEARNING INSTITUTE submitted by /u/PSSC-Labs [link] [comments]
    "Eat bug." - Increasingly Verbose bot with GPT-3
Original input: Eat bug. Final output: Did you know that eating bugs is actually really good for you? You may think that sounds gross, but bugs are packed with nutrients and are a great source of protein. Plus, they are sustainable and environmentally-friendly. So why not give it a try? There are lots of ways to eat bugs, like roasted, in a soup, or even in a smoothie. Trust me, you won't even taste the difference. Eating bugs is a great way to get your protein intake up without harming the environment. Bugs are a sustainable source of food that are packed with nutrients. You can eat them roasted, in a soup, or even in a smoothie. Trust me, you won't even taste the difference. So why not give it a try? Roasting bugs is a great way to make them crispy and delicious. Just toss them in some olive oil and sea salt, and roast them in the oven at 400 degrees for 20 minutes. Then, add them to your favorite salad or enjoy them as a snack. If you're looking for a heartier meal, try making a soup with bugs. This is a great way to get all the nutrients and protein that bugs have to offer. Just add some bugs to your favorite soup recipe, or try out a new one. You can even add bugs to your morning smoothie. This is a great way to get all the nutrients from bugs without even tasting them. Just add a handful of bugs to your favorite smoothie recipe and blend away. You won't even know they're there! Eating bugs is a great way to get your protein intake up without harming the environment. Bugs are a sustainable source of food that are packed with nutrients. So why not give it a try? https://youtu.be/TxYSUwJjmYQ https://github.com/daveshap/IncreasinglyVerbose submitted by /u/DavidKShapiro [link] [comments]  ( 1 min )
Is there any AI I can use for free to simulate hypothetical wars?
Like a downloadable program or something online? submitted by /u/Warm_Lingonberry_204 [link] [comments]  ( 1 min )
    All roads lead to AGI
    submitted by /u/bendee983 [link] [comments]
    Crystals: GPT-3 Explanation Narrated and Visualized in [4K 60 FPS] w/ VQGAN + CLIP
    submitted by /u/MLInsights [link] [comments]
9+ Best Computer Vision Books for Beginners & Advanced to read in 2022
    submitted by /u/Lakshmireddys [link] [comments]
    MLEM - The First Open, Git-based Machine Learning Model Deployment and Management Tool Introduced
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
Fun with the InspiroBot generator
[Seven InspiroBot-generated image previews] submitted by /u/Difficult-Reality103 [link] [comments]
    Mandala - GPT-3 explains
    submitted by /u/MLInsights [link] [comments]
  • Open

    How to Make the Universe Think for Us: Physicists are building neural networks out of vibrations, voltages and lasers
    submitted by /u/nickb [link] [comments]  ( 1 min )
  • Open

    Best Practices for Deploying Language Models
    Cohere, OpenAI, and AI21 Labs have developed a preliminary set of best practices applicable to any organization developing or deploying large language models. Computers that can read and write are here, and they have the potential to fundamentally impact daily life. The future of human–machine interaction is full  ( 3 min )
  • Open

    JONA-ROBOT THE PROPHETESS, WHALE LABORATORY & HOPE
    On Techno-Prophets in The Service of Humanity Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 1 min )
  • Open

    GFN Thursday Jumps Into June With 25 New Games Coming This Month
    Celebrate the onset of summer this GFN Thursday with 25 more games joining the GeForce NOW library, including seven additions this week. Because why would you ever go outside? Looking to spend the summer months in Space Marine armor? Games Workshop is kicking off its Warhammer Skulls event for its sixth year, with great discounts Read article > The post GFN Thursday Jumps Into June With 25 New Games Coming This Month appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    DisPFL: Towards Communication-Efficient Personalized Federated Learning via Decentralized Sparse Training. (arXiv:2206.00187v1 [cs.LG])
Personalized federated learning is proposed to handle the data heterogeneity problem amongst clients by learning dedicated tailored local models for each user. However, existing works are often built in a centralized way, leading to high communication pressure and high vulnerability when a failure or an attack on the central server occurs. In this work, we propose a novel personalized federated learning framework in a decentralized (peer-to-peer) communication protocol named DisPFL, which employs personalized sparse masks to customize sparse local models on the edge. To further save the communication and computation cost, we propose a decentralized sparse training technique, which means that each local model in DisPFL only maintains a fixed number of active parameters throughout the whole local training and peer-to-peer communication process. Comprehensive experiments demonstrate that DisPFL significantly saves the communication bottleneck for the busiest node among all clients and, at the same time, achieves higher model accuracy with less computation cost and communication rounds. Furthermore, we demonstrate that our method can easily adapt to heterogeneous local clients with varying computation complexities and achieves better personalized performances.  ( 2 min )
    On Gap-dependent Bounds for Offline Reinforcement Learning. (arXiv:2206.00177v1 [cs.LG])
    This paper presents a systematic study on gap-dependent sample complexity in offline reinforcement learning. Prior work showed when the density ratio between an optimal policy and the behavior policy is upper bounded (the optimal policy coverage assumption), then the agent can achieve an $O\left(\frac{1}{\epsilon^2}\right)$ rate, which is also minimax optimal. We show under the optimal policy coverage assumption, the rate can be improved to $O\left(\frac{1}{\epsilon}\right)$ when there is a positive sub-optimality gap in the optimal $Q$-function. Furthermore, we show when the visitation probabilities of the behavior policy are uniformly lower bounded for states where an optimal policy's visitation probabilities are positive (the uniform optimal policy coverage assumption), the sample complexity of identifying an optimal policy is independent of $\frac{1}{\epsilon}$. Lastly, we present nearly-matching lower bounds to complement our gap-dependent upper bounds.  ( 2 min )
    Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts. (arXiv:2206.00586v1 [cs.LG])
    There is rising interest in industrial online applications where data becomes available sequentially. Inspired by the recommendation of playlists to users, where their preferences can be collected during the listening of the entire playlist, we study a novel bandit setting, namely the Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB), in which the stochastic reward associated with the pull of an arm is partitioned over a finite number of consecutive rounds following the pull. This setting, unexplored so far to the best of our knowledge, is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull, instead of being fully disclosed in a single, potentially delayed round. We provide two algorithms to address TP-MAB problems, namely TP-UCB-FR and TP-UCB-EW, which exploit the partial information disclosed by the reward collected over time. We show that our algorithms provide better asymptotic regret upper bounds than delayed-feedback bandit algorithms when a property characterizing a broad set of reward structures of practical interest, namely alpha-smoothness, holds. We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem.  ( 2 min )
    Vietnamese Hate and Offensive Detection using PhoBERT-CNN and Social Media Streaming Data. (arXiv:2206.00524v1 [cs.CL])
    Society needs systems that detect hateful and offensive content in order to build a healthy and safe environment. However, current research in this field still faces four major shortcomings: deficient pre-processing techniques, indifference to data imbalance issues, models with modest performance, and a lack of practical applications. This paper focuses on developing an intelligent system capable of addressing these shortcomings. Firstly, we propose an efficient pre-processing technique to clean comments collected from Vietnamese social media. Secondly, a novel hate speech detection (HSD) model, which combines a pre-trained PhoBERT model with a Text-CNN model, is proposed for solving tasks in Vietnamese. Thirdly, EDA techniques are applied to deal with imbalanced data and improve the performance of classification models. Besides, various experiments were conducted as baselines to compare and investigate the proposed model's performance against state-of-the-art methods. The experimental results show that the proposed PhoBERT-CNN model outperforms SOTA methods, achieving F1-scores of 67.46% and 98.45% on two benchmark datasets, ViHSD and HSD-VLSP, respectively. Finally, we also built a streaming HSD application to demonstrate the practicality of our proposed system.  ( 2 min )
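    To make the model combination concrete, here is a hedged PyTorch sketch of the PhoBERT-CNN idea; the layer sizes and kernel widths below are our assumptions, not the paper's configuration. Token states from a pre-trained encoder feed a Text-CNN head with several kernel widths.

    ```python
    import torch
    import torch.nn as nn

    class BertTextCNN(nn.Module):
        def __init__(self, hidden=768, n_filters=128, widths=(3, 4, 5), n_classes=2):
            super().__init__()
            self.convs = nn.ModuleList(nn.Conv1d(hidden, n_filters, w) for w in widths)
            self.fc = nn.Linear(n_filters * len(widths), n_classes)

        def forward(self, token_states):          # [batch, seq_len, hidden]
            x = token_states.transpose(1, 2)      # Conv1d expects [batch, hidden, seq]
            # one max-pooled feature vector per kernel width, then concatenate
            pooled = [c(x).relu().max(dim=2).values for c in self.convs]
            return self.fc(torch.cat(pooled, dim=1))

    # token_states would come from a pre-trained encoder such as PhoBERT
    # (e.g. AutoModel.from_pretrained("vinai/phobert-base")); random here.
    logits = BertTextCNN()(torch.randn(8, 64, 768))
    ```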
    Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus. (arXiv:2206.00159v1 [cs.LG])
    This paper considers offline multi-agent reinforcement learning. We propose the strategy-wise concentration principle, which directly builds a confidence interval for the joint strategy, in contrast to the point-wise concentration principle that builds a confidence interval for each point in the joint action space. For two-player zero-sum Markov games, by exploiting the convexity of the strategy-wise bonus, we propose a computationally efficient algorithm whose sample complexity enjoys a better dependency on the number of actions than prior methods based on the point-wise bonus. Furthermore, for offline multi-agent general-sum Markov games, based on the strategy-wise bonus and a novel surrogate function, we give the first algorithm whose sample complexity only scales with $\sum_{i=1}^m A_i$, where $A_i$ is the action size of the $i$-th player and $m$ is the number of players. In sharp contrast, the sample complexity of methods based on the point-wise bonus would scale with the size of the joint action space $\Pi_{i=1}^m A_i$ due to the curse of multiagents. Lastly, all of our algorithms can naturally take a pre-specified strategy class $\Pi$ as input and output a strategy that is close to the best strategy in $\Pi$. In this setting, the sample complexity only scales with $\log |\Pi|$ instead of $\sum_{i=1}^m A_i$.  ( 2 min )
    Graph Machine Learning for Design of High-Octane Fuels. (arXiv:2206.00619v1 [cs.LG])
    Fuels with high knock resistance enable modern spark-ignition engines to achieve high efficiency and thus low CO2 emissions. The identification of molecules with desired autoignition properties, indicated by a high research octane number and a high octane sensitivity, is therefore of great practical relevance and can be supported by computer-aided molecular design (CAMD). Recent developments in the field of graph machine learning (graph-ML) provide novel, promising tools for CAMD. We propose a modular graph-ML CAMD framework that integrates generative graph-ML models with graph neural networks and optimization, enabling the design of molecules with desired ignition properties in a continuous molecular space. In particular, we explore the potential of Bayesian optimization and genetic algorithms in combination with generative graph-ML models. The graph-ML CAMD framework successfully identifies well-established high-octane components. It also suggests new candidates, one of which we experimentally investigate and use to illustrate the need for further auto-ignition training data.  ( 2 min )
    A Simple Structure For Building A Robust Model. (arXiv:2204.11596v2 [cs.CV] UPDATED)
    As deep learning applications, especially computer vision programs, are increasingly deployed in our lives, we have to think more urgently about the security of these applications. One effective way to improve the security of deep learning models is to perform adversarial training, which allows the model to be compatible with samples that are deliberately created to attack the model. Based on this, we propose a simple architecture to build a model with a certain degree of robustness, which improves the robustness of the trained network by adding an adversarial-sample detection network for cooperative training. At the same time, we design a new data sampling strategy that incorporates multiple existing attacks, allowing the model to adapt to many different adversarial attacks with a single training. We conducted experiments to test the effectiveness of this design on the CIFAR-10 dataset, and the results indicate that it has some degree of positive effect on the robustness of the model. Our code can be found at https://github.com/dowdyboy/simple_structure_for_robust_model .  ( 2 min )
    Temporal Multiresolution Graph Neural Networks For Epidemic Prediction. (arXiv:2205.14831v2 [cs.LG] UPDATED)
    In this paper, we introduce Temporal Multiresolution Graph Neural Networks (TMGNN), the first architecture that both learns to construct multiscale and multiresolution graph structures and incorporates time-series signals to capture the temporal changes of dynamic graphs. We applied our proposed model to the task of predicting the future spread of epidemics and pandemics based on historical time-series data collected from the actual COVID-19 pandemic and a chickenpox epidemic in several European countries, and obtained competitive results in comparison to previous state-of-the-art temporal architectures and graph learning algorithms. We show that capturing the multiscale and multiresolution structures of graphs is important for extracting either local or global information that plays a critical role in understanding the dynamics of a global pandemic such as COVID-19, which started in a local city and spread to the whole world. Our work opens a promising research direction in forecasting and mitigating future epidemics and pandemics.  ( 2 min )
    Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top. (arXiv:2206.00529v1 [cs.LG])
    Byzantine-robustness has been gaining a lot of attention due to the growing interest in collaborative and federated learning. However, many fruitful directions, such as the usage of variance reduction for achieving robustness and communication compression for reducing communication costs, remain weakly explored in the field. This work addresses this gap and proposes Byz-VR-MARINA - a new Byzantine-tolerant method with variance reduction and compression. A key message of our paper is that variance reduction is key to fighting Byzantine workers more effectively. At the same time, communication compression is a bonus that makes the process more communication efficient. We derive theoretical convergence guarantees for Byz-VR-MARINA outperforming the previous state-of-the-art for general non-convex and Polyak-Lojasiewicz loss functions. Unlike concurrent Byzantine-robust methods with variance reduction and/or compression, our complexity results are tight and do not rely on restrictive assumptions such as boundedness of the gradients or limited compression. Moreover, we provide the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients. Numerical experiments corroborate our theoretical findings.  ( 2 min )
    Taming Continuous Posteriors for Latent Variational Dialogue Policies. (arXiv:2205.07633v2 [cs.CL] UPDATED)
    Utilizing amortized variational inference for latent-action reinforcement learning (RL) has been shown to be an effective approach in Task-oriented Dialogue (ToD) systems for optimizing dialogue success. Until now, categorical posteriors have been argued to be one of the main drivers of performance. In this work we revisit Gaussian variational posteriors for latent-action RL and show that they can yield even better performance than categoricals. We achieve this by simplifying the training procedure and proposing ways to regularize the latent dialogue policy to retain good response coherence. Using continuous latent representations, our model achieves a state-of-the-art dialogue success rate on the MultiWOZ benchmark, and also compares well to categorical latent methods in response coherence.  ( 2 min )
    A multimodal model with Twitter FinBERT embeddings for extreme price movement prediction of Bitcoin. (arXiv:2206.00648v1 [q-fin.ST])
    Bitcoin, with its ever-growing popularity, has demonstrated extreme price volatility since its origin. This volatility, together with its decentralised nature, makes Bitcoin highly subject to speculative trading as compared to more traditional assets. In this paper, we propose a multimodal model for predicting extreme price fluctuations. This model takes as input a variety of correlated assets, technical indicators, as well as Twitter content. In an in-depth study, we explore whether social media discussions from the general public on Bitcoin have predictive power for extreme price movements. A dataset of 5,000 tweets per day containing the keyword `Bitcoin' was collected from 2015 to 2021. This dataset, called PreBit, is made available online. In our hybrid model, we use sentence-level FinBERT embeddings, pretrained on financial lexicons, so as to capture the full content of the tweets and feed it to the model in an understandable way. By combining these embeddings with a Convolutional Neural Network, we built a predictive model for significant market movements. The final multimodal ensemble model includes this NLP model together with a model based on candlestick data, technical indicators and correlated asset prices. In an ablation study, we explore the contribution of the individual modalities. Finally, we propose and backtest a trading strategy based on the predictions of our models with varying prediction thresholds, and show that it can be used to build a profitable trading strategy with reduced risk over a `hold' or moving-average strategy.  ( 2 min )
    Contextual Bandits with Knapsacks for a Conversion Model. (arXiv:2206.00314v1 [cs.LG])
    We consider contextual bandits with knapsacks, with an underlying structure between rewards generated and cost vectors suffered. We do so motivated by sales with commercial discounts. At each round, given the stochastic i.i.d.\ context $\mathbf{x}_t$ and the arm picked $a_t$ (corresponding, e.g., to a discount level), a customer conversion may be obtained, in which case a reward $r(a,\mathbf{x}_t)$ is gained and vector costs $c(a_t,\mathbf{x}_t)$ are suffered (corresponding, e.g., to losses of earnings). Otherwise, in the absence of a conversion, the reward and costs are null. The reward and costs achieved are thus coupled through the binary variable measuring conversion or the absence thereof. This underlying structure between rewards and costs is different from the linear structures considered by Agrawal and Devanur [2016] but we show that the techniques introduced in this article may also be applied to the latter case. Namely, the adaptive policies exhibited solve at each round a linear program based on upper-confidence estimates of the probabilities of conversion given $a$ and $\mathbf{x}$. This kind of policy is most natural and achieves a regret bound of the typical order (OPT/$B$) $\sqrt{T}$, where $B$ is the total budget allowed, OPT is the optimal expected reward achievable by a static policy, and $T$ is the number of rounds.  ( 2 min )
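    The per-round computation can be made concrete with a small sketch (numpy + scipy; the constraint set below is our simplification of the abstract): choose a distribution over arms that maximizes the UCB-estimated expected reward subject to spending at most the per-round budget B/T in expectation.

    ```python
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    n_arms, B, T = 5, 100.0, 1000
    p_ucb = rng.uniform(0.05, 0.4, n_arms)   # UCB on conversion probability per arm
    reward = rng.uniform(0.5, 2.0, n_arms)   # reward earned if a conversion occurs
    cost = rng.uniform(0.1, 1.0, n_arms)     # (scalar) cost suffered on conversion

    # maximize sum_a pi_a p_a r_a  s.t.  sum_a pi_a p_a c_a <= B/T,  pi a distribution
    res = linprog(
        c=-(p_ucb * reward),                        # linprog minimizes, so negate
        A_ub=(p_ucb * cost)[None, :], b_ub=[B / T],
        A_eq=np.ones((1, n_arms)), b_eq=[1.0],
        bounds=[(0.0, 1.0)] * n_arms,
    )
    pi = np.clip(res.x, 0.0, None)
    arm = rng.choice(n_arms, p=pi / pi.sum())       # arm to pull this round
    ```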
    Federated Learning in Satellite Constellations. (arXiv:2206.00307v1 [cs.IT])
    Distributed machine learning (DML) results from the synergy between machine learning and connectivity. Federated learning (FL) is a prominent instance of DML in which intermittently connected mobile clients contribute to the training of a common learning model. This paper presents the new context brought to FL by satellite constellations where the connectivity patterns are significantly different from the ones assumed in terrestrial FL. We provide a taxonomy of different types of satellite connectivity relevant for FL and show how the distributed training process can overcome the slow convergence due to long offline times of clients by taking advantage of the predictable intermittency of the satellite communication links.  ( 2 min )
    Realistic Deep Learning May Not Fit Benignly. (arXiv:2206.00501v1 [cs.LG])
    Studies on benign overfitting provide insights into the success of overparameterized deep learning models. In this work, we examine the benign overfitting phenomenon in real-world settings. We found that for tasks such as training a ResNet model on the ImageNet dataset, the model does not fit benignly. To understand why benign overfitting fails in the ImageNet experiment, we analyze previous benign overfitting models under a more restrictive setup where the number of parameters is not significantly larger than the number of data points. Under this mild overparameterization setup, our analysis identifies a phase change: unlike in the heavy overparameterization setting, benign overfitting can now fail in the presence of label noise. Our study explains our empirical observations, and naturally leads to a simple technique known as self-training that can boost the model's generalization performance. Furthermore, our work highlights the importance of understanding implicit bias in underfitting regimes as a future direction.  ( 2 min )
    Higher-Order Attention Networks. (arXiv:2206.00606v1 [cs.LG])
    This paper introduces higher-order attention networks (HOANs), a novel class of attention-based neural networks defined on a generalized higher-order domain called a combinatorial complex (CC). Similar to hypergraphs, CCs admit arbitrary set-like relations between a collection of abstract entities. Simultaneously, CCs permit the construction of hierarchical higher-order relations analogous to those supported by cell complexes. Thus, CCs effectively generalize both hypergraphs and cell complexes and combine their desirable characteristics. By exploiting the rich combinatorial nature of CCs, HOANs define a new class of message-passing attention-based networks that unifies higher-order neural networks. Our evaluation on tasks related to mesh shape analysis and graph learning demonstrates that HOANs attain competitive, and in some examples superior, predictive performance in comparison to state-of-the-art neural networks.  ( 2 min )
    Control of Two-way Coupled Fluid Systems with Differentiable Solvers. (arXiv:2206.00342v1 [cs.LG])
    We investigate the use of deep neural networks to control complex nonlinear dynamical systems, specifically the movement of a rigid body immersed in a fluid. We solve the Navier-Stokes equations with two-way coupling, which gives rise to nonlinear perturbations that make the control task very challenging. Neural networks are trained in an unsupervised way to act as controllers with desired characteristics through a process of learning from a differentiable simulator. Here we introduce a set of physically interpretable loss terms to let the networks learn robust and stable interactions. We demonstrate that controllers trained in a canonical setting with quiescent initial conditions reliably generalize to varied and challenging environments such as previously unseen inflow conditions and forcing, although they do not have any fluid information as input. Further, we show that controllers trained with our approach outperform a variety of classical and learned alternatives in terms of evaluation metrics and generalization capabilities.  ( 2 min )
    Decentralized Competing Bandits in Non-Stationary Matching Markets. (arXiv:2206.00120v1 [stat.ML])
    Understanding the complex dynamics of two-sided online matching markets, where the demand-side agents compete to match with the supply side (arms), has recently received substantial interest. To that end, in this paper, we introduce the framework of decentralized two-sided matching markets under non-stationary (dynamic) environments. We adhere to the serial dictatorship setting, where the demand-side agents have unknown and different preferences over the supply side (arms), but the arms have fixed and known preferences over the agents. We propose and analyze a decentralized and asynchronous learning algorithm, namely Decentralized Non-stationary Competing Bandits (\texttt{DNCB}), where the agents play (restrictive) successive elimination type learning algorithms to learn their preferences over the arms. The complexity in understanding such a system stems from the fact that the competing bandits choose their actions in an asynchronous fashion, and the lower ranked agents only get to learn from a set of arms not \emph{dominated} by the higher ranked agents, which leads to \emph{forced exploration}. With carefully defined complexity parameters, we characterize this \emph{forced exploration} and obtain sub-linear (logarithmic) regret for \texttt{DNCB}. Furthermore, we validate our theoretical findings via experiments.
    MAD-EN: Microarchitectural Attack Detection through System-wide Energy Consumption. (arXiv:2206.00101v1 [cs.CR])
    Microarchitectural attacks have become a greater threat to hardware security with the increasing diversity of attacks such as Spectre and Meltdown. Vendor patches cannot keep up with the pace of the new threats, which makes dynamic anomaly-detection tools more essential than before. Unfortunately, previous studies utilize hardware performance counters, which lead to high performance overhead and profile only a limited number of microarchitectural attacks due to the small number of counters that can be profiled concurrently. This renders such detection tools inefficient in real-world scenarios. In this study, we introduce MAD-EN, a dynamic detection tool that leverages system-wide energy consumption traces collected with the generic Intel RAPL tool to detect ongoing anomalies in a system. In our experiments, we show that the CNN-based MAD-EN can detect 10 different microarchitectural attacks with a total of 15 variants, with a highest F1 score of 0.999, which makes our tool the most generic attack detection tool so far. Moreover, individual attacks can be distinguished with 98% accuracy after an anomaly is detected in a system. We demonstrate that MAD-EN introduces 69.3% less performance overhead compared to performance counter-based detection mechanisms.
    PAGER: Progressive Attribute-Guided Extendable Robust Image Generation. (arXiv:2206.00162v1 [cs.CV])
    This work presents a generative modeling approach based on successive subspace learning (SSL). Unlike most generative models in the literature, our method does not utilize neural networks to analyze the underlying source distribution and synthesize images. The resulting method, called the progressive attribute-guided extendable robust image generative (PAGER) model, has advantages in mathematical transparency, progressive content generation, lower training time, robust performance with fewer training samples, and extendibility to conditional image generation. PAGER consists of three modules: core generator, resolution enhancer, and quality booster. The core generator learns the distribution of low-resolution images and performs unconditional image generation. The resolution enhancer increases image resolution via conditional generation. Finally, the quality booster adds finer details to generated images. Extensive experiments on MNIST, Fashion-MNIST, and CelebA datasets are conducted to demonstrate generative performance of PAGER.
    Self-supervised Learning for Label Sparsity in Computational Drug Repositioning. (arXiv:2206.00262v1 [cs.LG])
    Computational drug repositioning aims to discover new uses for marketed drugs, which can accelerate the drug development process and plays an important role in the existing drug discovery system. However, the number of validated drug-disease associations is scarce compared to the number of drugs and diseases in the real world. Too few labeled samples will make the classification model unable to learn effective latent factors of drugs, resulting in poor generalization performance. In this work, we propose a multi-task self-supervised learning framework for computational drug repositioning. The framework tackles label sparsity by learning a better drug representation. Specifically, we take the drug-disease association prediction problem as the main task, while the auxiliary task uses data augmentation strategies and contrastive learning to mine the internal relationships of the original drug features, so as to automatically learn a better drug representation without supervised labels. Through joint training, we ensure that the auxiliary task can improve the prediction accuracy of the main task. More precisely, the auxiliary task improves the drug representation and serves as additional regularization to improve generalization. Furthermore, we design a multi-input decoding network to improve the reconstruction ability of the autoencoder model. We evaluate our model using three real-world datasets. The experimental results demonstrate the effectiveness of the multi-task self-supervised learning framework, and its predictive ability is superior to the state-of-the-art model.
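    A hedged PyTorch sketch of the joint objective (our simplified stand-in, not the paper's architecture): a shared encoder is trained with the main association-prediction loss plus an InfoNCE-style auxiliary loss between two augmented views of the same drug's features.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    enc = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))
    disease_emb = nn.Embedding(100, 64)        # latent factors per disease

    def info_nce(z1, z2, tau=0.5):
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        return F.cross_entropy(z1 @ z2.t() / tau, torch.arange(z1.size(0)))

    drugs = torch.randn(32, 256)               # raw drug feature vectors (toy data)
    diseases = torch.randint(0, 100, (32,))
    labels = torch.randint(0, 2, (32,)).float()

    # main task: predict the drug-disease association
    scores = (enc(drugs) * disease_emb(diseases)).sum(dim=1)
    main = F.binary_cross_entropy_with_logits(scores, labels)

    # auxiliary task: contrast two feature-dropout views of the same drug
    v1, v2 = enc(F.dropout(drugs, 0.2)), enc(F.dropout(drugs, 0.2))
    loss = main + 0.1 * info_nce(v1, v2)       # joint training of both tasks
    loss.backward()
    ```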
    Strongly Augmented Contrastive Clustering. (arXiv:2206.00380v1 [cs.LG])
    Deep clustering has attracted increasing attention in recent years due to its capability for joint representation learning and clustering via deep neural networks. In its latest developments, contrastive learning has emerged as an effective technique to substantially enhance deep clustering performance. However, existing contrastive learning based deep clustering algorithms mostly focus on some carefully-designed augmentations (often with limited transformations to preserve the structure), referred to as weak augmentations, but fail to go beyond them to explore the opportunities offered by stronger augmentations (with more aggressive transformations or even severe distortions). In this paper, we present an end-to-end deep clustering approach termed strongly augmented contrastive clustering (SACC), which extends the conventional two-augmentation-view paradigm to multiple views and jointly leverages strong and weak augmentations for strengthened deep clustering. Particularly, we utilize a backbone network with triply-shared weights, where a strongly augmented view and two weakly augmented views are incorporated. Based on the representations produced by the backbone, the weak-weak view pair and the strong-weak view pairs are simultaneously exploited for instance-level contrastive learning (via an instance projector) and cluster-level contrastive learning (via a cluster projector), which, together with the backbone, can be jointly optimized in a purely unsupervised manner. Experimental results on five challenging image datasets show the superior performance of the proposed SACC approach over the state-of-the-art.
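    A rough PyTorch sketch of the view pairing as we read the abstract (the cluster-level loss over a cluster projector is analogous and omitted, and the pairing details are our assumption): one strongly and two weakly augmented views share a backbone, and the weak-weak pair plus the two strong-weak pairs each contribute an instance-level contrastive term.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def nce(a, b, tau=0.5):
        a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
        return F.cross_entropy(a @ b.t() / tau, torch.arange(a.size(0)))

    # one backbone with shared weights serves all three views
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
    inst_proj = nn.Linear(128, 64)        # instance-level projector

    weak1, weak2, strong = (torch.randn(8, 3, 32, 32) for _ in range(3))
    zw1, zw2, zs = (inst_proj(backbone(v)) for v in (weak1, weak2, strong))

    # weak-weak pair plus the two strong-weak pairs
    loss = nce(zw1, zw2) + nce(zs, zw1) + nce(zs, zw2)
    ```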
    DM$^2$: Distributed Multi-Agent Reinforcement Learning for Distribution Matching. (arXiv:2206.00233v1 [cs.MA])
    Current approaches to multi-agent cooperation rely heavily on centralized mechanisms or explicit communication protocols to ensure convergence. This paper studies the problem of distributed multi-agent learning without resorting to explicit coordination schemes. The proposed algorithm (DM$^2$) leverages distribution matching to facilitate independent agents' coordination. Each individual agent matches a target distribution of concurrently sampled trajectories from a joint expert policy. The theoretical analysis shows that under some conditions, if each agent optimizes their individual distribution matching objective, the agents increase a lower bound on the objective of matching the joint expert policy, allowing convergence to the joint expert policy. Further, if the distribution matching objective is aligned with a joint task, a combination of environment reward and distribution matching reward leads to the same equilibrium. Experimental validation on the StarCraft domain shows that combining the reward for distribution matching with the environment reward allows agents to outperform a fully distributed baseline. Additional experiments probe the conditions under which expert demonstrations need to be sampled in order to outperform the fully distributed baseline.
    Evaluating Gaussian Grasp Maps for Generative Grasping Models. (arXiv:2206.00432v1 [cs.RO])
    Generalising robotic grasping to previously unseen objects is a key task in general robotic manipulation. Current methods for training many antipodal generative grasping models rely on a binary ground-truth grasp map generated from the centre thirds of correctly labelled grasp rectangles. However, these binary maps do not accurately reflect the positions in which a robotic arm can correctly grasp a given object. We propose a continuous Gaussian representation of annotated grasps to generate ground-truth training data, which achieves a higher success rate on a simulated robotic grasping benchmark. Three modern generative grasping networks are trained with either binary or Gaussian grasp maps, along with recent advancements from the robotic grasping literature, such as discretisation of grasp angles into bins and an attentional loss function. Despite a negligible difference according to the standard rectangle metric, Gaussian maps better reproduce the training data and therefore improve success rates when tested on the same simulated robot arm, by avoiding collisions with the object: achieving 87.94\% accuracy. Furthermore, the best performing model is shown to operate with a high success rate when transferred to a real robotic arm, at high inference speeds, without the need for transfer learning. The system is then shown to be capable of performing grasps on an antagonistic physical object dataset benchmark.
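    The proposed ground-truth representation is easy to picture with a minimal numpy sketch (sigma below is a free parameter of ours, not the paper's calibrated value): rather than marking the centre third of each rectangle as 1, place a 2-D Gaussian at each annotated grasp centre so quality decays smoothly away from the ideal grasp point.

    ```python
    import numpy as np

    def gaussian_grasp_map(h, w, centres, sigma=8.0):
        ys, xs = np.mgrid[0:h, 0:w]
        q = np.zeros((h, w))
        for cy, cx in centres:
            g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
            q = np.maximum(q, g)     # keep the strongest grasp at each pixel
        return q

    # continuous quality map for two annotated grasp centres on a 224x224 image
    quality = gaussian_grasp_map(224, 224, centres=[(100, 80), (150, 160)])
    ```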
    Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer. (arXiv:2206.00481v1 [cs.CV])
    Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on big datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive bias. To this end, we propose a simple but effective self-supervised learning (SSL) strategy to train ViTs that, without any external annotation, can significantly improve the results. Specifically, we define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly with the downstream training. Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step. We investigated our proposed methods on several image benchmarks, finding that RelViT improves the SSL state-of-the-art methods by a large margin, especially on small datasets.
    A Kernelised Stein Statistic for Assessing Implicit Generative Models. (arXiv:2206.00149v1 [stat.ML])
    Synthetic data generation has become a key ingredient for training machine learning procedures, addressing tasks such as data augmentation, analysing privacy-sensitive data, or visualising representative samples. The quality of such synthetic data generators hence has to be assessed. As (deep) generative models for synthetic data often do not admit explicit probability distributions, classical statistical procedures for assessing model goodness-of-fit may not be applicable. In this paper, we propose a principled procedure to assess the quality of a synthetic data generator. The procedure is a kernelised Stein discrepancy (KSD)-type test which is based on a non-parametric Stein operator for the synthetic data generator of interest. This operator is estimated from samples which are obtained from the synthetic data generator and hence can be applied even when the model is only implicit. In contrast to classical testing, the sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed. Experimental results on synthetic distributions and trained generative models on synthetic and real datasets illustrate that the method shows improved power performance compared to existing approaches.
    CoNSoLe: Convex Neural Symbolic Learning. (arXiv:2206.00257v1 [cs.LG])
    Learning the underlying equation from data is a fundamental problem in many disciplines. Recent advances rely on Neural Networks (NNs) but do not provide theoretical guarantees in obtaining the exact equations owing to the non-convexity of NNs. In this paper, we propose Convex Neural Symbolic Learning (CoNSoLe) to seek convexity under mild conditions. The main idea is to decompose the recovering process into two steps and convexify each step. In the first step of searching for the right symbols, we convexify deep Q-learning. The key is to maintain double convexity for both the negative Q-function and the negative reward function in each iteration, leading to provable convexity of the negative optimal Q-function to learn the true symbol connections. Conditioned on the exact searching result, we construct a Locally Convex equation Learner (LoCaL) neural network to convexify the estimation of symbol coefficients. With such a design, we quantify a large region with strict convexity in the loss surface of LoCaL for commonly used physical functions. Finally, we demonstrate the superior performance of the CoNSoLe framework over the state-of-the-art on a diverse set of datasets.
    Convergence of Stein Variational Gradient Descent under a Weaker Smoothness Condition. (arXiv:2206.00508v1 [math.ST])
    Stein Variational Gradient Descent (SVGD) is an important alternative to the Langevin-type algorithms for sampling from probability distributions of the form $\pi(x) \propto \exp(-V(x))$. In the existing theory of Langevin-type algorithms and SVGD, the potential function $V$ is often assumed to be $L$-smooth. However, this restrictive condition excludes a large class of potential functions such as polynomials of degree greater than $2$. Our paper studies the convergence of the SVGD algorithm for distributions with $(L_0,L_1)$-smooth potentials. This relaxed smoothness assumption was introduced by Zhang et al. [2019a] for the analysis of gradient clipping algorithms. With the help of trajectory-independent auxiliary conditions, we provide a descent lemma establishing that the algorithm decreases the $\mathrm{KL}$ divergence at each iteration and prove a complexity bound for SVGD in the population limit in terms of the Stein Fisher information.
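    For readers unfamiliar with the algorithm being analyzed, here is a reference numpy sketch of one SVGD iteration with an RBF kernel (the standard update, independent of the paper's analysis): particles follow a kernelized gradient flow that decreases the KL divergence to the target.

    ```python
    import numpy as np

    def svgd_step(x, grad_log_p, step=0.1, h=1.0):
        """One SVGD update for particles x with shape [n, d]."""
        diff = x[:, None, :] - x[None, :, :]         # diff[i, j] = x_i - x_j
        k = np.exp(-(diff ** 2).sum(-1) / (2 * h))   # RBF kernel matrix [n, n]
        drive = k @ grad_log_p(x)                    # sum_j k(x_j, x_i) grad log p(x_j)
        repulse = (diff * k[:, :, None] / h).sum(axis=1)  # sum_j grad_{x_j} k(x_j, x_i)
        return x + step * (drive + repulse) / x.shape[0]

    # target: standard Gaussian, so grad log pi(x) = -x
    x = np.random.default_rng(0).normal(size=(50, 2)) * 3.0
    for _ in range(200):
        x = svgd_step(x, lambda z: -z)
    print(x.mean(axis=0), x.std(axis=0))             # should approach 0 and 1
    ```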
    Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization. (arXiv:2206.00176v1 [cs.LG])
    Discovering governing equations of complex dynamical systems directly from data is a central problem in scientific machine learning. In recent years, the sparse identification of nonlinear dynamics (SINDy) framework, powered by heuristic sparse regression methods, has become a dominant tool for learning parsimonious models. We propose an exact formulation of the SINDy problem using mixed-integer optimization (MIO) to solve the sparsity constrained regression problem to provable optimality in seconds. On a large number of canonical ordinary and partial differential equations, we illustrate the dramatic improvement of our approach in accurate model discovery while being more sample efficient, robust to noise, and flexible in accommodating physical constraints.
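    The sparsity-constrained regression at the heart of this formulation can be illustrated with a tiny brute-force stand-in for the MIO solver (numpy; the candidate library and dynamics below are made up): among all k-term subsets of the library, keep the least-squares fit with the smallest residual.

    ```python
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, 200)
    dx = 1.5 * x - 0.5 * x ** 3           # true dynamics: xdot = 1.5 x - 0.5 x^3

    # candidate function library evaluated on the data
    library = np.stack([np.ones_like(x), x, x**2, x**3, np.sin(x)], axis=1)
    names = ["1", "x", "x^2", "x^3", "sin(x)"]

    best, k = None, 2                      # sparsity budget: exactly k active terms
    for idx in itertools.combinations(range(library.shape[1]), k):
        cols = list(idx)
        coef, *_ = np.linalg.lstsq(library[:, cols], dx, rcond=None)
        err = np.sum((library[:, cols] @ coef - dx) ** 2)
        if best is None or err < best[0]:
            best = (err, cols, coef)

    _, cols, coef = best
    print({names[i]: c for i, c in zip(cols, coef)})  # recovers the x and x^3 terms
    ```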
    Open Environment Machine Learning. (arXiv:2206.00423v1 [cs.LG])
    Conventional machine learning studies generally assume closed-world scenarios where important factors of the learning process hold invariant. With the great success of machine learning, nowadays more and more practical tasks, particularly those involving open-world scenarios where important factors are subject to change, called open environment machine learning (Open ML) in this article, are being presented to the community. Evidently it is a grand challenge for machine learning to turn from the closed world to the open world. It becomes even more challenging since, in various big data tasks, data are usually accumulated over time, like streams, whereas it is hard to train the machine learning model after collecting all the data, as in conventional studies. This article briefly introduces some advances in this line of research, focusing on techniques concerning emerging new classes, decremental/incremental features, changing data distributions, and varied learning objectives, and discusses some theoretical issues.
    Fairness Transferability Subject to Bounded Distribution Shift. (arXiv:2206.00129v1 [cs.LG])
    Given an algorithmic predictor that is "fair" on some source distribution, will it still be fair on an unknown target distribution that differs from the source within some bound? In this paper, we study the transferability of statistical group fairness for machine learning predictors (i.e., classifiers or regressors) subject to bounded distribution shift, a phenomenon frequently caused by user adaptation to a deployed model or a dynamic environment. Herein, we develop a bound characterizing such transferability, flagging potentially inappropriate deployments of machine learning for socially consequential tasks. We first develop a framework for bounding violations of statistical fairness subject to distribution shift, formulating a generic upper bound for transferred fairness violation as our primary result. We then develop bounds for specific worked examples, adopting two commonly used fairness definitions (i.e., demographic parity and equalized odds) for two classes of distribution shift (i.e., covariate shift and label shift). Finally, we compare our theoretical bounds to deterministic models of distribution shift as well as real-world data.
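    A small numpy illustration of the quantity being bounded (all distributions here are invented): the demographic parity violation of a fixed classifier, measured on a source distribution and again after a covariate shift that moves one group's covariates.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def dp_violation(x, group, predict):
        """Demographic parity gap: |P(yhat=1 | g=0) - P(yhat=1 | g=1)|."""
        yhat = predict(x)
        return abs(yhat[group == 0].mean() - yhat[group == 1].mean())

    predict = lambda x: (x > 0.5).astype(float)   # fixed deployed classifier

    g = rng.integers(0, 2, 10_000)
    x_src = rng.normal(0.5 + 0.05 * g, 1.0)       # source: groups nearly aligned
    x_tgt = rng.normal(0.5 + 0.40 * g, 1.0)       # target: group 1 drifts upward

    # the violation grows with the size of the shift
    print(dp_violation(x_src, g, predict), dp_violation(x_tgt, g, predict))
    ```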
    Asymptotic Properties for Bayesian Neural Network in Besov Space. (arXiv:2206.00241v1 [stat.ML])
    Neural networks have shown great predictive power when dealing with various unstructured data such as images and natural languages. A Bayesian neural network captures the uncertainty of prediction by putting a prior distribution on the parameters of the model and computing the posterior distribution. In this paper, we show that a Bayesian neural network using a spike-and-slab prior has consistency with a nearly minimax convergence rate when the true regression function is in the Besov space. Even when the smoothness of the regression function is unknown, the same posterior convergence rate holds, and thus the spike-and-slab prior is adaptive to the smoothness of the regression function. We also consider a shrinkage prior and show that it has the same convergence rate. In other words, we propose a practical Bayesian neural network with guaranteed asymptotic properties.
    Online Nonsubmodular Minimization with Delayed Costs: From Full Information to Bandit Feedback. (arXiv:2205.07217v2 [cs.LG] UPDATED)
    Motivated by applications to online learning in sparse estimation and Bayesian optimization, we consider the problem of online unconstrained nonsubmodular minimization with delayed costs in both full information and bandit feedback settings. In contrast to previous works on online unconstrained submodular minimization, we focus on a class of nonsubmodular functions with special structure, and prove regret guarantees for several variants of the online and approximate online bandit gradient descent algorithms in static and delayed scenarios. We derive bounds for the agent's regret in the full information and bandit feedback setting, even if the delay between choosing a decision and receiving the incurred cost is unbounded. Key to our approach is the notion of $(\alpha, \beta)$-regret and the extension of the generic convex relaxation model from~\citet{El-2020-Optimal}, the analysis of which is of independent interest. We conduct and showcase several simulation studies to demonstrate the efficacy of our algorithms.
    AVIDA: Alternating method for Visualizing and Integrating Data. (arXiv:2206.00135v1 [q-bio.QM])
    High-dimensional multimodal data arises in many scientific fields. The integration of multimodal data becomes challenging when there is no known correspondence between the samples and the features of different datasets. To tackle this challenge, we introduce AVIDA, a framework for simultaneously performing data alignment and dimension reduction. In the numerical experiments, Gromov-Wasserstein optimal transport and t-distributed stochastic neighbor embedding are used as the alignment and dimension reduction modules respectively. We show that AVIDA correctly aligns high-dimensional datasets without common features with four synthesized datasets and two real multimodal single-cell datasets. Compared to several existing methods, we demonstrate that AVIDA better preserves structures of individual datasets, especially distinct local structures in the joint low-dimensional visualization, while achieving comparable alignment performance. Such a property is important in multimodal single-cell data analysis as some biological processes are uniquely captured by one of the datasets. In general applications, other methods can be used for the alignment and dimension reduction modules.
    Stochastic Gradient Methods with Preconditioned Updates. (arXiv:2206.00285v1 [math.OC])
    This work considers non-convex finite-sum minimization. There are a number of algorithms for such problems, but existing methods often work poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. Thus, here we include a preconditioner that is based upon Hutchinson's approach to approximating the diagonal of the Hessian, and couple it with several gradient-based methods to give new `scaled' algorithms: {\tt Scaled SARAH} and {\tt Scaled L-SVRG}. Theoretical complexity guarantees under smoothness assumptions are presented, and we prove linear convergence when both smoothness and the PL condition are assumed. Because our adaptively scaled methods use approximate partial second-order curvature information, they are better able to mitigate the impact of badly scaled problems, and this improved practical performance is demonstrated in the numerical experiments that are also presented in this work.
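    A PyTorch sketch of the preconditioner described above (a sketch of the idea, not the authors' Scaled SARAH / Scaled L-SVRG code): Hutchinson's estimator approximates diag(H) from Hessian-vector products with Rademacher probes, and the update divides the gradient coordinate-wise by that estimate.

    ```python
    import torch

    w = torch.randn(10, requires_grad=True)

    def loss_fn(w):
        return (w ** 4).sum() / 4 + (w ** 2).sum()   # toy badly-scaled objective

    loss = loss_fn(w)
    (g,) = torch.autograd.grad(loss, w, create_graph=True)

    diag_est = torch.zeros_like(w)
    n_probes = 10
    for _ in range(n_probes):
        z = torch.randint(0, 2, w.shape).float() * 2 - 1   # Rademacher +-1 probe
        # Hessian-vector product via a second backward pass
        (hz,) = torch.autograd.grad(g, w, grad_outputs=z, retain_graph=True)
        diag_est += z * hz / n_probes                       # E[z * (Hz)] = diag(H)

    with torch.no_grad():
        w -= 0.1 * g / (diag_est.abs() + 1e-3)   # preconditioned gradient step
    ```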
    A robust and lightweight deep attention multiple instance learning algorithm for predicting genetic alterations. (arXiv:2206.00455v1 [q-bio.QM])
    Deep-learning models based on whole-slide digital pathology images (WSIs) are becoming increasingly popular for predicting molecular biomarkers. Instance-based models have been the mainstream strategy for predicting genetic alterations using WSIs, although bag-based models along with self-attention-based algorithms have been proposed for other digital pathology applications. In this paper, we propose a novel Attention-based Multiple Instance Mutation Learning (AMIML) model for predicting gene mutations. AMIML comprises successive 1-D convolutional layers, a decoder, and a residual weight connection to facilitate the integration of a lightweight attention mechanism that detects the most predictive image patches. Using data for 24 clinically relevant genes from four cancer cohorts in The Cancer Genome Atlas (TCGA) studies (UCEC, BRCA, GBM and KIRC), we compared AMIML with one popular instance-based model and four recently published bag-based models (e.g., CHOWDER, HE2RNA, etc.). AMIML demonstrated excellent robustness, not only outperforming all five baseline algorithms in the vast majority of the tested genes (17 out of 24), but also providing near-best performance for the other seven genes. Conversely, the performance of the published baseline algorithms varied across different cancers/genes. In addition, compared to the published models for genetic alterations, AMIML provided a significant improvement for predicting a wide range of genes (e.g., KMT2C, TP53, and SETD2 for KIRC; ERBB2, BRCA1, and BRCA2 for BRCA; JAK1, POLE, and MTOR for UCEC) and produced outstanding predictive models for other clinically relevant gene mutations that have not been reported in the current literature. Furthermore, with its flexible and interpretable attention-based MIL pooling mechanism, AMIML can further zero in on and detect the most predictive image patches.
    Retrieval-Augmented Multilingual Keyphrase Generation with Retriever-Generator Iterative Training. (arXiv:2205.10471v2 [cs.CL] UPDATED)
    Keyphrase generation is the task of automatically predicting keyphrases given a piece of long text. Despite its recent flourishing, keyphrase generation in non-English languages hasn't been widely investigated. In this paper, we call attention to a new setting named multilingual keyphrase generation and we contribute two new datasets, EcommerceMKP and AcademicMKP, covering six languages. Technically, we propose a retrieval-augmented method for multilingual keyphrase generation to mitigate the data-shortage problem in non-English languages. The retrieval-augmented model leverages keyphrase annotations in English datasets to facilitate generating keyphrases in low-resource languages. Given a non-English passage, a cross-lingual dense passage retrieval module finds relevant English passages. Then the associated English keyphrases serve as external knowledge for keyphrase generation in the current language. Moreover, we develop a retriever-generator iterative training algorithm to mine pseudo-parallel passage pairs to strengthen the cross-lingual passage retriever. Comprehensive experiments and ablations show that the proposed approach outperforms all baselines.
    Predicting Political Ideology from Digital Footprints. (arXiv:2206.00397v1 [econ.GN])
    This paper proposes a new method to predict individual political ideology from digital footprints on one of the world's largest online discussion forums. We compiled a unique data set from the online discussion forum reddit that contains information on the political ideology of around 91,000 users as well as records of their comment frequency and the comments' text corpus in over 190,000 different subforums of interest. Applying a set of statistical learning approaches, we show that information about activity in non-political discussion forums alone can very accurately predict a user's political ideology. Depending on the model, we are able to predict the economic dimension of ideology with an accuracy of up to 90.63% and the social dimension with an accuracy of up to 82.02%. In comparison, using the textual features from actual comments does not improve predictive accuracy. Our paper highlights the importance of revealed digital behaviour to complement stated preferences from digital communication when analysing human preferences and behaviour using online data.
    Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation. (arXiv:2206.00270v1 [cs.LG])
    We study lifelong reinforcement learning (RL) in a regret minimization setting of linear contextual Markov decision process (MDP), where the agent needs to learn a multi-task policy while solving a streaming sequence of tasks. We propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks, which may be adaptively chosen based on the agent's past behaviors. Remarkably, our algorithm uses only sublinear number of planning calls, which means that the agent eventually learns a policy that is near optimal for multiple tasks (seen or unseen) without the need of deliberate planning. A key to this property is a new structural assumption that enables computation sharing across tasks during exploration. Specifically, for $K$ task episodes of horizon $H$, our algorithm has a regret bound $\tilde{\mathcal{O}}(\sqrt{(d^3+d^\prime d)H^4K})$ based on $\mathcal{O}(dH\log(K))$ number of planning calls, where $d$ and $d^\prime$ are the feature dimensions of the dynamics and rewards, respectively. This theoretical guarantee implies that our algorithm can enable a lifelong learning agent to accumulate experiences and learn to rapidly solve new tasks.
    Physical Modeling using Recurrent Neural Networks with Fast Convolutional Layers. (arXiv:2204.10125v2 [cs.SD] UPDATED)
    Discrete-time modeling of acoustic, mechanical and electrical systems is a prominent topic in the musical signal processing literature. Such models are mostly derived by discretizing a mathematical model, given in terms of ordinary or partial differential equations, using established techniques. Recent work has applied the techniques of machine-learning to construct such models automatically from data for the case of systems which have lumped states described by scalar values, such as electrical circuits. In this work, we examine how similar techniques are able to construct models of systems which have spatially distributed rather than lumped states. We describe several novel recurrent neural network structures, and show how they can be thought of as an extension of modal techniques. As a proof of concept, we generate synthetic data for three physical systems and show that the proposed network structures can be trained with this data to reproduce the behavior of these systems.
    Multi-block Min-max Bilevel Optimization with Applications in Multi-task Deep AUC Maximization. (arXiv:2206.00260v1 [math.OC])
    In this paper, we study multi-block min-max bilevel optimization problems, where the upper level is non-convex strongly-concave minimax objective and the lower level is a strongly convex objective, and there are multiple blocks of dual variables and lower level problems. Due to the intertwined multi-block min-max bilevel structure, the computational cost at each iteration could be prohibitively high, especially with a large number of blocks. To tackle this challenge, we present a single-loop randomized stochastic algorithm, which requires updates for only a constant number of blocks at each iteration. Under some mild assumptions on the problem, we establish its sample complexity of $\mathcal{O}(1/\epsilon^4)$ for finding an $\epsilon$-stationary point. This matches the optimal complexity for solving stochastic nonconvex optimization under a general unbiased stochastic oracle model. Moreover, we provide two applications of the proposed method in multi-task deep AUC (area under ROC curve) maximization and multi-task deep partial AUC maximization. Experimental results validate our theory and demonstrate the effectiveness of our method on problems with hundreds of tasks.
    Semantic Probabilistic Layers for Neuro-Symbolic Learning. (arXiv:2206.00426v1 [cs.LG])
    We design a predictive layer for structured-output prediction (SOP) that can be plugged into any neural network guaranteeing its predictions are consistent with a set of predefined symbolic constraints. Our Semantic Probabilistic Layer (SPL) can model intricate correlations, and hard constraints, over a structured output space all while being amenable to end-to-end learning via maximum likelihood. SPLs combine exact probabilistic inference with logical reasoning in a clean and modular way, learning complex distributions and restricting their support to solutions of the constraint. As such, they can faithfully, and efficiently, model complex SOP tasks beyond the reach of alternative neuro-symbolic approaches. We empirically demonstrate that SPLs outperform these competitors in terms of accuracy on challenging SOP tasks including hierarchical multi-label classification, pathfinding and preference learning, while retaining perfect constraint satisfaction.
    Top-down inference in an early visual cortex inspired hierarchical Variational Autoencoder. (arXiv:2206.00436v1 [q-bio.NC])
    Interpreting computations in the visual cortex as learning and inference in a generative model of the environment has received wide support both in neuroscience and cognitive science. However, hierarchical computation, a hallmark of visual cortical processing, has remained impervious to generative models because of a lack of adequate tools to address it. Here we capitalize on advances in Variational Autoencoders (VAEs) to investigate the early visual cortex with sparse coding hierarchical VAEs trained on natural images. We design alternative architectures that vary both in terms of the generative and the recognition components of the two-latent-layer VAE. We show that representations similar to those found in the primary and secondary visual cortices naturally emerge under mild inductive biases. Importantly, a nonlinear representation for texture-like patterns is a stable property of the high-level latent space, resistant to the specific architecture of the VAE and reminiscent of the secondary visual cortex. We show that a neuroscience-inspired choice of the recognition model, which features a top-down processing component, is critical for two signatures of computations with generative models: learning higher-order moments of the posterior beyond the mean, and image inpainting. Patterns in higher-order response statistics provide inspiration for neuroscience to interpret response correlations and for machine learning to evaluate the learned representations through more detailed characterization of the posterior.
    Lower and Upper Bounds for Numbers of Linear Regions of Graph Convolutional Networks. (arXiv:2206.00228v1 [cs.LG])
    Research on characterizing GNN expressiveness has attracted much attention as graph neural networks have achieved great success in the last five years. The number of linear regions has been considered a good measure for the expressivity of neural networks with piecewise linear activation. In this paper, we present some estimates for the number of linear regions of the classic graph convolutional networks (GCNs) in one-layer and multiple-layer scenarios. In particular, we obtain an optimal upper bound for the maximum number of linear regions for one-layer GCNs, and upper and lower bounds for multi-layer GCNs. The simulated estimates show that the true maximum number of linear regions is possibly closer to our estimated lower bound. These results imply that the number of linear regions of multi-layer GCNs is exponentially greater per parameter than that of one-layer GCNs in general. This suggests that deeper GCNs have more expressivity than shallow GCNs.
    Continuous Prediction with Experts' Advice. (arXiv:2206.00236v1 [cs.LG])
    Prediction with experts' advice is one of the most fundamental problems in online learning and captures many of its technical challenges. A recent line of work has looked at online learning through the lens of differential equations and continuous-time analysis. This viewpoint has yielded optimal results for several problems in online learning. In this paper, we employ continuous-time stochastic calculus in order to study the discrete-time experts' problem. We use these tools to design a continuous-time, parameter-free algorithm with improved guarantees for the quantile regret. We then develop an analogous discrete-time algorithm with a very similar analysis and identical quantile regret bounds. Finally, we design an anytime continuous-time algorithm with regret matching the optimal fixed-time rate when the gains are independent Brownian Motions; in many settings, this is the most difficult case. This gives some evidence that, even with adversarial gains, the optimal anytime and fixed-time regrets may coincide.
    OOD Link Prediction Generalization Capabilities of Message-Passing GNNs in Larger Test Graphs. (arXiv:2205.15117v2 [cs.LG] UPDATED)
    This work provides the first theoretical study on the ability of graph Message Passing Neural Networks (gMPNNs) -- such as Graph Neural Networks (GNNs) -- to perform inductive out-of-distribution (OOD) link prediction tasks, where deployment (test) graph sizes are larger than training graphs. We first prove non-asymptotic bounds showing that link predictors based on permutation-equivariant (structural) node embeddings obtained by gMPNNs can converge to a random guess as test graphs get larger. We then propose a theoretically-sound gMPNN that outputs structural pairwise (2-node) embeddings and prove non-asymptotic bounds showing that, as test graphs grow, these embeddings converge to embeddings of a continuous function that retains its ability to predict links OOD. Empirical results on random graphs show agreement with our theoretical results.
    Contrastive Principal Component Learning: Modeling Similarity by Augmentation Overlap. (arXiv:2206.00471v1 [cs.LG])
    Traditional self-supervised contrastive learning methods learn embeddings by pulling views of the same sample together and pushing views of different samples away. Since views of a sample are usually generated via data augmentations, the semantic relationship between samples is ignored. Based on the observation that semantically similar samples are more likely to have similar augmentations, we propose to measure similarity via the distribution of augmentations, i.e., how much the augmentations of two samples overlap. To handle the dimensional and computational complexity, we propose a novel Contrastive Principal Component Learning (CPCL) method composed of a contrastive-like loss and an on-the-fly projection loss to efficiently perform PCA on the augmentation feature, which encodes the augmentation distribution. By CPCL, the learned low-dimensional embeddings theoretically preserve the similarity of augmentation distribution between samples. Empirical results show our method can achieve competitive results against various traditional contrastive learning methods on different benchmarks.
    On Layer Normalizations and Residual Connections in Transformers. (arXiv:2206.00330v1 [cs.LG])
    From the perspective of the layer normalization (LN) position, Transformer architectures can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to adopt Pre-LN because training Post-LN with deep Transformers, e.g., ten or more layers, often becomes unstable, resulting in useless models. In contrast, however, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers, e.g., six or fewer layers. This study first investigates the reason for these discrepant observations empirically and theoretically, and discovers that (1) the LN in Post-LN is the source of the vanishing gradient problem that mainly causes the unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may enable effective training. Exploiting these new findings, we propose a method that provides both high stability and effective training via a simple modification of Post-LN. We conduct experiments on a wide range of text generation tasks and demonstrate that our method outperforms Pre-LN and enables stable training regardless of whether the layers are shallow or deep.
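    To make the contrast concrete, below is a minimal PyTorch sketch of the two block orderings (hypothetical layer sizes; the paper's proposed modification of Post-LN is not reproduced here):

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm is applied after each residual addition."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])  # LN sits on the residual path
        return self.ln2(x + self.ff(x))

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied to the sublayer input; the residual
    stream itself is never normalized."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]            # identity residual path
        return x + self.ff(self.ln2(x))
```

    Per the analysis above, the LN on the residual path of the Post-LN block is what causes vanishing gradients in deep stacks, while the identity path of Pre-LN avoids it at the cost of smaller gradient norms in higher layers.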
    Incentivizing Combinatorial Bandit Exploration. (arXiv:2206.00494v1 [cs.LG])
    Consider a bandit algorithm that recommends actions to self-interested users in a recommendation system. The users are free to choose other actions and need to be incentivized to follow the algorithm's recommendations. While the users prefer to exploit, the algorithm can incentivize them to explore by leveraging the information collected from the previous users. All published work on this problem, known as incentivized exploration, focuses on small, unstructured action sets and mainly targets the case when the users' beliefs are independent across actions. However, realistic exploration problems often feature large, structured action sets and highly correlated beliefs. We focus on a paradigmatic exploration problem with structure: combinatorial semi-bandits. We prove that Thompson Sampling, when applied to combinatorial semi-bandits, is incentive-compatible when initialized with a sufficient number of samples of each arm (where this number is determined in advance by the Bayesian prior). Moreover, we design incentive-compatible algorithms for collecting the initial samples.
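    For intuition, here is a minimal sketch of Thompson Sampling on a Bernoulli semi-bandit with a top-k action set; the Beta prior stands in for the initial per-arm samples that the incentive-compatibility result requires, and `pull` is a hypothetical environment callback:

```python
import numpy as np

def thompson_semibandit(n_arms, k, pull, T, prior=(1.0, 1.0), rng=np.random):
    """Thompson Sampling for a Bernoulli combinatorial semi-bandit with a
    top-k action set (the paper covers more general structures).
    `pull(arms)` returns one 0/1 reward per chosen arm (semi-bandit
    feedback)."""
    a = np.full(n_arms, prior[0])
    b = np.full(n_arms, prior[1])
    for _ in range(T):
        theta = rng.beta(a, b)            # one posterior sample per arm
        arms = np.argsort(theta)[-k:]     # best feasible subset under the sample
        rewards = np.asarray(pull(arms))  # observe each chosen arm's reward
        a[arms] += rewards
        b[arms] += 1 - rewards
    return a / (a + b)                    # posterior-mean estimates per arm
```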
    A Generalized Supervised Contrastive Learning Framework. (arXiv:2206.00384v1 [cs.CV])
    Based on recent remarkable achievements of contrastive learning in self-supervised representation learning, supervised contrastive learning (SupCon) has successfully extended batch contrastive approaches to the supervised context, outperforming cross-entropy on various datasets with ResNet backbones. In this work, we present GenSCL: a generalized supervised contrastive learning framework that seamlessly adapts modern image-based regularizations (such as Mixup-Cutmix) and knowledge distillation (KD) to SupCon via our generalized supervised contrastive loss. This loss extends the supervised contrastive loss by measuring the cross-entropy between the similarity of labels and that of latent features, so that a model can learn to what extent samples should be pulled closer to an anchor in the latent space. By explicitly and fully leveraging label information, GenSCL breaks the boundary between conventional positives and negatives, and any kind of pre-trained teacher classifier can be utilized. ResNet-50 trained with GenSCL, Mixup-Cutmix and KD achieves state-of-the-art accuracies of 97.6% and 84.7% on CIFAR10 and CIFAR100 without external data, significantly improving on the results reported for the original SupCon (by 1.6% and 8.2%, respectively). A PyTorch implementation is available at https://t.ly/yuUO.
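    One plausible reading of that loss (cross-entropy between label similarity and feature similarity) is sketched below in PyTorch, with soft labels `y` as might come from Mixup/CutMix mixing or a teacher's softened predictions; this is an illustration, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def generalized_supcon_loss(z, y, temperature=0.1):
    """Cross-entropy between label similarity and feature similarity.
    z: (B, d) embeddings; y: (B, C) hard or soft labels, so positives are
    graded rather than binary. Sketch only."""
    z = F.normalize(z, dim=1)
    mask = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = (z @ z.t()) / temperature
    logp = sim.masked_fill(~mask, float('-inf')).log_softmax(dim=1)
    target = (y @ y.t()) * mask.float()                # graded positives
    target = target / target.sum(dim=1, keepdim=True).clamp_min(1e-12)
    return -(target * logp.masked_fill(~mask, 0.0)).sum(dim=1).mean()
```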
    A comparative study between vision transformers and CNNs in digital pathology. (arXiv:2206.00389v1 [eess.IV])
    Recently, vision transformers were shown to be capable of outperforming convolutional neural networks when pretrained on sufficient amounts of data. In comparison to convolutional neural networks, vision transformers have a weaker inductive bias and therefore allow more flexible feature detection. Due to this promising feature detection ability, this work explores vision transformers for tumor detection in digital pathology whole slide images in four tissue types, and for tissue type identification. We compared the patch-wise classification performance of the vision transformer DeiT-Tiny to the state-of-the-art convolutional neural network ResNet18. Due to the sparse availability of annotated whole slide images, we further compared both models pretrained on large amounts of unlabeled whole-slide images using state-of-the-art self-supervised approaches. The results show that the vision transformer performed slightly better than the ResNet18 for three of four tissue types for tumor detection, while the ResNet18 performed slightly better for the remaining tasks. The aggregated predictions of both models on slide level were correlated, indicating that the models captured similar imaging features. Altogether, the vision transformer models performed on par with the ResNet18 while requiring more effort to train. In order to surpass the performance of convolutional neural networks, vision transformers might require more challenging tasks to benefit from their weak inductive bias.
    GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain. (arXiv:2109.02555v2 [cs.CL] UPDATED)
    Deep neural language models have set new breakthroughs in many tasks of Natural Language Processing (NLP). Recent work has shown that deep transformer language models (pretrained on large amounts of texts) can achieve high levels of task-specific few-shot performance comparable to state-of-the-art models. However, the ability of these large language models to perform few-shot transfer learning has not yet been explored in the biomedical domain. We investigated the performance of two powerful transformer language models, i.e. GPT-3 and BioBERT, in few-shot settings on various biomedical NLP tasks. The experimental results showed that, to a great extent, both models underperform a language model fine-tuned on the full training data. Although GPT-3 had already achieved near state-of-the-art results in few-shot knowledge transfer on open-domain NLP tasks, it could not perform as effectively as BioBERT, which is orders of magnitude smaller than GPT-3. Given that BioBERT was already pretrained on large biomedical text corpora, our study suggests that language models may largely benefit from in-domain pretraining in task-specific few-shot learning. However, in-domain pretraining seems not to be sufficient; novel pretraining and few-shot learning strategies are required in the biomedical NLP domain.
    WaveMix-Lite: A Resource-efficient Neural Network for Image Analysis. (arXiv:2205.14375v2 [cs.CV] UPDATED)
    Gains in the ability to generalize on image analysis tasks for neural networks have come at the cost of increased numbers of parameters and layers, dataset sizes, training and test computations, and GPU RAM. We introduce a new architecture -- WaveMix-Lite -- that can generalize on par with contemporary transformers and convolutional neural networks (CNNs) while needing fewer resources. WaveMix-Lite uses the 2D discrete wavelet transform to efficiently mix spatial information from pixels. WaveMix-Lite appears to be a versatile and scalable architectural framework that can be used for multiple vision tasks, such as image classification and semantic segmentation, without requiring significant architectural changes, unlike transformers and CNNs. It is able to meet or exceed several accuracy benchmarks while training on a single GPU. For instance, it achieves state-of-the-art accuracy on five EMNIST datasets, outperforms CNNs and transformers on ImageNet-1K (64$\times$64 images), and achieves an mIoU of 75.32 % on the Cityscapes validation set, while using less than one-fifth the number of parameters and half the GPU RAM of comparable CNNs or transformers. Our experiments show that while the convolutional elements of neural architectures exploit the shift-invariance property of images, new types of layers (e.g., wavelet transform) can exploit additional properties of images, such as scale-invariance and finite spatial extents of objects.
    A model aggregation approach for high-dimensional large-scale optimization. (arXiv:2205.07525v2 [cs.LG] UPDATED)
    Bayesian optimization (BO) has been widely used in machine learning and simulation optimization. With the increase in computational resources and storage capacities in these fields, high-dimensional and large-scale problems are becoming increasingly common. In this study, we propose a model aggregation method in the Bayesian optimization (MamBO) algorithm for efficiently solving high-dimensional large-scale optimization problems. MamBO uses a combination of subsampling and subspace embeddings to collectively address high dimensionality and large-scale issues; in addition, a model aggregation method is employed to address the surrogate model uncertainty issue that arises when embedding is applied. This surrogate model uncertainty issue is largely ignored in the embedding literature and practice, and it is exacerbated when the problem is high-dimensional and data are limited. Our proposed model aggregation method reduces these lower-dimensional surrogate model risks and improves the robustness of the BO algorithm. We derive an asymptotic bound for the proposed aggregated surrogate model and prove the convergence of MamBO. Benchmark numerical experiments indicate that our algorithm achieves superior or comparable performance to other commonly used high-dimensional BO algorithms. Moreover, we apply MamBO to a cascade classifier of a machine learning algorithm for face detection, and the results reveal that MamBO finds settings that achieve higher classification accuracy than the benchmark settings and is computationally faster than other high-dimensional BO algorithms.
    Graph Neural Networks are Dynamic Programmers. (arXiv:2203.15544v2 [cs.LG] UPDATED)
    Recent advances in neural algorithmic reasoning with graph neural networks (GNNs) are propped up by the notion of algorithmic alignment. Broadly, a neural network will be better at learning to execute a reasoning task (in terms of sample complexity) if its individual components align well with the target algorithm. Specifically, GNNs are claimed to align with dynamic programming (DP), a general problem-solving strategy which expresses many polynomial-time algorithms. However, has this alignment truly been demonstrated and theoretically quantified? Here we show, using methods from category theory and abstract algebra, that there exists an intricate connection between GNNs and DP, going well beyond the initial observations over individual algorithms such as Bellman-Ford. Exposing this connection, we easily verify several prior findings in the literature, produce better-grounded GNN architectures for edge-centric tasks, and demonstrate empirical results on the CLRS algorithmic reasoning benchmark. We hope our exposition will serve as a foundation for building stronger algorithmically aligned GNNs.
    TEE-based decentralized recommender systems: The raw data sharing redemption. (arXiv:2202.11655v2 [cs.DC] UPDATED)
    Recommenders are central in many applications today. The most effective recommendation schemes, such as those based on collaborative filtering (CF), exploit similarities between user profiles to make recommendations, but potentially expose private data. Federated learning and decentralized learning systems address this by letting the data stay on users' machines to preserve privacy: each user performs the training on local data and only the model parameters are shared. However, sharing the model parameters across the network may still yield privacy breaches. In this paper, we present REX, the first enclave-based decentralized CF recommender. REX exploits Trusted Execution Environments (TEEs), such as Intel Software Guard Extensions (SGX), that provide shielded environments within the processor to improve convergence while preserving privacy. Firstly, REX enables raw data sharing, which ultimately speeds up convergence and reduces the network load. Secondly, REX fully preserves privacy. We analyze the impact of raw data sharing in both deep neural network (DNN) and matrix factorization (MF) recommenders and showcase the benefits of trusted environments in a full-fledged implementation of REX. Our experimental results demonstrate that through raw data sharing, REX significantly decreases the training time by 18.3x and the network load by 2 orders of magnitude over standard decentralized approaches that share only parameters, while fully protecting privacy by leveraging trustworthy hardware enclaves with very little overhead.
    Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography. (arXiv:2203.11579v2 [quant-ph] UPDATED)
    We propose a distributed Quantum State Tomography (QST) protocol, named Local Stochastic Factored Gradient Descent (Local SFGD), to learn the low-rank factor of a density matrix over a set of local machines. QST is the canonical procedure to characterize the state of a quantum system, which we formulate as a stochastic nonconvex smooth optimization problem. Physically, the estimation of a low-rank density matrix helps characterize the amount of noise introduced by quantum computation. Theoretically, we prove the local convergence of Local SFGD for a general class of restricted strongly convex/smooth loss functions, i.e., Local SFGD converges locally to a small neighborhood of the global optimum at a linear rate with a constant step size, while it locally converges exactly at a sub-linear rate with diminishing step sizes. With a proper initialization, local convergence results imply global convergence. We validate our theoretical findings with numerical simulations of QST on the Greenberger-Horne-Zeilinger (GHZ) state.
    Fishr: Invariant Gradient Variances for Out-of-Distribution Generalization. (arXiv:2109.02934v3 [cs.LG] UPDATED)
    Learning robust models that generalize well under changes in the data distribution is critical for real-world applications. To this end, there has been a growing surge of interest to learn simultaneously from multiple training domains - while enforcing different types of invariance across those domains. Yet, all existing approaches fail to show systematic benefits under controlled evaluation protocols. In this paper, we introduce a new regularization - named Fishr - that enforces domain invariance in the space of the gradients of the loss: specifically, the domain-level variances of gradients are matched across training domains. Our approach is based on the close relations between the gradient covariance, the Fisher Information and the Hessian of the loss: in particular, we show that Fishr eventually aligns the domain-level loss landscapes locally around the final weights. Extensive experiments demonstrate the effectiveness of Fishr for out-of-distribution generalization. Notably, Fishr improves the state of the art on the DomainBed benchmark and performs consistently better than Empirical Risk Minimization. Our code is available at https://github.com/alexrame/fishr.
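    The core penalty is easy to state: compute the variance of per-sample loss gradients within each domain and match these variances across domains. A deliberately naive sketch follows (the paper restricts to classifier-head gradients and computes per-sample gradients far more efficiently):

```python
import torch

def fishr_penalty(losses_per_domain, params):
    """Match domain-level gradient variances. `losses_per_domain` is a list
    of 1-D tensors of per-sample losses (one tensor per training domain);
    `params` is a tuple of parameters. Naive: one backward pass per sample."""
    variances = []
    for losses in losses_per_domain:
        grads = [torch.autograd.grad(l, params, retain_graph=True,
                                     create_graph=True) for l in losses]
        g = torch.stack([torch.cat([p.flatten() for p in gr]) for gr in grads])
        variances.append(g.var(dim=0))        # per-coordinate variance over samples
    mean_var = torch.stack(variances).mean(dim=0)
    return sum(((v - mean_var) ** 2).sum() for v in variances)
```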
    AgraSSt: Approximate Graph Stein Statistics for Interpretable Assessment of Implicit Graph Generators. (arXiv:2203.03673v2 [stat.ML] UPDATED)
    We propose and analyse a novel statistical procedure, coined AgraSSt, to assess the quality of graph generators that may not be available in explicit form. In particular, AgraSSt can be used to determine whether a learnt graph generating process is capable of generating graphs that resemble a given input graph. Inspired by Stein operators for random graphs, the key idea of AgraSSt is the construction of a kernel discrepancy based on an operator obtained from the graph generator. AgraSSt can provide interpretable criticisms for a graph generator training procedure and help identify reliable sample batches for downstream tasks. Using Stein's method we give theoretical guarantees for a broad class of random graph models. We provide empirical results on both synthetic input graphs with known graph generation procedures, and real-world input graphs that the state-of-the-art (deep) generative models for graphs are trained on.
    Online Learning for Min Sum Set Cover and Pandora's Box. (arXiv:2202.04870v2 [cs.LG] UPDATED)
    Two central problems in Stochastic Optimization are Min Sum Set Cover and Pandora's Box. In Pandora's Box, we are presented with $n$ boxes, each containing an unknown value and the goal is to open the boxes in some order to minimize the sum of the search cost and the smallest value found. Given a distribution of value vectors, we are asked to identify a near-optimal search order. Min Sum Set Cover corresponds to the case where values are either 0 or infinity. In this work, we study the case where the value vectors are not drawn from a distribution but are presented to a learner in an online fashion. We present a computationally efficient algorithm that is constant-competitive against the cost of the optimal search order. We extend our results to a bandit setting where only the values of the boxes opened are revealed to the learner after every round. We also generalize our results to other commonly studied variants of Pandora's Box and Min Sum Set Cover that involve selecting more than a single value subject to a matroid constraint.
    Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization. (arXiv:2203.02214v2 [cs.LG] UPDATED)
    Recent progress in state-only imitation learning extends the scope of applicability of imitation learning to real-world settings by relieving the need for observing expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans to reach the target. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DePO), which explicitly decouples the policy into a high-level state planner and an inverse dynamics model. With embedded decoupled policy gradient and generative adversarial training, DePO enables knowledge transfer to different action spaces or state transition dynamics, and can generalize the planner to out-of-demonstration state regions. Our in-depth experimental analysis shows the effectiveness of DePO in learning a generalized target state planner while achieving the best imitation performance. We demonstrate the appealing usage of DePO for transferring across different tasks by pre-training, and the potential for co-training agents with various skills.
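    A minimal sketch of the decoupling, with illustrative architectures and sizes: the policy is the composition of a state planner (predicting the next state) and an inverse dynamics model (recovering the action).

```python
import torch
import torch.nn as nn

class DecoupledPolicy(nn.Module):
    """pi(a|s) decoupled into a high-level state planner and an inverse
    dynamics model; architectures here are illustrative only."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.planner = nn.Sequential(   # predicts the planned next state
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, state_dim))
        self.inv_dyn = nn.Sequential(   # recovers the action for the transition
            nn.Linear(2 * state_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def forward(self, state):
        next_state = self.planner(state)
        return self.inv_dyn(torch.cat([state, next_state], dim=-1))
```

    Transfer to a different action space or transition dynamics then amounts to re-fitting `inv_dyn` while reusing the planner.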
    StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2. (arXiv:2112.14683v4 [cs.CV] UPDATED)
    Videos show continuous events, yet most $-$ if not all $-$ video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be $-$ time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image + video discriminators pair and design a holistic discriminator that aggregates temporal information by simply concatenating frames' features. This decreases the training cost and provides richer learning signal to the generator, making it possible to train directly on 1024$^2$ videos for the first time. We build our model on top of StyleGAN2 and it is just ${\approx}5\%$ more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at arbitrarily high frame rates, while prior work struggles to generate even 64 frames at a fixed rate. Our model is tested on four modern 256$^2$ and one 1024$^2$-resolution video synthesis benchmarks. In terms of sheer metrics, it performs on average ${\approx}30\%$ better than the closest runner-up. Project website: https://universome.github.io.
    Model Generation with Provable Coverability for Offline Reinforcement Learning. (arXiv:2206.00316v1 [cs.LG])
    Model-based offline optimization with a dynamics-aware policy provides a new perspective for policy learning and out-of-distribution generalization, in which the learned policy can adapt to different dynamics enumerated at the training stage. However, due to the limitations of the offline setting, the learned model cannot mimic the real dynamics well enough to support reliable out-of-distribution exploration, which still hinders the policy from generalizing well. To narrow this gap, previous works roughly ensemble randomly initialized models to better approximate the real dynamics. However, such practice is costly and inefficient, and provides no guarantee on how well the real dynamics can be approximated by the learned models, a property we name coverability in this paper. We actively address this issue by generating models with a provable ability to cover the real dynamics in an efficient and controllable way. To that end, we design a distance metric for dynamics models based on the occupancy of policies under the dynamics, and propose an algorithm to generate models that optimize their coverage of the real dynamics. We give a theoretical analysis of the model generation process and prove that our algorithm provides enhanced coverability. As a downstream task, we train a dynamics-aware policy with minor or no conservative penalty, and experiments demonstrate that our algorithm outperforms prior offline methods on existing offline RL benchmarks. We also discover that policies learned by our method have better zero-shot transfer performance, implying better generalization.
    Towards Context-Aware Neural Performance-Score Synchronisation. (arXiv:2206.00454v1 [cs.SD])
    Music can be represented in multiple forms, such as in the audio form as a recording of a performance, in the symbolic form as a computer readable score, or in the image form as a scan of the sheet music. Music synchronisation provides a way to navigate among multiple representations of music in a unified manner by generating an accurate mapping between them, making it applicable to a myriad of domains such as music education, performance analysis, automatic accompaniment and music editing. Traditional synchronisation methods compute alignment using knowledge-driven and stochastic approaches, typically employing handcrafted features. These methods are often unable to generalise well to different instruments, acoustic environments and recording conditions, and normally assume complete structural agreement between the performances and the scores. This PhD furthers the development of performance-score synchronisation research by proposing data-driven, context-aware alignment approaches, on three fronts: Firstly, I replace the handcrafted features by employing a metric learning based approach that is adaptable to different acoustic settings and performs well in data-scarce conditions. Secondly, I address the handling of structural differences between the performances and scores, which is a common limitation of standard alignment methods. Finally, I eschew the reliance on both feature engineering and dynamic programming, and propose a completely data-driven synchronisation method that computes alignments using a neural framework, whilst also being robust to structural differences between the performances and scores.
    Learning from Small Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales. (arXiv:2109.12784v4 [cs.LG] UPDATED)
    Motivated by the problem of learning with small sample sizes, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.
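    As a sketch, a maximum-similarity kernel over a group of transformations can be written as follows (hypothetical function names; `transforms` should include the identity):

```python
import numpy as np

def max_similarity_kernel(X, Y, transforms, gamma=1.0):
    """RBF kernel taking the maximum similarity over a group of input
    transformations (e.g., small image translations). X: (n, d), Y: (m, d);
    each transform maps a 1-D input to a 1-D input."""
    K = np.zeros((len(X), len(Y)))
    for t in transforms:
        Xt = np.stack([t(x) for x in X])
        d2 = ((Xt[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        K = np.maximum(K, np.exp(-gamma * d2))   # keep the best-matching transform
    return K
```

    The resulting Gram matrix can be passed to, e.g., sklearn.svm.SVC(kernel='precomputed'); as the paper shows, positive definiteness is not guaranteed in general but holds with high probability in the small-sample regime of interest.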
    Can Mean Field Control (MFC) Approximate Cooperative Multi Agent Reinforcement Learning (MARL) with Non-Uniform Interaction?. (arXiv:2203.00035v2 [cs.LG] UPDATED)
    Mean-Field Control (MFC) is a powerful tool to solve Multi-Agent Reinforcement Learning (MARL) problems. Recent studies have shown that MFC can well-approximate MARL when the population size is large and the agents are exchangeable. Unfortunately, the presumption of exchangeability implies that all agents uniformly interact with one another, which is not true in many practical scenarios. In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix. As a result, in our framework, the mean-fields 'seen' by different agents are different. We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then one can approximate such a non-uniform MARL problem via its associated MFC problem within an error of $e=\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}])$ where $N$ is the population size and $|\mathcal{X}|$, $|\mathcal{U}|$ are the sizes of state and action spaces respectively. Finally, we develop a Natural Policy Gradient (NPG) algorithm that can provide a solution to the non-uniform MARL with an error $\mathcal{O}(\max\{e,\epsilon\})$ and a sample complexity of $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon >0$.
    Neural Dual Contouring. (arXiv:2202.01999v3 [cs.CV] UPDATED)
    We introduce neural dual contouring (NDC), a new data-driven approach to mesh reconstruction based on dual contouring (DC). Like traditional DC, it produces exactly one vertex per grid cell and one quad for each grid edge intersection, a natural and efficient structure for reproducing sharp features. However, rather than computing vertex locations and edge crossings with hand-crafted functions that depend directly on difficult-to-obtain surface gradients, NDC uses a neural network to predict them. As a result, NDC can be trained to produce meshes from signed or unsigned distance fields, binary voxel grids, or point clouds (with or without normals); and it can produce open surfaces in cases where the input represents a sheet or partial surface. During experiments with five prominent datasets, we find that NDC, when trained on one of the datasets, generalizes well to the others. Furthermore, NDC provides better surface reconstruction accuracy, feature preservation, output complexity, triangle quality, and inference time in comparison to previous learned (e.g., neural marching cubes, convolutional occupancy networks) and traditional (e.g., Poisson) methods. Code and data are available at https://github.com/czq142857/NDC.
    Graph Self-supervised Learning with Accurate Discrepancy Learning. (arXiv:2202.02989v3 [cs.LG] UPDATED)
    Self-supervised learning of graph neural networks (GNNs) aims to learn an accurate representation of graphs in an unsupervised manner, to obtain transferable representations of them for diverse downstream tasks. Predictive learning and contrastive learning are the two most prevalent approaches for graph self-supervised learning. However, they have their own drawbacks. While the predictive learning methods can learn the contextual relationships between neighboring nodes and edges, they cannot learn global graph-level similarities. Contrastive learning, while able to learn global graph-level similarities, has an objective of maximizing the similarity between two differently perturbed graphs, which may result in representations that cannot discriminate between two similar graphs with different properties. To tackle such limitations, we propose a framework that aims to learn the exact discrepancy between the original and the perturbed graphs, coined Discrepancy-based Self-supervised LeArning (D-SLA). Specifically, we create multiple perturbations of the given graph with varying degrees of similarity, and train the model to predict whether each graph is the original graph or a perturbed one. Moreover, we further aim to accurately capture the amount of discrepancy for each perturbed graph using the graph edit distance. We validate our D-SLA on various graph-related downstream tasks, including molecular property prediction, protein function prediction, and link prediction tasks, on which ours largely outperforms relevant baselines.
    Sampling from Log-Concave Distributions with Infinity-Distance Guarantees. (arXiv:2111.04089v2 [cs.DS] UPDATED)
    For a $d$-dimensional log-concave distribution $\pi(\theta) \propto e^{-f(\theta)}$ constrained to a convex body $K$, the problem of outputting samples from a distribution $\nu$ which is $\varepsilon$-close in infinity-distance $\sup_{\theta \in K} |\log \frac{\nu(\theta)}{\pi(\theta)}|$ to $\pi$ arises in differentially private optimization. While sampling within total-variation distance $\varepsilon$ of $\pi$ can be done by algorithms whose runtime depends polylogarithmically on $\frac{1}{\varepsilon}$, prior algorithms for sampling in $\varepsilon$ infinity distance have runtime bounds that depend polynomially on $\frac{1}{\varepsilon}$. We bridge this gap by presenting an algorithm that outputs a point $\varepsilon$-close to $\pi$ in infinity distance that requires at most $\mathrm{poly}(\log \frac{1}{\varepsilon}, d)$ calls to a membership oracle for $K$ and evaluation oracle for $f$, when $f$ is Lipschitz. Our approach departs from prior works that construct Markov chains on a $\frac{1}{\varepsilon^2}$-discretization of $K$ to achieve a sample with $\varepsilon$ infinity-distance error, and present a method to directly convert continuous samples from $K$ with total-variation bounds to samples with infinity bounds. This approach also allows us to obtain an improvement on the dimension $d$ in the running time for the problem of sampling from a log-concave distribution on polytopes $K$ with infinity distance $\varepsilon$, by plugging in TV-distance running time bounds for the Dikin Walk Markov chain.
    Conformal prediction for the design problem. (arXiv:2202.03613v4 [cs.LG] UPDATED)
    Many applications of machine learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, one data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences, then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet lab is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data in the design setting -- one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data -- that is, the designed sequences -- has an unknown and possibly complex relationship with its error on the training data. We introduce a method to quantify predictive uncertainty in such settings. We do so by constructing confidence sets for predictions that account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any prediction algorithm, even when a trained model chooses the test-time input distribution. As a motivating use case, we demonstrate with several real data sets how our method quantifies uncertainty for the predicted fitness of designed proteins, and can therefore be used to select design algorithms that achieve acceptable trade-offs between high predicted fitness and low predictive uncertainty.
    Transformer with Fourier Integral Attentions. (arXiv:2206.00206v1 [cs.LG])
    Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels under the assumption that the queries follow a mixture of Gaussian distributions. There is no guarantee that this assumption is valid in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by novel generalized Fourier integral kernels. Different from the dot-product kernels, where we need to choose a good covariance matrix to capture the dependency of the features of data, the generalized Fourier integral kernels can automatically capture such dependency and remove the need to tune the covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over baseline transformers in a variety of practical applications including language modeling and image classification.
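    The kernel-regression reading of attention is a Nadaraya-Watson estimator: each output is a kernel-weighted average of the values. With the exponential dot-product kernel this is exactly softmax attention; the paper substitutes generalized Fourier integral kernels for the `weights` below.

```python
import numpy as np

def kernel_regression_attention(Q, K, V, bandwidth=1.0):
    """Attention as Nadaraya-Watson kernel regression. The exponential
    dot-product kernel recovers softmax attention; other kernels can be
    substituted for `weights`."""
    scores = Q @ K.T / bandwidth
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable exp kernel
    return (weights @ V) / weights.sum(axis=1, keepdims=True)     # normalized average
```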
    Multi-Complexity-Loss DNAS for Energy-Efficient and Memory-Constrained Deep Neural Networks. (arXiv:2206.00302v1 [cs.LG])
    Neural Architecture Search (NAS) is increasingly popular to automatically explore the accuracy versus computational complexity trade-off of Deep Learning (DL) architectures. When targeting tiny edge devices, the main challenge for DL deployment is matching the tight memory constraints, hence most NAS algorithms consider model size as the complexity metric. Other methods reduce the energy or latency of DL models by trading off accuracy and number of inference operations. Energy and memory are rarely considered simultaneously, in particular by low-search-cost Differentiable NAS (DNAS) solutions. We overcome this limitation by proposing the first DNAS that directly addresses the most realistic scenario from a designer's perspective: the co-optimization of accuracy and energy (or latency) under a memory constraint, determined by the target HW. We do so by combining two complexity-dependent loss functions during training, with independent strengths. Testing on three edge-relevant tasks from the MLPerf Tiny benchmark suite, we obtain rich Pareto sets of architectures in the energy vs. accuracy space, with memory footprint constraints spanning from 75% to 6.25% of the baseline networks. When deployed on a commercial edge device, the STM NUCLEO-H743ZI2, our networks span a range of 2.18x in energy consumption and 4.04% in accuracy for the same memory constraint, and reduce energy by up to 2.2x with negligible accuracy drop with respect to the baseline.
    Neonatal Bowel Sound Detection Using Convolutional Neural Network and Laplace Hidden Semi-Markov Model. (arXiv:2108.07467v3 [cs.SD] UPDATED)
    Abdominal auscultation is a convenient, safe and inexpensive method to assess bowel conditions, which is essential in neonatal care. It helps the early detection of neonatal bowel dysfunctions and allows timely intervention. This paper presents a neonatal bowel sound detection method to assist the auscultation. Specifically, a Convolutional Neural Network (CNN) is proposed to classify peristalsis and non-peristalsis sounds. The classification is then optimized using a Laplace Hidden Semi-Markov Model (HSMM). The proposed method is validated on abdominal sounds from 49 newborn infants admitted to our tertiary Neonatal Intensive Care Unit (NICU). The results show that the method can effectively detect bowel sounds with an accuracy of 89.81% and an area under curve (AUC) score of 83.96%, outperforming 13 baseline methods. Furthermore, the proposed Laplace HSMM refinement strategy is proven capable of enhancing other bowel sound detection models. The outcomes of this work have the potential to facilitate future telehealth applications for neonatal care. The source code of our work can be found at: https://bitbucket.org/chirudeakin/neonatal-bowel-sound-classification/
    Generative multitask learning mitigates target-causing confounding. (arXiv:2202.04136v2 [cs.LG] UPDATED)
    We propose a simple and scalable approach to causal representation learning for multitask learning. Our approach requires minimal modification to existing ML systems, and improves robustness to target shift. The improvement comes from mitigating unobserved confounders that cause the targets, but not the input. We refer to them as target-causing confounders. These confounders induce spurious dependencies between the input and targets. This poses a problem for the conventional approach to multitask learning, due to its assumption that the targets are conditionally independent given the input. Our proposed approach takes into account the dependencies between the targets in order to alleviate target-causing confounding. All that is required in addition to usual practice is to estimate the joint distribution of the targets to switch from discriminative to generative classification, and to predict all targets jointly. Our results on the Attributes of People and Taskonomy datasets reflect the conceptual improvement in robustness to target shift.
    Human-Algorithm Collaboration: Achieving Complementarity and Avoiding Unfairness. (arXiv:2202.08821v2 [cs.CY] UPDATED)
    Much of machine learning research focuses on predictive accuracy: given a task, create a machine learning model (or algorithm) that maximizes accuracy. In many settings, however, the final prediction or decision of a system is under the control of a human, who uses an algorithm's output along with their own personal expertise in order to produce a combined prediction. One ultimate goal of such collaborative systems is "complementarity": that is, to produce lower loss (equivalently, greater payoff or utility) than either the human or algorithm alone. However, experimental results have shown that even in carefully-designed systems, complementary performance can be elusive. Our work provides three key contributions. First, we provide a theoretical framework for modeling simple human-algorithm systems and demonstrate that multiple prior analyses can be expressed within it. Next, we use this model to prove conditions where complementarity is impossible, and give constructive examples of where complementarity is achievable. Finally, we discuss the implications of our findings, especially with respect to the fairness of a classifier. In sum, these results deepen our understanding of key factors influencing the combined performance of human-algorithm systems, giving insight into how algorithmic tools can best be designed for collaborative environments.
    Optimal Accounting of Differential Privacy via Characteristic Function. (arXiv:2106.08567v3 [cs.LG] UPDATED)
    Characterizing the privacy degradation over compositions, i.e., privacy accounting, is a fundamental topic in differential privacy (DP) with many applications to differentially private machine learning and federated learning. We propose a unification of recent advances (Renyi DP, privacy profiles, $f$-DP and the PLD formalism) via the \emph{characteristic function} ($\phi$-function) of a certain \emph{dominating} privacy loss random variable. We show that our approach allows \emph{natural} adaptive composition like Renyi DP, provides \emph{exactly tight} privacy accounting like PLD, and can be (often \emph{losslessly}) converted to privacy profile and $f$-DP, thus providing $(\epsilon,\delta)$-DP guarantees and interpretable tradeoff functions. Algorithmically, we propose an \emph{analytical Fourier accountant} that represents the \emph{complex} logarithm of $\phi$-functions symbolically and uses Gaussian quadrature for numerical computation. On several popular DP mechanisms and their subsampled counterparts, we demonstrate the flexibility and tightness of our approach in theory and experiments.
    Fair Comparison between Efficient Attentions. (arXiv:2206.00244v1 [cs.CV])
    Transformers have been successfully used in various fields and are becoming the standard tools in computer vision. However, self-attention, a core component of transformers, has a quadratic complexity problem, which limits the use of transformers in vision tasks that require dense prediction. Many studies aimed at solving this problem have been proposed, but owing to their different model configurations and training schemes, no comparative study of these methods at the same scale has been reported. In this paper, we validate these efficient attention models on the ImageNet1K classification task by changing only the attention operation and examining which efficient attention performs better.
    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v2 [stat.ML] UPDATED)
    A common approach to solving prediction tasks on large networks, such as node classification or link prediction, begins by learning a Euclidean embedding of the nodes of the network, to which traditional machine learning methods can then be applied. This includes methods such as DeepWalk and node2vec, which learn embeddings by optimizing stochastic losses formed over subsamples of the graph at each iteration of stochastic gradient descent. In this paper, we study the effect of adding an $\ell_2$ penalty on the embedding vectors to the training loss of these types of methods. We prove that, under some exchangeability assumptions on the graph, this asymptotically leads to learning a graphon with a nuclear-norm-type penalty, and we give guarantees for the asymptotic distribution of the learned embedding vectors. In particular, the exact form of the penalty depends on the choice of subsampling method used as part of stochastic gradient descent. We also illustrate empirically that concatenating node covariates to $\ell_2$-regularized node2vec embeddings leads to comparable, if not superior, performance relative to methods that incorporate node covariates and the network structure in a non-linear manner.
    Bayesian Optimisation for Robust Model Predictive Control under Model Parameter Uncertainty. (arXiv:2203.00551v3 [cs.RO] UPDATED)
    We propose an adaptive optimisation approach for tuning stochastic model predictive control (MPC) hyper-parameters while jointly estimating probability distributions of the transition model parameters based on performance rewards. In particular, we develop a Bayesian optimisation (BO) algorithm with a heteroscedastic noise model to deal with varying noise across the MPC hyper-parameter and dynamics model parameter spaces. Typical homoscedastic noise models are unrealistic for tuning MPC since stochastic controllers are inherently noisy, and the level of noise is affected by their hyper-parameter settings. We evaluate the proposed optimisation algorithm in simulated control and robotics tasks where we jointly infer control and dynamics parameters. Experimental results demonstrate that our approach leads to higher cumulative rewards and more stable controllers.
    Neural Network Verification with Proof Production. (arXiv:2206.00512v1 [cs.LO])
    Deep neural networks (DNNs) are increasingly being employed in safety-critical systems, and there is an urgent need to guarantee their correctness. Consequently, the verification community has devised multiple techniques and tools for verifying DNNs. When DNN verifiers discover an input that triggers an error, that is easy to confirm; but when they report that no error exists, there is no way to ensure that the verification tool itself is not flawed. As multiple errors have already been observed in DNN verification tools, this calls the applicability of DNN verification into question. In this work, we present a novel mechanism for enhancing Simplex-based DNN verifiers with proof production capabilities: the generation of an easy-to-check witness of unsatisfiability, which attests to the absence of errors. Our proof production is based on an efficient adaptation of the well-known Farkas' lemma, combined with mechanisms for handling piecewise-linear functions and numerical precision errors. As a proof of concept, we implemented our technique on top of the Marabou DNN verifier. Our evaluation on a safety-critical system for airborne collision avoidance shows that proof production succeeds in almost all cases and requires only minimal overhead.  ( 2 min )
    Rotate the ReLU to implicitly sparsify deep networks. (arXiv:2206.00488v1 [cs.LG])
    In the era of Deep Neural Network based solutions for a variety of real-life tasks, having a compact and energy-efficient deployable model has become fairly important. Most existing deep architectures use the Rectifier Linear Unit (ReLU) activation. In this paper, we propose the novel idea of rotating the ReLU activation to give one more degree of freedom to the architecture. We show that this activation, wherein the rotation is learned via training, results in the elimination of those parameters/filters in the network which are not important for the task. In other words, rotated ReLU appears to perform implicit sparsification. The slopes of the rotated ReLU activations act as coarse feature extractors, and unnecessary features can be eliminated before retraining. Our studies indicate that features always choose to pass through a smaller number of filters in architectures such as ResNet and its variants. Hence, by rotating the ReLU, the weights or filters that are not necessary are automatically identified and can be dropped, giving rise to significant savings in memory and computation. Furthermore, in some cases, we also notice that along with savings in memory and computation we also obtain improvements over the reported performance of the corresponding baseline work on popular datasets such as MNIST, CIFAR-10, CIFAR-100, and SVHN.  ( 2 min )
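    The abstract does not spell out the parameterization, so the following is a hypothetical sketch only: the ReLU graph rotated by a learnable per-channel angle, modeled as a two-slope piecewise-linear activation with theta = 0 recovering the standard ReLU. Channels whose learned slopes collapse toward zero would be the ones implicitly sparsified.

```python
import math
import torch
import torch.nn as nn

class RotatedReLU(nn.Module):
    """Hypothetical sketch of a rotated ReLU: slopes tan(theta) for x < 0
    and tan(theta + pi/4) for x >= 0, with a learnable per-channel theta;
    theta = 0 gives the standard ReLU. The paper's exact construction may
    differ; this only illustrates the extra degree of freedom."""
    def __init__(self, num_channels):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x):                  # x: (batch, channels, ...)
        shape = (1, -1) + (1,) * (x.dim() - 2)
        neg = torch.tan(self.theta).view(shape)
        pos = torch.tan(self.theta + math.pi / 4).view(shape)
        return torch.where(x >= 0, pos * x, neg * x)
```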
    A Transformer-based Network for Deformable Medical Image Registration. (arXiv:2202.12104v2 [eess.IV] UPDATED)
    Deformable medical image registration plays an important role in clinical diagnosis and treatment. Recently, deep learning (DL) based image registration methods have been widely investigated and have shown excellent computational speed. However, these methods cannot provide sufficient registration accuracy because of their limited ability to represent both the global and local features of the moving and fixed images. To address this issue, this paper proposes a transformer-based image registration method. This method uses a distinctive transformer to extract the global and local image features for generating the deformation fields, based on which the registered image is produced in an unsupervised way. Our method can improve the registration accuracy effectively by means of a self-attention mechanism and bi-level information flow. Experimental results on brain MR image datasets such as LPBA40 and OASIS-1 demonstrate that, compared with several traditional and DL based registration methods, our method provides higher registration accuracy in terms of dice values.  ( 2 min )
    On the Perils of Cascading Robust Classifiers. (arXiv:2206.00278v1 [cs.LG])
    Ensembling certifiably robust neural networks has been shown to be a promising approach for improving the \emph{certified robust accuracy} of neural models. Black-box ensembles that assume only query-access to the constituent models (and their robustness certifiers) during prediction are particularly attractive due to their modular structure. Cascading ensembles are a popular instance of black-box ensembles that appear to improve certified robust accuracies in practice. However, we find that the robustness certifier used by a cascading ensemble is unsound. That is, when a cascading ensemble is certified as locally robust at an input $x$, there can, in fact, be inputs $x'$ in the $\epsilon$-ball centered at $x$, such that the cascade's prediction at $x'$ is different from $x$. We present an alternate black-box ensembling mechanism based on weighted voting which we prove to be sound for robustness certification. Via a thought experiment, we demonstrate that if the constituent classifiers are suitably diverse, voting ensembles can improve certified performance. Our code is available at \url{https://github.com/TristaChi/ensembleKW}.  ( 2 min )
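    A sound certification rule for the weighted-voting alternative can be sketched as follows (our own illustrative construction, not necessarily the paper's exact procedure): within the $\epsilon$-ball, models certified at $x$ cannot change their vote, so the ensemble is certified whenever the winner survives all uncertified weight defecting to any rival.

```python
def certify_weighted_vote(preds, certified, weights, num_classes):
    """preds[i]: model i's class at x; certified[i]: whether model i is
    certified locally robust at x; weights[i]: its vote weight. Certify the
    ensemble iff the winner beats every rival even when all uncertified
    weight moves to that rival (conservative, hence sound)."""
    cert_votes = [0.0] * num_classes
    free = 0.0                                    # weight that may move adversarially
    for p, ok, w in zip(preds, certified, weights):
        if ok:
            cert_votes[p] += w
        else:
            free += w
    winner = max(range(num_classes), key=lambda c: cert_votes[c])
    robust = all(cert_votes[winner] > cert_votes[c] + free
                 for c in range(num_classes) if c != winner)
    return winner, robust
```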
    The Dimpled Manifold Model of Adversarial Examples in Machine Learning. (arXiv:2106.10151v2 [cs.LG] UPDATED)
    The extreme fragility of deep neural networks, when presented with tiny perturbations in their inputs, was independently discovered by several research groups in 2013. However, despite enormous effort, these adversarial examples remained a counterintuitive phenomenon with no simple testable explanation. In this paper, we introduce a new conceptual framework for how the decision boundary between classes evolves during training, which we call the {\em Dimpled Manifold Model}. In particular, we demonstrate that training is divided into two distinct phases. The first phase is a (typically fast) clinging process in which the initially randomly oriented decision boundary gets very close to the low dimensional image manifold, which contains all the training examples. Next, there is a (typically slow) dimpling phase which creates shallow bulges in the decision boundary that move it to the correct side of the training examples. This framework provides a simple explanation for why adversarial examples exist, why their perturbations have such tiny norms, and why they look like random noise rather than like the target class. This explanation is also used to show that a network that was adversarially trained with incorrectly labeled images might still correctly classify most test images, and to show that the main effect of adversarial training is just to deepen the generated dimples in the decision boundary. Finally, we discuss and demonstrate the very different properties of on-manifold and off-manifold adversarial perturbations. We describe the results of numerous experiments which strongly support this new model, using both low dimensional synthetic datasets and high dimensional natural datasets.  ( 2 min )
    Hopular: Modern Hopfield Networks for Tabular Data. (arXiv:2206.00664v1 [cs.LG])
    While Deep Learning excels on structured data as encountered in vision and natural language processing, it has failed to meet expectations on tabular data. For tabular data, Support Vector Machines (SVMs), Random Forests, and Gradient Boosting are the best performing techniques, with Gradient Boosting in the lead. Recently, we saw a surge of Deep Learning methods tailored to tabular data that still underperform compared to Gradient Boosting on small-sized datasets. We suggest "Hopular", a novel Deep Learning architecture for medium- and small-sized datasets, where each layer is equipped with continuous modern Hopfield networks. The modern Hopfield networks use stored data to identify feature-feature, feature-target, and sample-sample dependencies. Hopular's novelty is that every layer can directly access the original input as well as the whole training set via stored data in the Hopfield networks. Therefore, Hopular can step-wise update its current model and the resulting prediction at every layer, like standard iterative learning algorithms. In experiments on small-sized tabular datasets with less than 1,000 samples, Hopular surpasses Gradient Boosting, Random Forests, SVMs, and in particular several Deep Learning methods. In experiments on medium-sized tabular data with about 10,000 samples, Hopular outperforms XGBoost, CatBoost, LightGBM and a state-of-the-art Deep Learning method designed for tabular data. Thus, Hopular is a strong alternative to these methods on tabular data.  ( 2 min )
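    The layer primitive Hopular builds on is modern continuous Hopfield retrieval, whose update rule is a softmax attention over stored patterns; a minimal sketch (sizes illustrative):

```python
import torch

def hopfield_retrieve(queries, stored, beta=1.0, steps=1):
    """Modern continuous Hopfield retrieval: xi <- stored^T softmax(beta * stored xi).
    `stored`: (N, d) patterns (e.g., the training set); `queries`: (B, d).
    Retrieval typically converges in one step."""
    xi = queries
    for _ in range(steps):
        xi = torch.softmax(beta * stored @ xi.t(), dim=0).t() @ stored
    return xi
```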
    A Comparison of Different Approaches to Dynamic Origin-Destination Matrix Estimation in Urban Traffic. (arXiv:2202.00099v2 [math.OC] UPDATED)
    Given the counts of vehicles that traverse the roads of a traffic network, we reconstruct the travel demand that generated them, expressed in terms of the number of origin-destination trips made by users. We model the problem as a bi-level optimization problem. At the inner level, given a tentative demand, we solve a Dynamic Traffic Assignment (DTA) problem to decide the routing of the users between their origins and destinations. At the outer level, we adjust the number of trips and their origins and destinations to minimize the discrepancy between the counts generated at the inner level and the vehicle counts measured by sensors in the traffic network. We solve the DTA problem by employing a mesoscopic model implemented by the traffic simulator SUMO. Thus, the outer problem becomes an optimization problem that minimizes a black-box Objective Function (OF) determined by the results of the simulation, which is a costly computation. We study different approaches to the outer-level problem, categorized as gradient-based and derivative-free approaches. Among the gradient-based approaches, we look at an assignment matrix-based approach and an assignment matrix-free approach that uses the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm. Among the derivative-free approaches, we investigate Machine Learning (ML) algorithms to learn a model of the simulator that can then be used as a surrogate OF in the optimization problem. We compare these approaches computationally on an artificial network. The gradient-based approaches perform the best in terms of solution quality and computational requirements. In contrast, the results obtained by the ML approach are currently less satisfactory but provide an interesting avenue for future research.
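    For reference, one SPSA iteration needs only two black-box evaluations of the simulation-based objective, regardless of the dimension of the demand vector; a minimal sketch (in practice the step sizes `a` and `c` would decay over iterations):

```python
import numpy as np

def spsa_step(f, x, a=0.1, c=0.01, rng=np.random):
    """One SPSA iteration on a black-box objective f (here the discrepancy
    between simulated and measured counts): two evaluations give a gradient
    estimate in every coordinate at once via a random +/-1 perturbation."""
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    ghat = (f(x + c * delta) - f(x - c * delta)) / (2 * c * delta)
    return x - a * ghat
```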
    FETA: Fairness Enforced Verifying, Training, and Predicting Algorithms for Neural Networks. (arXiv:2206.00553v1 [cs.LG])
    Algorithmic decision making driven by neural networks has become very prominent in applications that directly affect people's quality of life. In this paper, we study the problem of verifying, training, and guaranteeing individual fairness of neural network models. A popular approach for enforcing fairness is to translate a fairness notion into constraints over the parameters of the model. However, such a translation does not always guarantee fair predictions of the trained neural network model. To address this challenge, we develop a counterexample-guided post-processing technique to provably enforce fairness constraints at prediction time. Contrary to prior work that enforces fairness only on points around test or train data, we are able to enforce and guarantee fairness on all points in the input domain. Additionally, we propose an in-processing technique to use fairness as an inductive bias by iteratively incorporating fairness counterexamples in the learning process. We have implemented these techniques in a tool called FETA. Empirical evaluation on real-world datasets indicates that FETA is not only able to guarantee fairness on-the-fly at prediction time but also is able to train accurate models exhibiting a much higher degree of individual fairness.
    Asymptotics of Network Embeddings Learned via Subsampling. (arXiv:2107.02363v2 [stat.ML] UPDATED)
    Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence in terms of the latent parameters, which include the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
    Asynchronous Hierarchical Federated Learning. (arXiv:2206.00054v1 [cs.LG])
    Federated Learning is a rapidly growing area of research with various benefits and industry applications. Typical federated patterns have some intrinsic issues such as heavy server traffic, long periods of convergence, and unreliable accuracy. In this paper, we address these issues by proposing asynchronous hierarchical federated learning, in which the central server uses either the network topology or some clustering algorithm to assign workers (i.e., client devices) to clusters. In each cluster, a special aggregator device is selected to enable hierarchical learning, leading to efficient communication between server and workers, so that the burden on the server can be significantly reduced. In addition, an asynchronous federated learning scheme is used to tolerate heterogeneity of the system and achieve fast convergence, i.e., the server aggregates the gradients from the workers weighted by a staleness parameter to update the global model, and regularized stochastic gradient descent is performed in the workers, so that the instability of asynchronous learning can be alleviated. We evaluate the proposed algorithm on the CIFAR-10 image classification task, and the experimental results demonstrate the effectiveness of asynchronous hierarchical federated learning.
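    A minimal sketch of the staleness-weighted server update described above; the polynomial discount is one common choice and not necessarily the paper's exact weighting:

        import numpy as np

        def staleness_weight(staleness, alpha=0.6):
            """Discount gradients by how many global versions they lag behind."""
            return (1.0 + staleness) ** (-alpha)

        def server_update(global_params, grad, staleness, lr=0.1):
            """Asynchronous server step: apply an aggregator's gradient,
            down-weighted by its staleness relative to the global model."""
            return global_params - lr * staleness_weight(staleness) * grad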
    Calibrate and Debias Layer-wise Sampling for Graph Convolutional Networks. (arXiv:2206.00583v1 [cs.LG])
    To accelerate the training of graph convolutional networks (GCNs), many sampling-based methods have been developed for approximating the embedding aggregation. Among them, a layer-wise approach recursively performs importance sampling to select neighbors jointly for existing nodes in each layer. This paper revisits the approach from a matrix approximation perspective. We identify two issues in the existing layer-wise sampling methods: sub-optimal sampling probabilities and the approximation bias induced by sampling without replacement. We propose two remedies: new sampling probabilities and a debiasing algorithm, to address these issues, and provide the statistical analysis of the estimation variance. The improvements are demonstrated by extensive analyses and experiments on common benchmarks.
    Graph Neural Networks with Precomputed Node Features. (arXiv:2206.00637v1 [cs.LG])
    Most Graph Neural Networks (GNNs) cannot distinguish some graphs or indeed some pairs of nodes within a graph. This makes it impossible to solve certain classification tasks. However, adding additional node features to these models can resolve this problem. We introduce several such augmentations, including (i) positional node embeddings, (ii) canonical node IDs, and (iii) random features. These extensions are motivated by theoretical results and corroborated by extensive testing on synthetic subgraph detection tasks. We find that positional embeddings significantly outperform other extensions in these tasks. Moreover, positional embeddings have better sample efficiency, perform well on different graph distributions and even outperform learning with ground truth node positions. Finally, we show that the different augmentations perform competitively on established GNN benchmarks, and advise on when to use them.
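    As an illustration of augmentation (iii), a short sketch that appends random features to a node feature matrix so that otherwise indistinguishable nodes become separable:

        import numpy as np

        def augment_node_features(X, num_random=8, rng=None):
            """Append i.i.d. random features to each node to break the symmetry
            between nodes that a plain GNN cannot distinguish."""
            rng = rng or np.random.default_rng(0)
            R = rng.normal(size=(X.shape[0], num_random))
            return np.concatenate([X, R], axis=1)

        # 100 nodes with 16 original features -> 24 features each.
        print(augment_node_features(np.ones((100, 16))).shape)  # (100, 24)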
    Consistent Collaborative Filtering via Tensor Decomposition. (arXiv:2201.11936v2 [cs.IR] UPDATED)
    Collaborative filtering is the de facto standard for analyzing users' activities and building recommendation systems for items. In this work we develop Sliced Anti-symmetric Decomposition (SAD), a new model for collaborative filtering based on implicit feedback. In contrast to traditional techniques, where latent representations of users (user vectors) and items (item vectors) are estimated, SAD introduces one additional latent vector to each item, using a novel three-way tensor view of user-item interactions. This new vector extends user-item preferences calculated by standard dot products to general inner products, producing interactions between items when evaluating their relative preferences and bringing fundamentally new information into recommendation. SAD reduces to state-of-the-art (SOTA) collaborative filtering models when the vector collapses to 1 (no new information), while in this paper we allow its value to be estimated from data. The proposed SAD model is simple, resulting in an efficient group stochastic gradient descent (SGD) algorithm. We demonstrate the efficiency of SAD on both simulated and real world datasets containing over 1M user-item interactions. By comparing SAD with seven alternative SOTA collaborative filtering models, we show that SAD not only estimates personalized preferences more consistently, but also produces more accurate personalized recommendations. We release the model and inference algorithms in a Python library https://github.com/apple/ml-sad.
    NeuroUnlock: Unlocking the Architecture of Obfuscated Deep Neural Networks. (arXiv:2206.00402v1 [cs.CR])
    The advancements of deep neural networks (DNNs) have led to their deployment in diverse settings, including safety and security-critical applications. As a result, the characteristics of these models have become sensitive intellectual properties that require protection from malicious users. Extracting the architecture of a DNN through leaky side-channels (e.g., memory access) allows adversaries to (i) clone the model, and (ii) craft adversarial attacks. DNN obfuscation thwarts side-channel-based architecture stealing (SCAS) attacks by altering the run-time traces of a given DNN while preserving its functionality. In this work, we expose the vulnerability of state-of-the-art DNN obfuscation methods to these attacks. We present NeuroUnlock, a novel SCAS attack against obfuscated DNNs. NeuroUnlock employs a sequence-to-sequence model that learns the obfuscation procedure and automatically reverts it, thereby recovering the original DNN architecture. We demonstrate the effectiveness of NeuroUnlock by recovering the architecture of 200 randomly generated and obfuscated DNNs running on the Nvidia RTX 2080 TI graphics processing unit (GPU). Moreover, NeuroUnlock recovers the architecture of various other obfuscated DNNs, such as the VGG-11, VGG-13, ResNet-20, and ResNet-32 networks. After recovering the architecture, NeuroUnlock automatically builds a near-equivalent DNN with only a 1.4% drop in the testing accuracy. We further show that launching a subsequent adversarial attack on the recovered DNNs boosts the success rate of the adversarial attack by 51.7% on average compared to launching it on the obfuscated versions. Additionally, we propose a novel methodology for DNN obfuscation, ReDLock, which eradicates the deterministic nature of the obfuscation and achieves 2.16X more resilience to the NeuroUnlock attack. We release NeuroUnlock and ReDLock as open-source frameworks.
    Learning-Augmented Algorithms for Online TSP on the Line. (arXiv:2206.00655v1 [cs.DS])
    We study the online Traveling Salesman Problem (TSP) on the line augmented with machine-learned predictions. In the classical problem, there is a stream of requests released over time along the real line. The goal is to minimize the makespan of the algorithm. We distinguish between the open variant and the closed one, in which we additionally require the algorithm to return to the origin after serving all requests. The state of the art is a $1.64$-competitive algorithm and a $2.04$-competitive algorithm for the closed and open variants, respectively \cite{Bjelde:1.64}. In both cases, a tight lower bound is known \cite{Ausiello:1.75, Bjelde:1.64}. In both variants, our primary prediction model involves predicted positions of the requests. We introduce algorithms that (i) obtain a tight 1.5 competitive ratio for the closed variant and a 1.66 competitive ratio for the open variant in the case of perfect predictions, (ii) are robust against unbounded prediction error, and (iii) are smooth, i.e., their performance degrades gracefully as the prediction error increases. Moreover, we further investigate the learning-augmented setting in the open variant by additionally considering a prediction for the last request served by the optimal offline algorithm. Our algorithm for this enhanced setting obtains a 1.33 competitive ratio with perfect predictions while also being smooth and robust, beating the lower bound of 1.44 we show for our original prediction setting for the open variant. Also, we provide a lower bound of 1.25 for this enhanced setting.
    High-Throughput Approach to Modeling Healthcare Costs Using Electronic Healthcare Records. (arXiv:2011.09497v2 [cs.LG] UPDATED)
    Accurate estimation of healthcare costs is crucial for healthcare systems to plan and effectively negotiate with insurance companies regarding the coverage of patient-care costs. Greater accuracy in estimating healthcare costs would provide mutual benefit for both health systems and the insurers that support these systems by better aligning payment models with patient-care costs. This study presents the results of a generalizable machine learning approach to predicting medical events built from 40 years of data from >860,000 patients pertaining to >6,700 prescription medications, courtesy of Marshfield Clinic in Wisconsin. It was found that models built using this approach performed well when compared to similar studies predicting physician prescriptions of individual medications. In addition to providing a comprehensive predictive model for all drugs in a large healthcare system, the approach taken in this research benefits from potential applicability to a wide variety of other medical events.
    Sparse Bayesian Deep Learning for Dynamic System Identification. (arXiv:2107.12910v2 [eess.SY] UPDATED)
    This paper proposes a sparse Bayesian treatment of deep neural networks (DNNs) for system identification. Although DNNs show impressive approximation ability in various fields, several challenges still exist for system identification problems. First, DNNs are known to be so complex that they can easily overfit the training data. Second, the selection of the input regressors for system identification is nontrivial. Third, uncertainty quantification of the model parameters and predictions is necessary. The proposed Bayesian approach offers a principled way to alleviate the above challenges by marginal likelihood/model evidence approximation and structured group sparsity-inducing priors construction. The identification algorithm is derived as an iterative regularised optimisation procedure that can be solved as efficiently as training typical DNNs. Remarkably, an efficient and recursive Hessian calculation method for each layer of DNNs is developed, turning the intractable training/optimisation process into a tractable one. Furthermore, a practical calculation approach based on the Monte-Carlo integration method is derived to quantify the uncertainty of the parameters and predictions. The effectiveness of the proposed Bayesian approach is demonstrated on several linear and nonlinear system identification benchmarks by achieving good and competitive simulation accuracy. The code to reproduce the experimental results is open-sourced and available online.
    On the Choice of Data for Efficient Training and Validation of End-to-End Driving Models. (arXiv:2206.00608v1 [cs.CV])
    The emergence of data-driven machine learning (ML) has facilitated significant progress in many complicated tasks such as highly-automated driving. While much effort is put into improving the ML models and learning algorithms in such applications, little focus is put into how the training data and/or validation setting should be designed. In this paper we investigate the influence of several data design choices regarding training and validation of deep driving models trainable in an end-to-end fashion. Specifically, (i) we investigate how the amount of training data influences the final driving performance, and which performance limitations are induced through currently used mechanisms to generate training data. (ii) Further, we show by correlation analysis, which validation design enables the driving performance measured during validation to generalize well to unknown test environments. (iii) Finally, we investigate the effect of random seeding and non-determinism, giving insights which reported improvements can be deemed significant. Our evaluations using the popular CARLA simulator provide recommendations regarding data generation and driving route selection for an efficient future development of end-to-end driving models.
    From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers. (arXiv:2107.07999v4 [cs.LG] UPDATED)
    In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformer architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However, by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several results unknown before, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.
    Algorithmic Fairness Verification with Graphical Models. (arXiv:2109.09447v2 [cs.LG] UPDATED)
    In recent years, machine learning (ML) algorithms have been deployed in safety-critical and high-stakes decision-making, where the fairness of algorithms is of paramount importance. Fairness in ML centers on detecting bias towards certain demographic populations induced by an ML classifier and proposes algorithmic solutions to mitigate the bias with respect to different fairness definitions. To this end, several fairness verifiers have been proposed that compute the bias in the prediction of an ML classifier--essentially beyond a finite dataset--given the probability distribution of input features. In the context of verifying linear classifiers, existing fairness verifiers are limited in accuracy, due to imprecise modeling of correlations among features, and in scalability, due to restrictive formulations of the classifiers as SSAT/SMT formulas or due to sampling. In this paper, we propose an efficient fairness verifier, called FVGM, that encodes the correlations among features as a Bayesian network. In contrast to existing verifiers, FVGM proposes a stochastic subset-sum based approach for verifying linear classifiers. Experimentally, we show that FVGM leads to an accurate and scalable assessment for more diverse families of fairness-enhancing algorithms, fairness attacks, and group/causal fairness metrics than the state-of-the-art fairness verifiers. We also demonstrate that FVGM facilitates the computation of fairness influence functions as a stepping stone to detect the source of bias induced by subsets of features.  ( 2 min )
    Physics-Informed Neural Nets for Control of Dynamical Systems. (arXiv:2104.02556v3 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) impose known physical laws into the learning of deep neural networks, making sure they respect the physics of the process while decreasing the demand for labeled data. For systems represented by Ordinary Differential Equations (ODEs), the conventional PINN has a continuous time input variable and outputs the solution of the corresponding ODE. In their original form, PINNs do not allow control inputs, nor can they simulate over variable long-range intervals without serious degradation in their predictions. In this context, this work presents a new framework called Physics-Informed Neural Nets for Control (PINC), a novel PINN-based architecture that is amenable to control problems and able to simulate over longer-range time horizons that are not fixed beforehand, making it a very flexible framework compared to traditional PINNs. Furthermore, this long-range time simulation of differential equations is faster than numerical methods, since it relies only on signal propagation through the network, making it less computationally costly and, thus, a better alternative for simulation of models in Model Predictive Control. We showcase our proposal in the control of two nonlinear dynamic systems: the Van der Pol oscillator and the four-tank system.  ( 2 min )
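    A sketch of the kind of physics residual a control-aware PINN minimizes; the interface net(t, x0, u) and the scalar state are assumptions made for illustration, not PINC's exact architecture:

        import torch

        def pinc_residual_loss(net, f, t, x0, u):
            """Physics residual of a control-aware PINN (scalar state for brevity).

            net(t, x0, u) -> x(t): predicted state at time t, conditioned on the
            initial state x0 and a control input u held over the horizon.
            f(x, u)       -> dx/dt: the known ODE right-hand side.
            """
            t = t.clone().requires_grad_(True)   # (N, 1) collocation times
            x = net(t, x0, u)                    # (N, 1) predicted states
            dxdt = torch.autograd.grad(x, t, torch.ones_like(x), create_graph=True)[0]
            return ((dxdt - f(x, u)) ** 2).mean()  # penalize ODE violations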
    Deep Learning Opacity in Scientific Discovery. (arXiv:2206.00520v1 [cs.AI])
    Philosophers have recently focused on critical, epistemological challenges that arise from the opacity of deep neural networks. One might conclude from this literature that doing good science with opaque models is exceptionally challenging, if not impossible. Yet, this is hard to square with the recent boom in optimism for AI in science alongside a flood of recent scientific breakthroughs driven by AI methods. In this paper, I argue that the disconnect between philosophical pessimism and scientific optimism is driven by a failure to examine how AI is actually used in science. I show that, in order to understand the epistemic justification for AI-powered breakthroughs, philosophers must examine the role played by deep learning as part of a wider process of discovery. The philosophical distinction between the 'context of discovery' and the 'context of justification' is helpful in this regard. I demonstrate the importance of attending to this distinction with two cases drawn from the scientific literature, and show that epistemic opacity need not diminish AI's capacity to lead scientists to significant and justifiable breakthroughs.  ( 2 min )
    Algorithmic Foundation of Deep X-Risk Optimization. (arXiv:2206.00439v1 [cs.LG])
    X-risk is a term introduced to represent a family of compositional measures or objectives, in which each data point is compared with a set of data points explicitly or implicitly for defining a risk function. It includes many widely used measures or objectives, e.g., AUROC, AUPRC, partial AUROC, NDCG, MAP, top-$K$ NDCG, top-$K$ MAP, listwise losses, p-norm push, top push, precision/recall at top $K$ positions, precision at a certain recall level, contrastive objectives, etc. While these measures/objectives and their optimization algorithms have been studied in the literature of machine learning, computer vision, information retrieval, etc., optimizing these measures/objectives has encountered some unique challenges for deep learning. In this technical report, we survey our recent rigorous efforts for deep X-risk optimization (DXO) by focusing on its algorithmic foundation. We introduce a class of techniques for optimizing X-risk for deep learning. We formulate DXO into three special families of non-convex optimization problems belonging to non-convex min-max optimization, non-convex compositional optimization, and non-convex bilevel optimization, respectively. For each family of problems, we present some strong baseline algorithms and their complexities, which will motivate further research for improving the existing results. Discussions about the presented results and future studies are given at the end.  ( 2 min )
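    As a concrete instance of an X-risk, a sketch of a pairwise squared-hinge surrogate for AUROC, in which every data point is compared against the set of oppositely labeled points:

        import torch

        def pairwise_auc_surrogate(scores_pos, scores_neg, margin=1.0):
            """Squared-hinge surrogate for AUROC: every positive score should
            exceed every negative score by a margin."""
            diff = scores_pos[:, None] - scores_neg[None, :]   # all pos-neg pairs
            return torch.clamp(margin - diff, min=0).pow(2).mean()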
    Feature Selection for Discovering Distributional Treatment Effect Modifiers. (arXiv:2206.00516v1 [cs.LG])
    Finding the features relevant to the difference in treatment effects is essential to unveil the underlying causal mechanisms. Existing methods seek such features by measuring how greatly the feature attributes affect the degree of the {\it conditional average treatment effect} (CATE). However, these methods may overlook important features because CATE, a measure of the average treatment effect, cannot detect differences in distribution parameters other than the mean (e.g., variance). To resolve this weakness of existing methods, we propose a feature selection framework for discovering {\it distributional treatment effect modifiers}. We first formulate a feature importance measure that quantifies how strongly the feature attributes influence the discrepancy between potential outcome distributions. Then we derive its computationally efficient estimator and develop a feature selection algorithm that can control the type I error rate to the desired level. Experimental results show that our framework successfully discovers important features and outperforms the existing mean-based method.  ( 2 min )
    Support Vector Machines under Adversarial Label Contamination. (arXiv:2206.00352v1 [cs.LG])
    Machine learning algorithms are increasingly being applied in security-related tasks such as spam and malware detection, although their security properties against deliberate attacks have not yet been widely understood. Intelligent and adaptive attackers may indeed exploit specific vulnerabilities exposed by machine learning techniques to violate system security. Being robust to adversarial data manipulation is thus an important, additional requirement for machine learning algorithms to successfully operate in adversarial settings. In this work, we evaluate the security of Support Vector Machines (SVMs) to well-crafted, adversarial label noise attacks. In particular, we consider an attacker that aims to maximize the SVM's classification error by flipping a number of labels in the training data. We formalize a corresponding optimal attack strategy, and solve it by means of heuristic approaches to keep the computational complexity tractable. We report an extensive experimental analysis on the effectiveness of the considered attacks against linear and non-linear SVMs, both on synthetic and real-world datasets. We finally argue that our approach can also provide useful insights for developing more secure SVM learning algorithms, and also novel techniques in a number of related research areas, such as semi-supervised and active learning.  ( 2 min )
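    A greedy sketch in the spirit of such a label-flip attack (illustrative only; the paper formalizes an optimal strategy and solves it with its own heuristics):

        from sklearn.svm import SVC

        def greedy_label_flip(X, y, budget, X_val, y_val):
            """Flip, one training label at a time (labels in {-1, +1}), whichever
            flip most increases the retrained SVM's validation error."""
            y = y.copy()
            for _ in range(budget):
                best_i, best_err = None, -1.0
                for i in range(len(y)):
                    y[i] = -y[i]   # tentatively flip label i
                    err = 1.0 - SVC(kernel="linear").fit(X, y).score(X_val, y_val)
                    y[i] = -y[i]   # undo the flip
                    if err > best_err:
                        best_i, best_err = i, err
                y[best_i] = -y[best_i]   # commit the most damaging flip
            return y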
    Elucidating the Design Space of Diffusion-Based Generative Models. (arXiv:2206.00364v1 [cs.CV])
    We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55.  ( 2 min )
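    Two of the separated design choices, sketched with the paper's default schedule parameters; denoise(x, sigma) stands in for the preconditioned network's estimate of the clean image:

        import torch

        def edm_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
            """Noise-level schedule: rho warps the spacing toward low noise."""
            i = torch.arange(n)
            return (sigma_max ** (1 / rho) + i / (n - 1)
                    * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

        def heun_sampler(denoise, x, sigmas):
            """Deterministic second-order (Heun) probability-flow sampler."""
            for s, s_next in zip(sigmas[:-1], sigmas[1:]):
                d = (x - denoise(x, s)) / s            # ODE slope at sigma = s
                x_euler = x + (s_next - s) * d
                if s_next > 0:
                    d2 = (x_euler - denoise(x_euler, s_next)) / s_next
                    x = x + (s_next - s) * 0.5 * (d + d2)   # Heun correction
                else:
                    x = x_euler
            return x

        # Usage: append the final sigma = 0 before sampling.
        # sigmas = torch.cat([edm_sigmas(18), torch.zeros(1)])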
    Neural Improvement Heuristics for Preference Ranking. (arXiv:2206.00383v1 [cs.AI])
    In recent years, Deep Learning based methods have been a revolution in the field of combinatorial optimization. They learn to approximate solutions and constitute an interesting choice when dealing with repetitive problems drawn from similar distributions. Most effort has been devoted to investigating neural constructive methods, while the works that propose neural models to iteratively improve a candidate solution are less frequent. In this paper, we present a Neural Improvement (NI) model for graph-based combinatorial problems that, given an instance and a candidate solution, encodes the problem information by means of edge features. Our model proposes a modification on the pairwise precedence of items to increase the quality of the solution. We demonstrate the practicality of the model by applying it as the building block of a Neural Hill Climber and other trajectory-based methods. The algorithms are used to solve the Preference Ranking Problem and results show that they outperform conventional alternatives in simulated and real-world data. Conducted experiments also reveal that the proposed model can be a milestone in the development of efficiently guided trajectory-based optimization algorithms.  ( 2 min )
    Byzantine-Robust Online and Offline Distributed Reinforcement Learning. (arXiv:2206.00165v1 [cs.LG])
    We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, an $\alpha$-fraction of agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude and their fake data can be of any size. We desire to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the robust mean estimation from batches problem, that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting, we design a Byzantine-robust distributed pessimistic value iteration algorithm; in the online setting, we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve stronger robustness guarantees than prior works.  ( 2 min )
    Pre-training via Denoising for Molecular Property Prediction. (arXiv:2206.00133v1 [cs.LG])
    Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique that utilizes large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Inspired by recent advances in noise regularization, our pre-training objective is based on denoising. Relying on the well-known link between denoising autoencoders and score-matching, we also show that the objective corresponds to learning a molecular force field -- arising from approximating the physical state distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.  ( 2 min )
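    A minimal sketch of the denoising pre-training objective: perturb equilibrium coordinates with Gaussian noise and regress the noise, which by the score-matching link amounts to learning a force field (the model interface is an assumption):

        import torch

        def denoising_loss(model, coords, sigma=0.1):
            """coords: (num_atoms, 3) equilibrium positions; the network is
            trained to predict the injected noise from the perturbed structure."""
            noise = sigma * torch.randn_like(coords)
            pred = model(coords + noise)        # predicted noise
            return ((pred - noise) ** 2).mean()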
    Optimization with access to auxiliary information. (arXiv:2206.00395v1 [cs.LG])
    We investigate the fundamental optimization question of minimizing a target function $f(x)$ whose gradients are expensive to compute or have limited availability, given access to some auxiliary side function $h(x)$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, iv) training with compressed models/dropout, etc. We propose two generic new algorithms that are applicable in all these settings, and prove, using only an assumption on the Hessian similarity between the target and side function, that we can benefit from this framework.  ( 2 min )
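    One natural instantiation of this framework is a variance-reduction-style correction that leans on the cheap side gradients and periodically re-anchors them with a target gradient; a sketch under the stated Hessian-similarity assumption (names and step size are illustrative):

        def aux_corrected_step(x, anchor, grad_f_anchor, grad_h, lr=0.1):
            """d = grad_h(x) - grad_h(anchor) + grad_f(anchor): if f and h have
            similar Hessians, d tracks grad_f(x) while grad_f itself is only
            evaluated at the occasional anchor point."""
            d = grad_h(x) - grad_h(anchor) + grad_f_anchor
            return x - lr * d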
    Hands-Up: Leveraging Synthetic Data for Hands-On-Wheel Detection. (arXiv:2206.00148v1 [cs.CV])
    Over the past few years there has been major progress in the field of synthetic data generation using simulation based techniques. These methods use high-end graphics engines and physics-based ray-tracing rendering in order to represent the world in 3D and create highly realistic images. Datagen has specialized in the generation of high-quality 3D humans, realistic 3D environments and generation of realistic human motion. This technology has been developed into a data generation platform which we used for these experiments. This work demonstrates the use of synthetic photo-realistic in-cabin data to train a Driver Monitoring System that uses a lightweight neural network to detect whether the driver's hands are on the wheel. We demonstrate that when only a small amount of real data is available, synthetic data can be a simple way to boost performance. Moreover, we adopt the data-centric approach and show how performing error analysis and generating the missing edge-cases in our platform boosts performance. This showcases the ability of human-centric synthetic data to generalize well to the real world, and help train algorithms in computer vision settings where data from the target domain is scarce or hard to collect.  ( 2 min )
    Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble. (arXiv:2206.00238v1 [cs.LG])
    Inverse reinforcement learning (IRL) recovers the underlying reward function from expert demonstrations. A generalizable reward function is all the more desirable, as it captures the fundamental motivation of the expert. However, classical IRL methods can only recover reward functions coupled with the training dynamics, and thus generalize poorly to a changed environment. Previous dynamics-agnostic reward learning methods rely on strict assumptions, such as that the reward function has to be state-only. This work proposes a general approach to learning transferable reward functions, Dynamics-Agnostic Discriminator-Ensemble Reward Learning (DARL). Following the adversarial imitation learning (AIL) framework, DARL learns a dynamics-agnostic discriminator on a latent space mapped from the original state-action space. The latent space is learned to contain the least information about the dynamics. Moreover, to reduce the reliance of the discriminator on policies, the reward function is represented as an ensemble of the discriminators during training. We assess DARL on four MuJoCo tasks with dynamics transfer. Empirical results, compared with the state-of-the-art AIL methods, show that DARL can learn a reward that is more consistent with the true reward, thus obtaining higher environment returns.  ( 2 min )
    A Survey on Deep Learning for Skin Lesion Segmentation. (arXiv:2206.00356v1 [eess.IV])
    Skin cancer is a major public health problem that could benefit from computer-aided diagnosis to reduce the burden of this common disease. Skin lesion segmentation from images is an important step toward achieving this goal. However, the presence of natural and artificial artifacts (e.g., hair and air bubbles), intrinsic factors (e.g., lesion shape and contrast), and variations in image acquisition conditions make skin lesion segmentation a challenging task. Recently, various researchers have explored the applicability of deep learning models to skin lesion segmentation. In this survey, we cross-examine 134 research papers that deal with deep learning based segmentation of skin lesions. We analyze these works along several dimensions, including input data (datasets, preprocessing, and synthetic data generation), model design (architecture, modules, and losses), and evaluation aspects (data annotation requirements and segmentation performance). We discuss these dimensions both from the viewpoint of select seminal works, and from a systematic viewpoint, examining how those choices have influenced current trends, and how their limitations should be addressed. We summarize all examined works in a comprehensive table to facilitate comparisons.  ( 2 min )
    Decompositional Generation Process for Instance-Dependent Partial Label Learning. (arXiv:2204.03845v2 [cs.LG] UPDATED)
    Partial label learning (PLL) is a typical weakly supervised learning problem, where each training example is associated with a set of candidate labels among which only one is true. Most existing PLL approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels and model the generation process of the candidate labels in a simple way. However, these approaches usually do not perform as well as expected due to the fact that the generation process of the candidate labels is always instance-dependent. Therefore, it deserves to be modeled in a refined way. In this paper, we consider instance-dependent PLL and assume that the generation process of the candidate labels can be decomposed into two sequential parts, where the correct label emerges first in the mind of the annotator, and then the incorrect labels related to the feature are also selected, together with the correct label, as candidate labels due to the uncertainty of labeling. Motivated by this consideration, we propose a novel PLL method that performs Maximum A Posteriori (MAP) estimation based on an explicitly modeled generation process of candidate labels via decomposed probability distribution models. Experiments on benchmark and real-world datasets validate the effectiveness of the proposed method.
    To the Fairness Frontier and Beyond: Identifying, Quantifying, and Optimizing the Fairness-Accuracy Pareto Frontier. (arXiv:2206.00074v1 [stat.ML])
    Algorithmic fairness has emerged as an important consideration when using machine learning to make high-stakes societal decisions. Yet, improved fairness often comes at the expense of model accuracy. While aspects of the fairness-accuracy tradeoff have been studied, most work reports the fairness and accuracy of various models separately; this makes model comparisons nearly impossible without a model-agnostic metric that reflects the balance of the two desiderata. We seek to identify, quantify, and optimize the empirical Pareto frontier of the fairness-accuracy tradeoff. Specifically, we identify and outline the empirical Pareto frontier through Tradeoff-between-Fairness-and-Accuracy (TAF) Curves; we then develop a metric to quantify this Pareto frontier through the weighted area under the TAF Curve, which we term the Fairness-Area-Under-the-Curve (FAUC). TAF Curves provide the first empirical, model-agnostic characterization of the Pareto frontier, while FAUC provides the first metric to impartially compare model families on both fairness and accuracy. Both TAF Curves and FAUC can be employed with all group fairness definitions and accuracy measures. Next, we ask: Is it possible to expand the empirical Pareto frontier and thus improve the FAUC for a given collection of fitted models? We answer affirmatively by developing a novel fair model stacking framework, FairStacks, that solves a convex program to maximize the accuracy of the model ensemble subject to a score-bias constraint. We show that optimizing with FairStacks always expands the empirical Pareto frontier and improves the FAUC; we additionally study other theoretical properties of our proposed approach. Finally, we empirically validate TAF, FAUC, and FairStacks through studies on several real benchmark data sets, showing that FairStacks leads to major improvements in FAUC that outperform existing algorithmic fairness approaches.
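    A rough sketch of how a weighted area under a TAF-style curve could be computed from (fairness, accuracy) points on the empirical frontier; the paper's exact weighting scheme may differ:

        import numpy as np

        def fauc(accuracy, fairness_violation, weights=None):
            """Trapezoidal (weighted) area under accuracy as a function of the
            fairness axis, over models sorted by fairness violation."""
            order = np.argsort(fairness_violation)
            x = np.asarray(fairness_violation)[order]
            y = np.asarray(accuracy)[order]
            w = np.ones_like(y) if weights is None else np.asarray(weights)[order]
            return np.trapz(w * y, x)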
    Exploratory Methods for Relation Discovery in Archival Data. (arXiv:2202.11361v2 [cs.LG] UPDATED)
    In this article we propose a holistic approach to discover relations in art historical communities and enrich historians' biographies and archival descriptions with graph patterns relevant to art historiographic enquiry. We use exploratory data analysis to detect patterns, we select features, and we use them to evaluate classification models to predict new relations, to be recommended to archivists during the cataloguing phase. Results show that relations based on biographical information can be addressed with higher precision than relations based on research topics or institutional relations. Deterministic and a priori rules present better results than probabilistic methods.
    Generative Modeling Helps Weak Supervision (and Vice Versa). (arXiv:2203.12023v3 [cs.LG] UPDATED)
    Many promising applications of supervised machine learning face hurdles in the acquisition of labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well-understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak supervision derived label estimate. Alignment of the two allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.
    ForestPrune: Compact Depth-Controlled Tree Ensembles. (arXiv:2206.00128v1 [stat.ML])
    Tree ensembles are versatile supervised learning algorithms that achieve state-of-the-art performance. These models are extremely powerful but can grow to enormous sizes. As a result, tree ensembles are often post-processed to reduce memory footprint and improve interpretability. In this paper, we present ForestPrune, a novel optimization framework that can post-process tree ensembles by pruning depth layers from individual trees. We also develop a new block coordinate descent method to efficiently obtain high-quality solutions to optimization problems under this framework. The number of nodes in a decision tree increases exponentially with tree depth, so pruning deep trees can drastically improve model parsimony. ForestPrune can substantially reduce the space complexity of an ensemble at a minimal cost to performance. The framework supports various weighting schemes and contains just a single hyperparameter to tune. In our experiments, we observe that ForestPrune can reduce model size 20-fold with negligible performance loss.
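    To illustrate the effect of depth pruning (this is not the paper's optimization framework, just the operation it reasons about), a sketch that evaluates a fitted scikit-learn forest as if each tree were truncated at a given depth:

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def predict_truncated(tree, X, max_depth):
            """Predict with a fitted sklearn tree as if pruned at max_depth:
            stop descending early and return the node's stored value."""
            t = tree.tree_
            out = np.empty(len(X))
            for r, x in enumerate(X):
                node, depth = 0, 0
                while t.children_left[node] != -1 and depth < max_depth:
                    left = x[t.feature[node]] <= t.threshold[node]
                    node = t.children_left[node] if left else t.children_right[node]
                    depth += 1
                out[r] = t.value[node].ravel()[0]
            return out

        # Usage: truncate every tree in a forest to depth 3 and average.
        X = np.random.rand(200, 5); y = X @ np.arange(5)
        rf = RandomForestRegressor(n_estimators=10).fit(X, y)
        pred = np.mean([predict_truncated(e, X, 3) for e in rf.estimators_], axis=0)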
    FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems. (arXiv:2206.00065v1 [cs.DC])
    Edge computing enables smart IoT-based systems via concurrent and continuous execution of latency-sensitive machine learning (ML) applications. These edge-based machine learning systems are often battery-powered (i.e., energy-limited). They use heterogeneous resources with diverse computing performance (e.g., CPU, GPU, and/or FPGAs) to fulfill the latency constraints of ML applications. The challenge is to allocate user requests for different ML applications on the Heterogeneous Edge Computing Systems (HEC) with respect to both the energy and latency constraints of these systems. To this end, we study and analyze resource allocation solutions that can increase the on-time task completion rate while considering the energy constraint. Importantly, we investigate edge-friendly (lightweight) multi-objective mapping heuristics that do not become biased toward a particular application type to achieve the objectives; instead, the heuristics consider "fairness" across the concurrent ML applications in their mapping decisions. Performance evaluations demonstrate that the proposed heuristic outperforms widely-used heuristics in heterogeneous systems in terms of the latency and energy objectives, particularly, at low to moderate request arrival rates. We observed 8.9% improvement in on-time task completion rate and 12.6% in energy-saving without imposing any significant overhead on the edge system.
    Star algorithm for NN ensembling. (arXiv:2206.00255v1 [cs.LG])
    Neural network ensembling is a common and robust way to increase model efficiency. In this paper, we propose a new neural network ensemble algorithm based on Audibert's empirical star algorithm. We provide an optimal theoretical minimax bound on the excess squared risk. Additionally, we empirically study this algorithm on regression and classification tasks and compare it to the most popular ensembling methods.
    An Indirect Rate-Distortion Characterization for Semantic Sources: General Model and the Case of Gaussian Observation. (arXiv:2201.12477v2 [cs.IT] UPDATED)
    A new source model, which consists of an intrinsic state part and an extrinsic observation part, is proposed and its information-theoretic characterization, namely its rate-distortion function, is defined and analyzed. Such a source model is motivated by the recent surge of interest in the semantic aspect of information: the intrinsic state corresponds to the semantic feature of the source, which in general is not observable but can only be inferred from the extrinsic observation. There are two distortion measures, one between the intrinsic state and its reproduction, and the other between the extrinsic observation and its reproduction. Under a given code rate, the tradeoff between these two distortion measures is characterized by the rate-distortion function, which is solved via the indirect rate-distortion theory and is termed the semantic rate-distortion function of the source. As an application of the general model and its analysis, the case of Gaussian extrinsic observation is studied, assuming a linear relationship between the intrinsic state and the extrinsic observation, under a quadratic distortion structure. The semantic rate-distortion function is shown to be the solution of a convex programming problem with respect to an error covariance matrix, and a reverse water-filling type of solution is provided when the model further satisfies a diagonalizability condition.
    RMT-Net: Reject-aware Multi-Task Network for Modeling Missing-not-at-random Data in Financial Credit Scoring. (arXiv:2206.00568v1 [cs.LG])
    In financial credit scoring, loan applications may be approved or rejected. We can only observe default/non-default labels for approved samples but have no observations for rejected samples, which leads to missing-not-at-random selection bias. Machine learning models trained on such biased data are inevitably unreliable. In this work, we find that the default/non-default classification task and the rejection/approval classification task are highly correlated, according to both real-world data study and theoretical analysis. Consequently, the learning of default/non-default can benefit from rejection/approval. Accordingly, we for the first time propose to model the biased credit scoring data with Multi-Task Learning (MTL). Specifically, we propose a novel Reject-aware Multi-Task Network (RMT-Net), which learns the task weights that control the information sharing from the rejection/approval task to the default/non-default task by a gating network based on rejection probabilities. RMT-Net leverages the relation between the two tasks that the larger the rejection probability, the more the default/non-default task needs to learn from the rejection/approval task. Furthermore, we extend RMT-Net to RMT-Net++ for modeling scenarios with multiple rejection/approval strategies. Extensive experiments are conducted on several datasets, and the results strongly verify the effectiveness of RMT-Net on both approved and rejected samples. In addition, RMT-Net++ further improves RMT-Net's performance.  ( 2 min )
    One Positive Label is Sufficient: Single-Positive Multi-Label Learning with Label Enhancement. (arXiv:2206.00517v1 [cs.LG])
    Multi-label learning (MLL) learns from examples each associated with multiple labels simultaneously, where the high cost of annotating all relevant labels for each training example is challenging for real-world applications. To cope with the challenge, we investigate single-positive multi-label learning (SPMLL), where each example is annotated with only one relevant label, and show that one can successfully learn a theoretically grounded multi-label classifier for the problem. In this paper, a novel SPMLL method named SMILE, i.e., Single-positive MultI-label learning with Label Enhancement, is proposed. Specifically, an unbiased risk estimator is derived, which is guaranteed to approximately converge to the optimal risk minimizer of fully supervised learning and shows that one positive label of each instance is sufficient to train the predictive model. Then, the corresponding empirical risk estimator is established via recovering the latent soft label as a label enhancement process, where the posterior density of the latent soft labels is approximated by a variational Beta density parameterized by an inference model. Experiments on benchmark datasets validate the effectiveness of the proposed method.
    Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training. (arXiv:2206.00621v1 [cs.CL])
    In this paper, we introduce Cross-View Language Modeling, a simple and effective language model pre-training framework that unifies cross-lingual cross-modal pre-training with shared architectures and objectives. Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework considers both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval datasets show that while conceptually simpler, CCLM significantly outperforms the prior state-of-the-art with an average absolute improvement of over 10%. Notably, CCLM is the first multi-lingual multi-modal model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.  ( 2 min )
    Attention-embedded Quadratic Network (Qttention) for Effective and Interpretable Bearing Fault Diagnosis. (arXiv:2206.00390v1 [cs.LG])
    Bearing fault diagnosis is of great importance to decrease the damage risk of rotating machines and further improve economic profits. Recently, machine learning, represented by deep learning, has made great progress in bearing fault diagnosis. However, applying deep learning to such a task still faces two major problems. On the one hand, deep learning loses its effectiveness when bearing data are noisy or big data are unavailable, making deep learning hard to implement in industrial fields. On the other hand, a deep network is notoriously a black box. It is difficult to know how a model classifies faulty signals from the normal ones and the physics principle behind the classification. To solve the effectiveness and interpretability issues, we prototype a convolutional network with recently-invented quadratic neurons. This quadratic-neuron-empowered network can handle noisy and small bearing data due to the strong feature representation ability of quadratic neurons. Moreover, we independently derive an attention mechanism from a quadratic neuron, referred to as qttention, by factorizing the learned quadratic function in analogy to attention, making the model with quadratic neurons inherently interpretable. Experiments on public datasets and our own demonstrate that the proposed network can facilitate effective and interpretable bearing fault diagnosis.
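    A sketch of the quadratic-neuron computation in the sense used here: the product of two linear terms plus a quadratic power term, whose factorization yields the qttention map (parameter names are illustrative):

        import torch

        def quadratic_neuron(x, Wr, br, Wg, bg, Wb, c):
            """y = (x @ Wr + br) * (x @ Wg + bg) + (x * x) @ Wb + c.
            The product of the two linear terms carries the feature
            interactions that the qttention factorization exposes."""
            return (x @ Wr + br) * (x @ Wg + bg) + (x * x) @ Wb + c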
    The robust way to stack and bag: the local Lipschitz way. (arXiv:2206.00513v1 [cs.LG])
    Recent research has established that the local Lipschitz constant of a neural network directly influences its adversarial robustness. We exploit this relationship to construct an ensemble of neural networks which not only improves the accuracy, but also provides increased adversarial robustness. The local Lipschitz constants for two different ensemble methods - bagging and stacking - are derived and the architectures best suited for ensuring adversarial robustness are deduced. The proposed ensemble architectures are tested on MNIST and CIFAR-10 datasets in the presence of white-box attacks, FGSM and PGD. The proposed architecture is found to be more robust than a) a single network and b) traditional ensemble methods.  ( 2 min )
    IDANI: Inference-time Domain Adaptation via Neuron-level Interventions. (arXiv:2206.00259v1 [cs.CL])
    Large pre-trained models are usually fine-tuned on downstream task data, and tested on unseen data. When the train and test data come from different domains, the model is likely to struggle, as it is not adapted to the test domain. We propose a new approach for domain adaptation (DA), using neuron-level interventions: We modify the representation of each test example in specific neurons, resulting in a counterfactual example from the source domain, which the model is more familiar with. The modified example is then fed back into the model. While most other DA methods are applied during training time, ours is applied during inference only, making it more efficient and applicable. Our experiments show that our method improves performance on unseen domains.  ( 2 min )
    Optimisation of Structured Neural Controller Based on Continuous-Time Policy Gradient. (arXiv:2201.06262v3 [cs.LG] UPDATED)
    This study presents a policy optimisation framework for structured nonlinear control of continuous-time (deterministic) dynamic systems. The proposed approach prescribes a structure for the controller based on relevant scientific knowledge (such as Lyapunov stability theory or domain experience) while considering the tunable elements inside the given structure as the point of parametrisation with neural networks. To optimise a cost represented as a function of the neural network weights, the proposed approach utilises the continuous-time policy gradient method based on adjoint sensitivity analysis as a means for correct and performant computation of the cost gradient. This enables combining the stability, robustness, and physical interpretability of an analytically-derived structure for the feedback controller with the representational flexibility and optimised resulting performance provided by machine learning techniques. Such a hybrid paradigm for fixed-structure control synthesis is particularly useful for optimising adaptive nonlinear controllers to achieve improved performance in online operation, an area where existing theory governs the design of the structure but offers little analytical understanding of how to tune the gains and the uncertainty-model basis functions that determine the performance characteristics. Numerical experiments on aerospace applications illustrate the utility of the structured nonlinear controller optimisation framework.
    Transfer without Forgetting. (arXiv:2206.00388v1 [cs.LG])
    This work investigates the entanglement between Continual Learning (CL) and Transfer Learning (TL). In particular, we shed light on the widespread application of network pretraining, highlighting that it is itself subject to catastrophic forgetting. Unfortunately, this issue leads to the under-exploitation of knowledge transfer during later tasks. On this ground, we propose Transfer without Forgetting (TwF), a hybrid Continual Transfer Learning approach building upon a fixed pretrained sibling network, which continuously propagates the knowledge inherent in the source domain through a layer-wise loss term. Our experiments indicate that TwF steadily outperforms other CL methods across a variety of settings, averaging a 4.81% gain in Class-Incremental accuracy over a variety of datasets and different buffer sizes.  ( 2 min )
    DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems. (arXiv:2206.00484v1 [cs.RO])
    Muscle-actuated organisms are capable of learning an unparalleled diversity of dexterous movements despite their vast number of muscles. Reinforcement learning (RL) on large musculoskeletal models, however, has not been able to show similar performance. We conjecture that ineffective exploration in large overactuated action spaces is a key problem. This is supported by the finding that common exploration noise strategies are inadequate in synthetic examples of overactuated systems. We identify differential extrinsic plasticity (DEP), a method from the domain of self-organization, as being able to induce state-space covering exploration within seconds of interaction. By integrating DEP into RL, we achieve fast learning of reaching and locomotion in musculoskeletal systems, outperforming current approaches in all considered tasks in sample efficiency and robustness.  ( 2 min )
    Operational Adaptation of DNN Classifiers using Elastic Weight Consolidation. (arXiv:2205.00147v2 [cs.LG] UPDATED)
    Autonomous systems (AS) often use Deep Neural Network (DNN) classifiers to allow them to operate in complex, high dimensional, non-linear, and dynamically changing environments. Due to the complexity of these environments, DNN classifiers may output misclassifications as they experience tasks in their operational environments that were not identified during development. Removing a system from operation and retraining it to include these new tasks becomes economically infeasible as the number of such ASs increases. Additionally, such misclassifications may cause financial loss and safety threats to the AS or to other operators in the environment. In this paper, we propose to reduce such threats by investigating how DNN classifiers can adapt their knowledge to learn new information in the AS's operational environment, using only a limited number of observations encountered sequentially during operation. This allows the AS to adapt to newly encountered information, increasing the AS's classification accuracy and hence its overall reliability. However, retraining DNNs on different observations than those used in prior training is known to cause catastrophic forgetting or significant model drift. We investigate how this problem can be controlled by using Elastic Weight Consolidation (EWC) whilst learning from limited new observations. We carry out experiments using original and noisy versions of the MNIST dataset to represent known and new information to DNN classifiers. Results show that using EWC is effective in controlling the process of adaptation to new information, and thus allows for reliable adaptation of ASs to new information in their operational environment.  ( 2 min )
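    For reference, a minimal sketch of the EWC penalty that controls the adaptation: each parameter is anchored to its pre-adaptation value in proportion to its estimated Fisher importance (lam and the interfaces are illustrative):

        import torch

        def ewc_penalty(model, old_params, fisher, lam=100.0):
            """Quadratic penalty keeping parameters close to their previous
            values, weighted per-parameter by Fisher importance."""
            loss = 0.0
            for name, p in model.named_parameters():
                loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
            return 0.5 * lam * loss

        # Total objective while learning from new observations:
        # loss = task_loss(model, batch) + ewc_penalty(model, old_params, fisher)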
    In the Eye of the Beholder: Robust Prediction with Causal User Modeling. (arXiv:2206.00416v1 [cs.LG])
    Accurately predicting the relevance of items to users is crucial to the success of many social platforms. Conventional approaches train models on logged historical data; but recommendation systems, media services, and online marketplaces all exhibit a constant influx of new content -- making relevancy a moving target, to which standard predictive models are not robust. In this paper, we propose a learning framework for relevance prediction that is robust to changes in the data distribution. Our key observation is that robustness can be obtained by accounting for how users causally perceive the environment. We model users as boundedly-rational decision makers whose causal beliefs are encoded by a causal graph, and show how minimal information regarding the graph can be used to contend with distributional changes. Experiments in multiple settings demonstrate the effectiveness of our approach.
    Calibrated Bagging Deep Learning for Image Semantic Segmentation: A Case Study on COVID-19 Chest X-ray Image. (arXiv:2206.00002v1 [eess.IV])
    Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes coronavirus disease 2019 (COVID-19). Imaging tests such as chest X-ray (CXR) and computed tomography (CT) can provide useful information to clinical staff for facilitating a diagnosis of COVID-19 in a more efficient and comprehensive manner. As a breakthrough in artificial intelligence (AI), deep learning has been applied to perform COVID-19 infection region segmentation and disease classification by analyzing CXR and CT data. However, the prediction uncertainty of deep learning models for these tasks, which is very important for safety-critical applications like medical image processing, has not been comprehensively investigated. In this work, we propose a novel ensemble deep learning model that integrates bagging deep learning and model calibration to not only enhance segmentation performance, but also reduce prediction uncertainty. The proposed method has been validated on a large dataset associated with CXR image segmentation. Experimental results demonstrate that the proposed method can improve segmentation performance, as well as decrease prediction uncertainties.  ( 2 min )
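    As a rough picture of how bagging and calibration compose, the sketch below averages temperature-scaled sigmoid probabilities from an ensemble of binary segmentation models; the per-pixel mean serves as the prediction and the ensemble spread as an uncertainty proxy. The shapes, the fixed `temperature`, and the sigmoid link are our assumptions, not the paper's exact design.

```python
import numpy as np

def calibrated_bagging(logit_maps, temperature=1.5):
    """Average temperature-scaled probabilities over an ensemble.

    logit_maps: array [n_models, H, W] of per-pixel logits (binary case).
    In practice `temperature` would be fit on a held-out calibration set.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logit_maps) / temperature))
    mean = probs.mean(axis=0)    # calibrated ensemble prediction
    spread = probs.std(axis=0)   # simple per-pixel uncertainty proxy
    return mean, spread
```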
    Good Intentions: Adaptive Parameter Servers via Intent Signaling. (arXiv:2206.00470v1 [cs.LG])
    Parameter servers (PSs) ease the implementation of distributed training for large machine learning (ML) tasks by providing primitives for shared parameter access. Especially for ML tasks that access parameters sparsely, PSs can achieve high efficiency and scalability. To do so, they employ a number of techniques -- such as replication or relocation -- to reduce communication cost and/or latency of parameter accesses. A suitable choice and parameterization of these techniques is crucial to realize these gains, however. Unfortunately, such choices depend on the task, the workload, and even individual parameters, they often require expensive upfront experimentation, and they are susceptible to workload changes. In this paper, we explore whether PSs can automatically adapt to the workload without any prior tuning. Our goals are to improve usability and to maintain (or even improve) efficiency. We propose (i) a novel intent signaling mechanism that acts as an enabler for adaptivity and naturally integrates into ML tasks, and (ii) a fully adaptive, zero-tuning PS called AdaPS based on this mechanism. Our experimental evaluation suggests that automatic adaptation to the workload is indeed possible: AdaPS matched or outperformed state-of-the-art PSs out of the box.
    Differentiable programming for functional connectomics. (arXiv:2206.00649v1 [q-bio.NC])
    Mapping the functional connectome has the potential to uncover key insights into brain organisation. However, existing workflows for functional connectomics are limited in their adaptability to new data, and principled workflow design is a challenging combinatorial problem. We introduce a new analytic paradigm and software toolbox that implements common operations used in functional connectomics as fully differentiable processing blocks. Under this paradigm, workflow configurations exist as reparameterisations of a differentiable functional that interpolates them. The differentiable program that we envision occupies a niche midway between traditional pipelines and end-to-end neural networks, combining the glass-box tractability and domain knowledge of the former with the amenability to optimisation of the latter. In this preliminary work, we provide a proof of concept for differentiable connectomics, demonstrating the capacity of our processing blocks both to recapitulate canonical knowledge in neuroscience and to make new discoveries in an unsupervised setting. Our differentiable modules are competitive with state-of-the-art methods in problem domains including functional parcellation, denoising, and covariance modelling. Taken together, our results and software demonstrate the promise of differentiable programming for functional connectomics.
    Normalization effects on shallow neural networks and related asymptotic expansions. (arXiv:2011.10487v3 [stat.ML] UPDATED)
    We consider shallow (single hidden layer) neural networks and characterize their performance when trained with stochastic gradient descent as the number of hidden units $N$ and gradient descent steps grow to infinity. In particular, we investigate the effect of different scaling schemes, which lead to different normalizations of the neural network, on the network's statistical output, closing the gap between the $1/\sqrt{N}$ and the mean-field $1/N$ normalization. We develop an asymptotic expansion for the neural network's statistical output pointwise with respect to the scaling parameter as the number of hidden units grows to infinity. Based on this expansion, we demonstrate mathematically that to leading order in $N$, there is no bias-variance trade-off, in that both bias and variance (both explicitly characterized) decrease as the number of hidden units increases and time grows. In addition, we show that to leading order in $N$, the variance of the neural network's statistical output decays as the implied normalization by the scaling parameter approaches the mean-field normalization. Numerical studies on the MNIST and CIFAR10 datasets show that test and train accuracy monotonically improve as the neural network's normalization gets closer to the mean-field normalization.
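    Concretely, the family of normalizations interpolating between $1/\sqrt{N}$ and $1/N$ can be written as $N^{-\gamma}$ with $\gamma \in [1/2, 1]$. A minimal numpy sketch of that parameterization (the tanh activation and Gaussian weights are illustrative choices, not the paper's exact setup):

```python
import numpy as np

def shallow_net(x, W, c, gamma):
    """f(x) = N^{-gamma} * sum_i c_i * tanh(w_i . x).

    gamma = 0.5 recovers the 1/sqrt(N) scaling; gamma = 1.0 is the
    mean-field normalization; values in between interpolate."""
    N = W.shape[0]
    return float(c @ np.tanh(W @ x)) / N**gamma

rng = np.random.default_rng(0)
N, d = 1000, 5
W, c, x = rng.normal(size=(N, d)), rng.normal(size=N), rng.normal(size=d)
for gamma in (0.5, 0.75, 1.0):
    print(gamma, shallow_net(x, W, c, gamma))
```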
    A Hybrid Architecture for Federated and Centralized Learning. (arXiv:2105.03288v3 [cs.LG] UPDATED)
    Many machine learning tasks rely on centralized learning (CL), which requires the transmission of local datasets from the clients to a parameter server (PS), entailing huge communication overhead. To overcome this, federated learning (FL) has been suggested as a promising tool, wherein the clients send only the model updates to the PS instead of the whole dataset. However, FL demands powerful computational resources from the clients. In practice, not all the clients have sufficient computational resources to participate in training. To address this common scenario, we propose a more efficient approach called hybrid federated and centralized learning (HFCL), wherein only the clients with sufficient resources employ FL, while the remaining ones send their datasets to the PS, which computes the model on their behalf. Then, the model parameters are aggregated at the PS. To improve the efficiency of dataset transmission, we propose two different techniques: i) increased computation-per-client and ii) sequential data transmission. Notably, the HFCL frameworks outperform FL with up to 20\% improvement in learning accuracy when only half of the clients perform FL, while incurring 50\% less communication overhead than CL, since all the clients collaborate on the learning process with their datasets.
    Trap of Feature Diversity in the Learning of MLPs. (arXiv:2112.00980v4 [cs.LG] UPDATED)
    In this paper, we focus on a typical two-phase phenomenon in the learning of multi-layer perceptrons (MLPs), and we aim to explain the reason for the decrease of feature diversity in the first phase. Specifically, it has been observed that, in the training of MLPs, the training loss does not decrease significantly until the second phase. We therefore explore why the diversity of features over different samples keeps decreasing in the first phase, which hurts the optimization of MLPs. We explain this phenomenon in terms of the learning dynamics of MLPs. Furthermore, we theoretically explain why four typical operations can alleviate the decrease of feature diversity.  ( 2 min )
    Bring Your Own Algorithm for Optimal Differentially Private Stochastic Minimax Optimization. (arXiv:2206.00363v1 [cs.LG])
    We study differentially private (DP) algorithms for smooth stochastic minimax optimization, with stochastic minimization as a byproduct. The holy grail of these settings is to guarantee the optimal trade-off between the privacy and the excess population loss, using an algorithm with a linear time-complexity in the number of training samples. We provide a general framework for solving differentially private stochastic minimax optimization (DP-SMO) problems, which enables the practitioners to bring their own base optimization algorithm and use it as a black-box to obtain the near-optimal privacy-loss trade-off. Our framework is inspired by the recently proposed Phased-ERM method [20] for nonsmooth differentially private stochastic convex optimization (DP-SCO), which exploits the stability of the empirical risk minimization (ERM) for the privacy guarantee. The flexibility of our approach enables us to sidestep the requirement that the base algorithm needs to have bounded sensitivity, and allows the use of sophisticated variance-reduced accelerated methods to achieve near-linear time-complexity. To the best of our knowledge, these are the first linear-time optimal algorithms, up to logarithmic factors, for smooth DP-SMO when the objective is (strongly-)convex-(strongly-)concave. Additionally, based on our flexible framework, we derive a new family of near-linear time algorithms for smooth DP-SCO with optimal privacy-loss trade-offs for a wider range of smoothness parameters compared to previous algorithms.
    Adaptive Online Learning of Quantum States. (arXiv:2206.00220v1 [cs.LG])
    In the fundamental problem of shadow tomography, the goal is to efficiently learn an unknown $d$-dimensional quantum state using projective measurements. However, it is rarely the case that the underlying state remains stationary: changes may occur due to measurements, environmental noise, or an underlying Hamiltonian state evolution. In this paper we adopt tools from adaptive online learning to learn a changing state, giving adaptive and dynamic regret bounds for online shadow tomography that are polynomial in the number of qubits and sublinear in the number of measurements. Our analysis utilizes tools from complex matrix analysis to cope with complex numbers, which may be of independent interest in online learning. In addition, we provide numerical experiments that corroborate our theoretical results.
    Predecessor Features. (arXiv:2206.00303v1 [cs.LG])
    Any reinforcement learning system must be able to identify which past events contributed to observed outcomes, a problem known as credit assignment. A common solution to this problem is to use an eligibility trace to assign credit to a recency-weighted set of experienced events. However, in many realistic tasks, the set of recently experienced events is only one of many possible sequences of events that could have preceded the current outcome. This suggests that reinforcement learning can be made more efficient by allowing credit assignment to any viable preceding state, rather than only those most recently experienced. Accordingly, we propose "Predecessor Features", an algorithm that achieves this richer form of credit assignment. By maintaining a representation that approximates the expected sum of past occupancies, our algorithm allows temporal difference (TD) errors to be propagated accurately to a larger number of predecessor states than conventional methods, greatly improving learning speed. Our algorithm can also be naturally extended from tabular state representations to feature representations, allowing for increased performance on a wide range of environments. We demonstrate several use cases for Predecessor Features and contrast its performance with other similar approaches.
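    A minimal tabular reading of the idea, with our own hedged update rules (the paper's exact algorithm may differ): learn P[s], an estimate of the expected eligibility trace given the current state, and spread each TD error over those expected predecessors instead of the single sampled trace.

```python
import numpy as np

def predecessor_td(env_step, n_states, n_steps,
                   alpha=0.1, beta=0.1, gamma=0.99, lam=0.9):
    """env_step(s) -> (next_state, reward) is a user-supplied simulator."""
    V = np.zeros(n_states)
    P = np.zeros((n_states, n_states))  # P[s] ~ E[trace | current state s]
    s = 0
    for _ in range(n_steps):
        s2, r = env_step(s)
        # bootstrap P toward: own feature + decayed trace of predecessor s
        target = np.eye(n_states)[s2] + gamma * lam * P[s]
        P[s2] += beta * (target - P[s2])
        delta = r + gamma * V[s2] - V[s]  # TD error observed at s
        V += alpha * delta * P[s]         # credit all likely predecessors
        s = s2
    return V, P
```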
    Reinforcement Learning with Algorithms from Probabilistic Structure Estimation. (arXiv:2103.08241v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) algorithms aim to learn optimal decisions in unknown environments through experience of taking actions and observing the rewards gained. In some cases, the environment is not influenced by the actions of the RL agent, in which case the problem can be modeled as a contextual multi-armed bandit and lightweight myopic algorithms can be employed. On the other hand, when the RL agent's actions affect the environment, the problem must be modeled as a Markov decision process and more complex RL algorithms are required which take the future effects of actions into account. Moreover, in practice, it is often unknown from the outset whether or not the agent's actions will impact the environment and it is therefore not possible to determine which RL algorithm is most fitting. In this work, we propose to avoid this difficult decision entirely and incorporate a choice mechanism into our RL framework. Rather than assuming a specific problem structure, we use a probabilistic structure estimation procedure based on a likelihood-ratio (LR) test to make a more informed selection of learning algorithm. We derive a sufficient condition under which myopic policies are optimal, present an LR test for this condition, and derive a bound on the regret of our framework. We provide examples of real-world scenarios where our framework is needed and provide extensive simulations to validate our approach.  ( 2 min )
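    A sketch of the kind of likelihood-ratio test involved: compare the fit of a transition model that ignores actions (the bandit hypothesis) against one conditioned on actions (the MDP hypothesis). The chi-square calibration and degrees of freedom below are textbook defaults, not necessarily the paper's exact statistic.

```python
import numpy as np
from scipy.stats import chi2

def lr_test(counts, alpha=0.05):
    """counts[s, a, s2]: observed transition counts.
    H0: next state independent of action; H1: full MDP."""
    S, A, _ = counts.shape
    n_sa = np.maximum(counts.sum(axis=2, keepdims=True), 1)
    pooled = counts.sum(axis=1, keepdims=True)             # transitions from s
    n_s = np.maximum(pooled.sum(axis=2, keepdims=True), 1)
    p1 = counts / n_sa                                     # MLE under H1
    p0 = pooled / n_s                                      # MLE under H0
    ll1 = (counts * np.log(np.where(p1 > 0, p1, 1))).sum()
    ll0 = (counts * np.log(np.where(p0 > 0, p0, 1))).sum()
    stat = 2 * (ll1 - ll0)
    df = S * (A - 1) * (S - 1)
    return stat, bool(stat > chi2.ppf(1 - alpha, df))      # True -> use MDP RL
```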
    Augmenting Message Passing by Retrieving Similar Graphs. (arXiv:2206.00362v1 [cs.LG])
    Graph Neural Networks (GNNs) are effective tools for graph representation learning. Most GNNs rely on a recursive neighborhood aggregation scheme, named message passing. In this paper, motivated by the success of retrieval-based models, we propose a non-parametric scheme called GraphRetrieval, in which similar training graphs associated with their ground-truth labels are retrieved to be jointly utilized with the input graph representation to complete various graph-based predictive tasks. In particular, we take a well-trained model with its parameters fixed and then we add an adapter based on self-attention with only a few trainable parameters per task to explicitly learn the interaction between an input graph and its retrieved similar graphs. Our experiments on 12 different datasets involving different tasks (classification and regression) show that GraphRetrieval is able to achieve substantial improvements on all twelve datasets compared to three strong GNN baseline models. Our work demonstrates that GraphRetrieval is a promising augmentation for message passing.
    Towards Generalisable Audio Representations for Audio-Visual Navigation. (arXiv:2206.00393v1 [cs.SD])
    In audio-visual navigation (AVN), an intelligent agent needs to navigate to a constantly sound-making object in complex 3D environments based on its audio and visual perceptions. While existing methods attempt to improve the navigation performance with carefully designed path planning or intricate task settings, none has improved the model generalisation on unheard sounds with task settings unchanged. We thus propose a contrastive learning-based method to tackle this challenge by regularising the audio encoder, where sound-agnostic, goal-driven latent representations can be learnt from various audio signals of different classes. In addition, we consider two data augmentation strategies to enrich the training sounds. We demonstrate that our designs can be easily added to existing AVN frameworks to obtain an immediate performance gain (13.4%$\uparrow$ in SPL on Replica and 12.2%$\uparrow$ in SPL on MP3D). Our project is available at https://AV-GeN.github.io/.  ( 2 min )
    Ultrahyperbolic Knowledge Graph Embeddings. (arXiv:2206.00449v1 [cs.LG])
    Recent knowledge graph (KG) embeddings have been advanced by hyperbolic geometry due to its superior capability for representing hierarchies. The topological structures of real-world KGs, however, are rather heterogeneous, i.e., a KG is composed of multiple distinct hierarchies and non-hierarchical graph structures. Therefore, a homogeneous (either Euclidean or hyperbolic) geometry is not sufficient for fairly representing such heterogeneous structures. To capture the topological heterogeneity of KGs, we present an ultrahyperbolic KG embedding (UltraE) in an ultrahyperbolic (or pseudo-Riemannian) manifold that seamlessly interleaves hyperbolic and spherical manifolds. In particular, we model each relation as a pseudo-orthogonal transformation that preserves the pseudo-Riemannian bilinear form. The pseudo-orthogonal transformation is decomposed into various operators (i.e., circular rotations, reflections and hyperbolic rotations), allowing for simultaneously modeling heterogeneous structures as well as complex relational patterns. Experimental results on three standard KGs show that UltraE outperforms previous Euclidean- and hyperbolic-based approaches.  ( 2 min )
    Efficient Scheduling of Data Augmentation for Deep Reinforcement Learning. (arXiv:2206.00518v1 [cs.LG])
    In deep reinforcement learning (RL), data augmentation is widely considered a tool to induce a set of useful priors about semantic consistency and to improve sample efficiency and generalization performance. However, even when the prior is useful for generalization, distilling it to the RL agent often interferes with RL training and degrades sample efficiency. Meanwhile, the agent is forgetful of the prior due to the non-stationary nature of RL. These observations suggest two extreme schedules of distillation: (i) over the entire training; or (ii) only at the end. Hence, we devise a stand-alone network distillation method to inject the consistency prior at any time (even after RL), and a simple yet efficient framework to automatically schedule the distillation. Specifically, the proposed framework first focuses on mastering the training environments regardless of generalization, by adaptively deciding which augmentation, if any, to use for training. After this, we add the distillation to extract the remaining benefits for generalization from all the augmentations, which requires no additional new samples. In our experiments, we demonstrate the utility of the proposed framework, in particular the variant that postpones the distillation to the end of RL training.  ( 2 min )
    Convolutional-Recurrent Neural Network Proxy for Robust Optimization and Closed-Loop Reservoir Management. (arXiv:2203.07524v2 [cs.LG] UPDATED)
    Production optimization under geological uncertainty is computationally expensive, as a large number of well control schedules must be evaluated over multiple geological realizations. In this work, a convolutional-recurrent neural network (CNN-RNN) proxy model is developed to predict well-by-well oil and water rates, for given time-varying well bottom-hole pressure (BHP) schedules, for each realization in an ensemble. This capability enables the estimation of the objective function and nonlinear constraint values required for robust optimization. The proxy model represents an extension of a recently developed long short-term memory (LSTM) RNN proxy designed to predict well rates for a single geomodel. A CNN is introduced here to process permeability realizations, and this provides the initial states for the RNN. The CNN-RNN proxy is trained using simulation results for 300 different sets of BHP schedules and permeability realizations. We demonstrate proxy accuracy for oil-water flow through multiple realizations of 3D multi-Gaussian permeability models. The proxy is then incorporated into a closed-loop reservoir management (CLRM) workflow, where it is used with particle swarm optimization and a filter-based method for nonlinear constraint satisfaction. History matching is achieved using an adjoint-gradient-based procedure. The proxy model is shown to perform well in this setting for five different (synthetic) `true' models. Improved net present value along with constraint satisfaction and uncertainty reduction are observed with CLRM. For the robust production optimization steps, the proxy provides O(100) runtime speedup over simulation-based optimization.  ( 2 min )
    Width is Less Important than Depth in ReLU Neural Networks. (arXiv:2202.03841v2 [cs.LG] UPDATED)
    We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks. We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.  ( 2 min )
    Task-Specific Expert Pruning for Sparse Mixture-of-Experts. (arXiv:2206.00277v1 [cs.LG])
    The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training and has achieved promising results due to its model capacity. However, with trillions of parameters, MoE is hard to deploy in cloud or mobile environments. The inference of MoE requires expert parallelism, which is not hardware-friendly and is expensive in communication. Especially for resource-limited downstream tasks, such a sparse structure has to sacrifice a lot of computing efficiency for limited performance gains. In this work, we observe that most experts contribute very little to MoE fine-tuning and inference. We further propose a general method to progressively drop the non-professional experts for the target downstream task, which preserves the benefits of MoE while reducing the MoE model into a single-expert dense model. Our experiments reveal that the fine-tuned single-expert model could preserve 99.3% of the benefits from MoE across six different types of tasks while enjoying 2x inference speed with no communication cost.  ( 2 min )
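    A hedged sketch of what utilization-based expert pruning could look like on top of a generic MoE layer: count how often the router selects each expert on the downstream task, then keep only the most-used ones (keeping one collapses the MoE into a dense model). The tensor shapes and the argmax routing proxy are our simplifications, not the paper's criterion.

```python
import torch

def prune_experts(gate_logits, experts, keep=1):
    """gate_logits: [n_tokens, n_experts] router scores collected while
    fine-tuning on the target task; experts: list of expert modules."""
    usage = torch.bincount(gate_logits.argmax(dim=-1),
                           minlength=len(experts)).float()
    kept = usage.topk(keep).indices.tolist()
    return [experts[i] for i in kept]  # keep=1 -> single-expert dense model
```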
    Adversarial Attacks on Gaussian Process Bandits. (arXiv:2110.08449v2 [stat.ML] UPDATED)
    Gaussian processes (GP) are a widely-adopted tool used to sequentially optimize black-box functions, where evaluations are costly and potentially noisy. Recent works on GP bandits have proposed to move beyond random noise and devise algorithms robust to adversarial attacks. This paper studies this problem from the attacker's perspective, proposing various adversarial attack methods with differing assumptions on the attacker's strength and prior information. Our goal is to understand adversarial attacks on GP bandits from theoretical and practical perspectives. We focus primarily on targeted attacks on the popular GP-UCB algorithm and a related elimination-based algorithm, based on adversarially perturbing the function $f$ to produce another function $\tilde{f}$ whose optima are in some target region $\mathcal{R}_{\rm target}$. Based on our theoretical analysis, we devise both white-box attacks (known $f$) and black-box attacks (unknown $f$), with the former including a Subtraction attack and Clipping attack, and the latter including an Aggressive subtraction attack. We demonstrate that adversarial attacks on GP bandits can succeed in forcing the algorithm towards $\mathcal{R}_{\rm target}$ even with a low attack budget, and we test our attacks' effectiveness on a diverse range of objective functions.  ( 2 min )
    An $\alpha$-No-Regret Algorithm For Graphical Bilinear Bandits. (arXiv:2206.00466v1 [cs.LG])
    We propose the first regret-based approach to the Graphical Bilinear Bandits problem, where $n$ agents in a graph play a stochastic bilinear bandit game with each of their neighbors. This setting reveals a combinatorial NP-hard problem that prevents the use of any existing regret-based algorithm in the (bi-)linear bandit literature. In this paper, we fill this gap and present the first regret-based algorithm for graphical bilinear bandits using the principle of optimism in the face of uncertainty. Theoretical analysis of this new method yields an upper bound of $\tilde{O}(\sqrt{T})$ on the $\alpha$-regret and evidences the impact of the graph structure on the rate of convergence. Finally, we show through various experiments the validity of our approach.  ( 2 min )
    Learning a performance metric of Buchberger's algorithm. (arXiv:2106.03676v2 [math.AC] UPDATED)
    What can be (machine) learned about the complexity of Buchberger's algorithm? Given a system of polynomials, Buchberger's algorithm computes a Gr\"obner basis of the ideal these polynomials generate using an iterative procedure based on multivariate long division. The runtime of each step of the algorithm is typically dominated by a series of polynomial additions, and the total number of these additions is a hardware independent performance metric that is often used to evaluate and optimize various implementation choices. In this work we attempt to predict, using just the starting input, the number of polynomial additions that take place during one run of Buchberger's algorithm. Good predictions are useful for quickly estimating difficulty and understanding what features make Gr\"obner basis computation hard. Our features and methods could also be used for value models in the reinforcement learning approach to optimize Buchberger's algorithm introduced in [Peifer, Stillman, and Halpern-Leistner, 2020]. We show that a multiple linear regression model built from a set of easy-to-compute ideal generator statistics can predict the number of polynomial additions somewhat well, better than an uninformed model, and better than regression models built on some intuitive commutative algebra invariants that are more difficult to compute. We also train a simple recursive neural network that outperforms these linear models. Our work serves as a proof of concept, demonstrating that predicting the number of polynomial additions in Buchberger's algorithm is a feasible problem from the point of view of machine learning.  ( 2 min )
    Fine Timing and Frequency Synchronization for MIMO-OFDM: An Extreme Learning Approach. (arXiv:2007.09248v5 [eess.SP] UPDATED)
    Multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) is a key technology component in the evolution towards cognitive radio (CR) in next-generation communication, in which the accuracy of timing and frequency synchronization significantly impacts the overall system performance. In this paper, we propose a novel scheme leveraging extreme learning machine (ELM) to achieve high-precision synchronization. Specifically, exploiting the preamble signals with synchronization offsets, two ELMs are incorporated into a traditional MIMO-OFDM system to estimate both the residual symbol timing offset (RSTO) and the residual carrier frequency offset (RCFO). The simulation results show that the performance of the proposed ELM-based synchronization scheme is superior to the traditional method under both additive white Gaussian noise (AWGN) and frequency selective fading channels. Furthermore, compared with existing machine learning-based techniques, the proposed method shows outstanding performance without requiring perfect channel state information (CSI) or prohibitive computational complexity. Finally, the proposed method is robust in terms of the choice of channel parameters (e.g., number of paths) and also in terms of "generalization ability" from a machine learning standpoint.  ( 2 min )
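    For readers unfamiliar with ELMs, the estimator family is simple: a random, untrained hidden layer followed by a closed-form least-squares fit of the output weights. A generic numpy sketch (in the paper's setting, X would hold preamble-derived features and T the RSTO/RCFO targets; those specifics are our assumptions):

```python
import numpy as np

def elm_fit(X, T, n_hidden=200, seed=0):
    """Random feature map + least-squares readout (classic ELM)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)        # hidden layer is never trained
    beta = np.linalg.pinv(H) @ T  # only the readout is fit
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```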
    Differentially Private Shapley Values for Data Evaluation. (arXiv:2206.00511v1 [cs.LG])
    The Shapley value has been proposed as a solution to many applications in machine learning, including for equitable valuation of data. Shapley values are computationally expensive and involve the entire dataset. The query for a point's Shapley value can also compromise the statistical privacy of other data points. We observe that in machine learning problems such as empirical risk minimization, and in many learning algorithms (such as those with uniform stability), a diminishing returns property holds, where marginal benefit per data point decreases rapidly with data sample size. Based on this property, we propose a new stratified approximation method called the Layered Shapley Algorithm. We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random samples of data and small ($O(\log n)$) coalitions to achieve the results with guaranteed probabilistic accuracy, and can be modified to incorporate differential privacy. Experimental results show that the algorithm correctly identifies high-value data points that improve validation accuracy, and that the differentially private evaluations preserve approximate ranking of data.  ( 2 min )
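    The diminishing-returns intuition suggests a stratified estimator along these lines: sample only small coalitions of each size and average the queried point's marginal contribution within each stratum. The uniform stratum weighting below is a simplification for illustration; the paper's weighting and privacy mechanism differ.

```python
import numpy as np

def layered_shapley(i, n, utility, max_size=10, samples_per_size=50, seed=0):
    """Estimate the value of data point i among n points.
    utility(subset) is user-supplied, e.g. validation accuracy of a model
    trained on `subset` (a list of point indices)."""
    rng = np.random.default_rng(seed)
    others = np.array([j for j in range(n) if j != i])
    phi = 0.0
    for k in range(max_size):  # only small coalition sizes are sampled
        marginals = [
            utility(list(S) + [i]) - utility(list(S))
            for S in (rng.choice(others, size=k, replace=False)
                      for _ in range(samples_per_size))
        ]
        phi += np.mean(marginals) / max_size  # uniform stratum weights
    return phi
```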
    Proximally Sensitive Error for Anomaly Detection and Feature Learning. (arXiv:2206.00506v1 [cs.CV])
    Mean squared error (MSE) is one of the most widely used metrics for expressing differences between multi-dimensional entities, including images. However, MSE is not locally sensitive, as it does not take into account the spatial arrangement of the (pixel) differences, which matters for structured data types like images. Such spatial arrangements carry information about the source of the differences; therefore, an error function that also incorporates the location of errors can lead to a more meaningful distance measure. We introduce Proximally Sensitive Error (PSE), through which we suggest that a regional emphasis in the error measure can 'highlight' semantic differences between images over syntactic/random deviations. We demonstrate that this emphasis can be leveraged for the task of anomaly/occlusion detection. We further explore its utility as a loss function to help a model focus on learning representations of semantic objects instead of minimizing syntactic reconstruction noise.  ( 2 min )
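    One way to realize such a regional emphasis, as a hedged sketch rather than the paper's exact definition: weight each squared pixel error by a smoothed local error density, so spatially clustered (semantic) differences dominate isolated pixel noise.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def proximally_sensitive_error(x, y, sigma=2.0):
    err = (x - y) ** 2
    local = gaussian_filter(err, sigma=sigma)  # local error density
    return float((local * err).mean())         # clustered errors dominate

# MSE would be err.mean(); here an isolated hot pixel contributes far less
# than the same total error concentrated in one region.
```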
    Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis. (arXiv:2206.00632v1 [math.OC])
    When solving finite-sum minimization problems, two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO), in which functions are sampled in cycles without replacement. Under a convenient stochastic noise approximation which holds experimentally, we study the stationary variances of the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations. To obtain our results, we study the power spectral density of the stochastic gradient noise sequences. Our analysis extends beyond SGD to SGD with momentum and to the stochastic Nesterov's accelerated gradient method. We perform experiments on quadratic objective functions to test the validity of our approximation and the correctness of our findings.  ( 2 min )
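    The ordering of stationary variances can be checked numerically on a toy finite sum. The sketch below compares with-replacement SGD against random reshuffling on $f(w) = \frac{1}{n}\sum_i \frac{1}{2}(w - b_i)^2$, measuring iterate variance at epoch boundaries; it is an illustration of the claim, not the paper's spectral-density analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lr, epochs = 32, 0.05, 4000
bs = rng.normal(size=n)  # component minimizers b_i

def stationary_var(reshuffle):
    w, tail = 0.0, []
    for ep in range(epochs):
        idx = rng.permutation(n) if reshuffle else rng.integers(0, n, n)
        for i in idx:
            w -= lr * (w - bs[i])   # gradient of 0.5*(w - b_i)^2
        if ep > epochs // 2:
            tail.append(w)          # sample after burn-in
    return np.var(tail)

print("SGD:", stationary_var(False), " SGD-RR:", stationary_var(True))
```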
    Verified Probabilistic Policies for Deep Reinforcement Learning. (arXiv:2201.03698v2 [cs.AI] UPDATED)
    Deep reinforcement learning is an increasingly popular technique for synthesising policies to control an agent's interaction with its environment. There is also growing interest in formally verifying that such policies are correct and execute safely. Progress has been made in this area by building on existing work for verification of deep neural networks and of continuous-state dynamical systems. In this paper, we tackle the problem of verifying probabilistic policies for deep reinforcement learning, which are used to, for example, tackle adversarial environments, break symmetries and manage trade-offs. We propose an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy's execution, and present techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement and probabilistic model checking. We implement our approach and illustrate its effectiveness on a selection of reinforcement learning benchmarks.  ( 2 min )
    The Representation Jensen-R\'enyi Divergence. (arXiv:2112.01583v4 [cs.LG] UPDATED)
    We introduce a divergence measure between data distributions based on operators in reproducing kernel Hilbert spaces defined by kernels. The empirical estimator of the divergence is computed using the eigenvalues of positive definite Gram matrices that are obtained by evaluating the kernel over pairs of data points. The new measure shares similar properties to Jensen-Shannon divergence. Convergence of the proposed estimators follows from concentration results based on the difference between the ordered spectrum of the Gram matrices and the integral operators associated with the population quantities. The proposed measure of divergence avoids the estimation of the probability distribution underlying the data. Numerical experiments involving comparing distributions and applications to sampling unbalanced data for classification show that the proposed divergence can achieve state of the art results.  ( 2 min )
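    Our reading of the estimator, as a hedged numpy sketch: build Gram matrices with a kernel, normalize by trace, compute matrix-based Rényi entropies from the eigenvalues, and take a Jensen-Shannon-style gap between the pooled-sample entropy and the mean of the individual entropies. The Gaussian kernel and the pooling-based mixture are illustrative choices.

```python
import numpy as np

def renyi_entropy(K, alpha=2.0):
    """Matrix-based Renyi entropy of a trace-normalized Gram matrix."""
    A = K / np.trace(K)
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2((lam ** alpha).sum()) / (1.0 - alpha)

def jr_divergence(X, Y, alpha=2.0, s=1.0):
    def gram(Z):
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * s * s))   # Gaussian kernel Gram matrix
    mix = renyi_entropy(gram(np.vstack([X, Y])), alpha)
    return mix - 0.5 * (renyi_entropy(gram(X), alpha)
                        + renyi_entropy(gram(Y), alpha))
```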
    A Near-Optimal Best-of-Both-Worlds Algorithm for Online Learning with Feedback Graphs. (arXiv:2206.00557v1 [cs.LG])
    We consider online learning with feedback graphs, a sequential decision-making framework where the learner's feedback is determined by a directed graph over the action set. We present a computationally efficient algorithm for learning in this framework that simultaneously achieves near-optimal regret bounds in both stochastic and adversarial environments. The bound against oblivious adversaries is $\tilde{O} (\sqrt{\alpha T})$, where $T$ is the time horizon and $\alpha$ is the independence number of the feedback graph. The bound against stochastic environments is $O\big( (\ln T)^2 \max_{S\in \mathcal I(G)} \sum_{i \in S} \Delta_i^{-1}\big)$ where $\mathcal I(G)$ is the family of all independent sets in a suitably defined undirected version of the graph and $\Delta_i$ are the suboptimality gaps. The algorithm combines ideas from the EXP3++ algorithm for stochastic and adversarial bandits and the EXP3.G algorithm for feedback graphs with a novel exploration scheme. The scheme, which exploits the structure of the graph to reduce exploration, is key to obtain best-of-both-worlds guarantees with feedback graphs. We also extend our algorithm and results to a setting where the feedback graphs are allowed to change over time.  ( 2 min )
    DeepCluE: Enhanced Image Clustering via Multi-layer Ensembles in Deep Neural Networks. (arXiv:2206.00359v1 [cs.CV])
    Deep clustering has recently emerged as a promising technique for complex image clustering. Despite the significant progress, previous deep clustering works mostly tend to construct the final clustering by utilizing a single layer of representation, e.g., by performing $K$-means on the last fully-connected layer or by associating some clustering loss to a specific layer. However, few of them have considered the possibilities and potential benefits of jointly leveraging multi-layer representations for enhancing the deep clustering performance. In light of this, this paper presents a Deep Clustering via Ensembles (DeepCluE) approach, which bridges the gap between deep clustering and ensemble clustering by harnessing the power of multiple layers in deep neural networks. Particularly, we utilize a weight-sharing convolutional neural network as the backbone, which is trained with both the instance-level contrastive learning (via an instance projector) and the cluster-level contrastive learning (via a cluster projector) in an unsupervised manner. Thereafter, multiple layers of feature representations are extracted from the trained network, upon which a set of diversified base clusterings can be generated via a highly efficient clusterer. Then, the reliability of the clusters in multiple base clusterings is automatically estimated by exploiting an entropy-based criterion, based on which the multiple base clusterings are further formulated into a weighted-cluster bipartite graph. By partitioning this bipartite graph via transfer cut, the final image clustering result can therefore be obtained. Experimental results on six image datasets confirm the advantages of our DeepCluE approach over the state-of-the-art deep clustering approaches.  ( 2 min )
    Social Bias Meets Data Bias: The Impacts of Labeling and Measurement Errors on Fairness Criteria. (arXiv:2206.00137v1 [cs.LG])
    Although many fairness criteria have been proposed to ensure that machine learning algorithms do not exhibit or amplify our existing social biases, these algorithms are trained on datasets that can themselves be statistically biased. In this paper, we investigate the robustness of a number of existing (demographic) fairness criteria when the algorithm is trained on biased data. We consider two forms of dataset bias: errors by prior decision makers in the labeling process, and errors in measurement of the features of disadvantaged individuals. We analytically show that some constraints (such as Demographic Parity) can remain robust when facing certain statistical biases, while others (such as Equalized Odds) are significantly violated if trained on biased data. We also analyze the sensitivity of these criteria and the decision maker's utility to biases. We provide numerical experiments based on three real-world datasets (the FICO, Adult, and German credit score datasets) supporting our analytical findings. Our findings present an additional guideline for choosing among existing fairness criteria, or for proposing new criteria, when available datasets may be biased.  ( 2 min )
    Learning Instance-Specific Data Augmentations. (arXiv:2206.00051v1 [cs.LG])
    Existing data augmentation methods typically assume independence between transformations and inputs: they use the same transformation distribution for all input instances. We explain why this can be problematic and propose InstaAug, a method for automatically learning input-specific augmentations from data. This is achieved by introducing an augmentation module that maps an input to a distribution over transformations. This is simultaneously trained alongside the base model in a fully end-to-end manner using only the training data. We empirically demonstrate that InstaAug learns meaningful augmentations for a wide range of transformation classes, which in turn provides better performance on supervised and self-supervised tasks compared with augmentations that assume input--transformation independence.  ( 2 min )
    Mario Plays on a Manifold: Generating Functional Content in Latent Space through Differential Geometry. (arXiv:2206.00106v1 [cs.LG])
    Deep generative models can automatically create content of diverse types. However, there are no guarantees that such content will satisfy the criteria necessary to present it to end-users and be functional, e.g. the generated levels could be unsolvable or incoherent. In this paper we study this problem from a geometric perspective, and provide a method for reliable interpolation and random walks in the latent spaces of Categorical VAEs based on Riemannian geometry. We test our method with "Super Mario Bros" and "The Legend of Zelda" levels, and against simpler baselines inspired by current practice. Results show that the geometry we propose is better able to interpolate and sample, reliably staying closer to parts of the latent space that decode to playable content.  ( 2 min )
    CASSOCK: Viable Backdoor Attacks against DNN in The Wall of Source-Specific Backdoor Defences. (arXiv:2206.00145v1 [cs.CR])
    Backdoor attacks have been a critical threat to deep neural networks (DNNs). However, most existing countermeasures focus on source-agnostic backdoor attacks (SABAs) and fail to defeat source-specific backdoor attacks (SSBAs). Compared to an SABA, an SSBA activates a backdoor when an input from attacker-chosen class(es) is stamped with an attacker-specified trigger, making it stealthier and thus able to evade most existing backdoor mitigations. Nonetheless, existing SSBAs face trade-offs between attack success rate (ASR, a backdoor is activated by a trigger input from a source class as expected) and false positive rate (FPR, a backdoor is activated unexpectedly by a trigger input from a non-source class). Significantly, they can still be effectively detected by state-of-the-art (SOTA) countermeasures targeting SSBAs. This work overcomes the efficiency and effectiveness deficiencies of existing SSBAs, thus bypassing the SOTA defences. The key insight is to construct the desired poisoned and cover data during backdoor training by characterising SSBAs in depth. Both kinds of data are samples with triggers: the cover/poisoned data from non-source/source class(es) hold ground-truth/target labels. Accordingly, two cover/poisoned data enhancements are developed, based on trigger style and trigger content respectively, coined CASSOCK. First, we leverage trigger patterns with discrepant transparency to craft cover/poisoned data, enforcing triggers with heterogeneous sensitivity on different classes. The second enhancement chooses the target class features as triggers to craft these samples, entangling trigger features heavily with the target class. Compared with existing SSBAs, CASSOCK-based attacks achieve higher ASR and lower FPR on four popular tasks: MNIST, CIFAR10, GTSRB, and LFW. More importantly, CASSOCK effectively evades three defences (SCAn, Februus and extended Neural Cleanse) that already defeat existing SSBAs.  ( 2 min )
    Easy Variational Inference for Categorical Models via an Independent Binary Approximation. (arXiv:2206.00093v1 [stat.ML])
    We pursue tractable Bayesian analysis of generalized linear models (GLMs) for categorical data. Thus far, GLMs are difficult to scale to more than a few dozen categories due to non-conjugacy or strong posterior dependencies when using conjugate auxiliary variable methods. We define a new class of GLMs for categorical data called categorical-from-binary (CB) models. Each CB model has a likelihood that is bounded by the product of binary likelihoods, suggesting a natural posterior approximation. This approximation makes inference straightforward and fast; using well-known auxiliary variables for probit or logistic regression, the product of binary models admits conjugate closed-form variational inference that is embarrassingly parallel across categories and invariant to category ordering. Moreover, an independent binary model simultaneously approximates multiple CB models. Bayesian model averaging over these can improve the quality of the approximation for any given dataset. We show that our approach scales to thousands of categories, outperforming posterior estimation competitors like Automatic Differentiation Variational Inference (ADVI) and No U-Turn Sampling (NUTS) in the time required to achieve fixed prediction quality.  ( 2 min )
    A Theoretical Framework for Inference Learning. (arXiv:2206.00164v1 [cs.NE])
    Backpropagation (BP) is the most successful and widely used algorithm in deep learning. However, the computations required by BP are challenging to reconcile with known neurobiology. This difficulty has stimulated interest in more biologically plausible alternatives to BP. One such algorithm is the inference learning algorithm (IL). IL has close connections to neurobiological models of cortical function and has achieved equal performance to BP on supervised learning and auto-associative tasks. In contrast to BP, however, the mathematical foundations of IL are not well-understood. Here, we develop a novel theoretical framework for IL. Our main result is that IL closely approximates an optimization method known as implicit stochastic gradient descent (implicit SGD), which is distinct from the explicit SGD implemented by BP. Our results further show how the standard implementation of IL can be altered to better approximate implicit SGD. Our novel implementation considerably improves the stability of IL across learning rates, which is consistent with our theory, as a key property of implicit SGD is its stability. We provide extensive simulation results that further support our theoretical interpretations and also demonstrate IL achieves quicker convergence when trained with small mini-batches while matching the performance of BP for large mini-batches.  ( 2 min )
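    The explicit/implicit distinction is easiest to see on a one-dimensional quadratic, where the implicit update has a closed form. This toy (ours, not the paper's) shows the stability property the abstract attributes to implicit SGD: the implicit iteration contracts for any learning rate, while the explicit one diverges once the rate is too large.

```python
# f(w) = 0.5 * a * (w - b)^2
# explicit SGD:  w' = w - lr * f'(w)   -> contraction factor (1 - lr*a)
# implicit SGD:  w' = w - lr * f'(w')  -> w' = (w + lr*a*b) / (1 + lr*a)
a, b, lr = 4.0, 1.0, 0.6          # lr*a = 2.4 > 2: explicit SGD is unstable
w_exp = w_imp = 5.0
for _ in range(20):
    w_exp = w_exp - lr * a * (w_exp - b)
    w_imp = (w_imp + lr * a * b) / (1 + lr * a)
print(w_exp, w_imp)               # explicit blows up; implicit -> b = 1.0
```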
    Interpretable Deep Learning Classifier by Detection of Prototypical Parts on Kidney Stones Images. (arXiv:2206.00252v1 [cs.CV])
    Identifying the type of kidney stones can allow urologists to determine their formation cause, improving the early prescription of appropriate treatments to diminish future relapses. However, currently, the associated ex-vivo diagnosis (known as morpho-constitutional analysis, MCA) is time-consuming, expensive, and requires a great deal of experience, as it involves a visual analysis component that is highly operator-dependent. Recently, machine learning methods have been developed for in-vivo endoscopic stone recognition. Shallow methods have been demonstrated to be reliable and interpretable but exhibit low accuracy, while deep learning-based methods yield high accuracy but are not explainable. However, high-stakes decisions require understandable computer-aided diagnosis (CAD) to suggest a course of action based on reasonable evidence, rather than merely prescribe one. Herein, we investigate means for learning part-prototypes (PPs) that enable interpretable models. Our proposal provides a classification for a kidney stone patch image together with explanations similar to those used in the MCA method.
    Discovering the Hidden Vocabulary of DALLE-2. (arXiv:2206.00169v1 [cs.LG])
    We discover that DALLE-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts. For example, it seems that \texttt{Apoploe vesrreaitais} means birds and \texttt{Contarra ccetnxniams luryca tanniounons} (sometimes) means bugs or pests. We find that these prompts are often consistent in isolation but also sometimes in combinations. We present our black-box method to discover words that seem random but have some correspondence to visual concepts. This creates important security and interpretability challenges.
    IGLU Gridworld: Simple and Fast Environment for Embodied Dialog Agents. (arXiv:2206.00142v1 [cs.LG])
    We present the IGLU Gridworld: a reinforcement learning environment for building and evaluating language-conditioned embodied agents in a scalable way. The environment features visual agent embodiment, interactive learning through collaboration, language-conditioned RL, and a combinatorially hard task space (3D block building).  ( 2 min )
    Privacy for Free: How does Dataset Condensation Help Privacy?. (arXiv:2206.00240v1 [cs.CR])
    To prevent unintentional data leakage, the research community has resorted to data generators that can produce differentially private data for model training. However, for the sake of data privacy, existing solutions suffer from either expensive training cost or poor generalization performance. Therefore, we raise the question whether training efficiency and privacy can be achieved simultaneously. In this work, we for the first time identify that dataset condensation (DC), which was originally designed for improving training efficiency, is also a better solution to replace traditional data generators for private data generation, thus providing privacy for free. To demonstrate the privacy benefit of DC, we build a connection between DC and differential privacy, and theoretically prove on linear feature extractors (later extended to non-linear feature extractors) that the existence of one sample has limited impact ($O(m/n)$) on the parameter distribution of networks trained on $m$ samples synthesized from $n$ ($n \gg m$) raw samples by DC. We also empirically validate the visual privacy and membership privacy of DC-synthesized data by launching both the loss-based and the state-of-the-art likelihood-based membership inference attacks. We envision this work as a milestone for data-efficient and privacy-preserving machine learning.
    Bounding Membership Inference. (arXiv:2202.12232v2 [cs.LG] UPDATED)
    Differential Privacy (DP) is the de facto standard for reasoning about the privacy guarantees of a training algorithm. Despite the empirical observation that DP reduces the vulnerability of models to existing membership inference (MI) attacks, a theoretical underpinning as to why this is the case is largely missing in the literature. In practice, this means that models need to be trained with DP guarantees that greatly decrease their accuracy. In this paper, we provide a tighter bound on the positive accuracy (i.e., attack precision) of any MI adversary when a training algorithm provides $\epsilon$-DP or $(\epsilon, \delta)$-DP. Our bound informs the design of a novel privacy amplification scheme, where an effective training set is sub-sampled from a larger set prior to the beginning of training, to greatly reduce the bound on MI accuracy. As a result, our scheme enables DP users to employ looser DP guarantees when training their model to limit the success of any MI adversary; this ensures that the model's accuracy is less impacted by the privacy guarantee. Finally, we discuss implications of our MI bound on the field of machine unlearning.
    Automatic differentiation of nonsmooth iterative algorithms. (arXiv:2206.00457v1 [math.OC])
    Differentiation along algorithms, i.e., piggyback propagation of derivatives, is now routinely used to differentiate iterative solvers in differentiable programming. Asymptotics is well understood for many smooth problems but the nondifferentiable case is hardly considered. Is there a limiting object for nonsmooth piggyback automatic differentiation (AD)? Does it have any variational meaning and can it be used effectively in machine learning? Is there a connection with classical derivative? All these questions are addressed under appropriate nonexpansivity conditions in the framework of conservative derivatives which has proved useful in understanding nonsmooth AD. For nonsmooth piggyback iterations, we characterize the attractor set of nonsmooth piggyback iterations as a set-valued fixed point which remains in the conservative framework. This has various consequences and in particular almost everywhere convergence of classical derivatives. Our results are illustrated on parametric convex optimization problems with forward-backward, Douglas-Rachford and Alternating Direction of Multiplier algorithms as well as the Heavy-Ball method.
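    In the smooth case, piggyback differentiation simply propagates a derivative recursion alongside the iterates: if $x_{k+1} = F(x_k, \theta)$, then $J_{k+1} = \partial_x F \cdot J_k + \partial_\theta F$. A one-dimensional sketch on a smooth toy problem (the paper's focus is the nonsmooth, conservative generalization of exactly this recursion):

```python
def piggyback(theta, x0=0.0, n_iter=200, lr=0.1):
    """Gradient step F(x, theta) = x - lr*(x - theta) on f = 0.5*(x-theta)^2,
    with the derivative J = dx_k/dtheta propagated alongside the iterates."""
    x, J = x0, 0.0
    for _ in range(n_iter):
        J = (1.0 - lr) * J + lr   # J <- dF/dx * J + dF/dtheta
        x = x - lr * (x - theta)
    return x, J                   # converges to (theta, 1.0)

print(piggyback(3.0))             # ~ (3.0, 1.0): dx*/dtheta = 1 at the fixed point
```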
    Deep learning pipeline for image classification on mobile phones. (arXiv:2206.00105v1 [eess.IV])
    This article proposes and documents a machine-learning framework and tutorial for classifying images using mobile phones. Compared to computers, the performance of deep learning models degrades when deployed on a mobile phone, and a systematic approach is required to find a model that performs optimally on both computers and mobile phones. By following the proposed pipeline, which consists of various computational tools, simple procedural recipes, and technical considerations, one can bring the power of deep learning medical image classification to mobile devices, potentially unlocking new domains of applications. The pipeline is demonstrated on four different publicly available datasets: COVID X-rays, COVID CT scans, leaves, and colorectal cancer. We used two application development frameworks: TensorFlow Lite (real-time testing) and Flutter (digital image testing) to test the proposed pipeline. We found that transferring deep learning models to a mobile phone is limited by hardware and that classification accuracy drops. To address this issue, we proposed this pipeline to find an optimized model for mobile phones. Finally, we discuss additional applications and computational concerns related to deploying deep-learning models on phones, including real-time analysis and image preprocessing. We believe the associated documentation and code can help physicians and medical experts develop medical image classification applications for distribution.  ( 2 min )
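    The mobile-deployment step of such a pipeline typically reduces to a short conversion script. A minimal TensorFlow Lite example (the toy model stands in for whichever trained classifier the pipeline selects; the quantization flag is one common option, not a prescription from the article):

```python
import tensorflow as tf

# Stand-in for the trained classifier produced earlier in the pipeline.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu",
                           input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()
with open("classifier.tflite", "wb") as f:
    f.write(tflite_model)  # ship this file inside the mobile app
```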
    Universal Early Warning Signals of Phase Transitions in Climate Systems. (arXiv:2206.00060v1 [physics.ao-ph])
    The potential for complex systems to exhibit tipping points in which an equilibrium state undergoes a sudden and potentially irreversible shift is well established, but prediction of these events using standard forecast modeling techniques is quite difficult. This has led to the development of an alternative suite of methods that seek to identify signatures of critical phenomena in data, which are expected to occur in advance of many classes of dynamical bifurcation. Crucially, the manifestations of these critical phenomena are generic across a variety of systems, meaning that data-intensive deep learning methods can be trained on (abundant) synthetic data and plausibly prove effective when transferred to (more limited) empirical data sets. This paper provides a proof of concept for this approach as applied to lattice phase transitions: a deep neural network trained exclusively on 2D Ising model phase transitions is tested on a number of real and simulated climate systems with considerable success. Its accuracy frequently surpasses that of conventional statistical indicators, with performance shown to be consistently improved by the inclusion of spatial indicators. Tools such as this may offer valuable insight into climate tipping events, as remote sensing measurements provide increasingly abundant data on complex geospatially-resolved Earth systems.  ( 2 min )
    Provably and Practically Efficient Neural Contextual Bandits. (arXiv:2206.00099v1 [stat.ML])
    We consider the neural contextual bandit problem. In contrast to the existing work which primarily focuses on ReLU neural nets, we consider a general set of smooth activation functions. Under this more general setting, (i) we derive non-asymptotic error bounds on the difference between an overparameterized neural net and its corresponding neural tangent kernel, (ii) we propose an algorithm with a provably sublinear regret bound that is also efficient in the finite regime as demonstrated by empirical studies. The non-asymptotic error bounds may be of broader interest as a tool to establish the relation between the smoothness of the activation functions in neural contextual bandits and the smoothness of the kernels in kernel bandits.  ( 2 min )
    Are classical neural networks quantum?. (arXiv:2206.00005v1 [cs.LG])
    Neural networks are being used to improve the probing of the state spaces of many-particle systems, as approximations to wavefunctions and in order to avoid the recurring sign problem of quantum Monte Carlo. One may ask whether the usual classical neural networks have some actual hidden quantum properties that make them such suitable tools for a highly coupled quantum problem. I discuss here what makes a system quantum and to what extent we can interpret a neural network as having quantum remnants.  ( 2 min )
    Semantically-enhanced Topic Recommendation System for Software Projects. (arXiv:2206.00085v1 [cs.SE])
    Software-related platforms have enabled their users to collaboratively label software entities with topics. Tagging software repositories with relevant topics can be exploited for facilitating various downstream tasks. For instance, a correct and complete set of topics assigned to a repository can increase its visibility. Consequently, this improves the outcome of tasks such as browsing, searching, navigation, and organization of repositories. Unfortunately, assigned topics are usually highly noisy, and some repositories do not have well-assigned topics. Thus, there have been efforts on recommending topics for software projects, however, the semantic relationships among these topics have not been exploited so far. We propose two recommender models for tagging software projects that incorporate the semantic relationship among topics. Our approach has two main phases; (1) we first take a collaborative approach to curate a dataset of quality topics specifically for the domain of software engineering and development. We also enrich this data with the semantic relationships among these topics and encapsulate them in a knowledge graph we call SED-KGraph. Then, (2) we build two recommender systems; The first one operates only based on the list of original topics assigned to a repository and the relationships specified in our knowledge graph. The second predictive model, however, assumes there are no topics available for a repository, hence it proceeds to predict the relevant topics based on both textual information of a software project and SED-KGraph. We built SED-KGraph in a crowd-sourced project with 170 contributors from both academia and industry. The experiment results indicate that our solutions outperform baselines that neglect the semantic relationships among topics by at least 25% and 23% in terms of ASR and MAP metrics.  ( 2 min )
    Extensive Study of Multiple Deep Neural Networks for Complex Random Telegraph Signals. (arXiv:2206.00086v1 [physics.app-ph])
    Time-fluctuating signals are ubiquitous and diverse in many physical, chemical, and biological systems, among which random telegraph signals (RTSs) refer to a series of instantaneous switching events between two discrete levels arising from single-particle movements. Reliable RTS analysis is a crucial prerequisite for identifying underlying mechanisms related to performance sensitivity. When numerous levels are involved, complex patterns of multilevel RTSs occur, making their quantitative analysis exponentially more difficult, and systematic approaches have so far remained elusive. Here, we present a three-step analysis protocol based on progressive knowledge transfer, where the outputs of each step are passed on to the subsequent step. In particular, to quantify complex RTSs, we build three deep neural network architectures suited to processing temporal data and demonstrate the models' accuracy extensively on a large dataset of different RTS types with controlled levels of background noise. Our protocol offers a structured scheme for quantifying complex RTSs, from which meaningful interpretation and inference can follow.  ( 2 min )
    End-to-end Optimization of Machine Learning Prediction Queries. (arXiv:2206.00136v1 [cs.DB])
    Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving significant opportunities for optimization unexplored. We present Raven, a production-ready system for optimizing prediction queries. Raven follows the enterprise architectural trend of collocating data and ML runtimes. It relies on a unified intermediate representation that captures both data and ML operators in a single graph structure to unlock two families of optimizations. First, it employs logical optimizations that pass information between the data part (and the properties of the underlying data) and the ML part to optimize each other. Second, it introduces logical-to-physical transformations that allow operators to be executed on different runtimes (relational, ML, and DNN) and hardware (CPU, GPU). Novel data-driven optimizations determine the runtime to be used for each part of the query to achieve optimal performance. Our evaluation shows that Raven improves performance of prediction queries on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems.  ( 2 min )
    Communication-efficient distributed eigenspace estimation with arbitrary node failures. (arXiv:2206.00127v1 [stat.ML])
    We develop an eigenspace estimation algorithm for distributed environments with arbitrary node failures, where a subset of computing nodes can return structurally valid but otherwise arbitrarily chosen responses. Notably, this setting encompasses several important scenarios that arise in distributed computing and data-collection environments such as silent/soft errors, outliers or corrupted data at certain nodes, and adversarial responses. Our estimator builds upon and matches the performance of a recently proposed non-robust estimator up to an additive $\tilde{O}(\sigma \sqrt{\alpha})$ error, where $\sigma^2$ is the variance of the existing estimator and $\alpha$ is the fraction of corrupted nodes.  ( 2 min )
    Generative Models with Information-Theoretic Protection Against Membership Inference Attacks. (arXiv:2206.00071v1 [cs.LG])
    Deep generative models, such as Generative Adversarial Networks (GANs), synthesize diverse high-fidelity data samples by estimating the underlying distribution of high dimensional data. Despite their success, GANs may disclose private information from the data they are trained on, making them susceptible to adversarial attacks such as membership inference attacks, in which an adversary aims to determine if a record was part of the training set. We propose an information theoretically motivated regularization term that prevents the generative model from overfitting to training data and encourages generalizability. We show that this penalty minimizes the Jensen-Shannon divergence between components of the generator trained on data with different membership, and that it can be implemented at low cost using an additional classifier. Our experiments on image datasets demonstrate that with the proposed regularization, which comes at only a small added computational cost, GANs are able to preserve privacy and generate high-quality samples that achieve better downstream classification performance compared to non-private and differentially private generative models.  ( 2 min )
    PandA: Unsupervised Learning of Parts and Appearances in the Feature Maps of GANs. (arXiv:2206.00048v1 [cs.CV])
    Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at: https://github.com/james-oldfield/PandA.  ( 2 min )
    Near-Optimal Collaborative Learning in Bandits. (arXiv:2206.00121v1 [cs.LG])
    This paper introduces a general multi-agent bandit model in which each agent faces a finite set of arms and may communicate with other agents through a central controller in order to identify, in pure exploration, or play, in regret minimization, its optimal arm. The twist is that the optimal arm for each agent is the arm with largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This makes communication between agents often necessary. This general setting allows us to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization (Shi et al., 2021). In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with a novel ingredient: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.  ( 2 min )
    A Cross-City Federated Transfer Learning Framework: A Case Study on Urban Region Profiling. (arXiv:2206.00007v1 [cs.LG])
    The data insufficiency problem (i.e., data missing and label scarcity issues), caused by inadequate services and infrastructures or unbalanced development levels of cities, has seriously affected urban computing tasks in real scenarios. Prior transfer learning methods suggest an elegant solution to data insufficiency, but each addresses only one kind of insufficiency issue and fails to handle both issues as they co-occur in the real world. In addition, cross-city transfer in existing methods overlooks inter-city data privacy, which is a public concern in practical applications. To address the above challenging problems, we propose a novel Cross-city Federated Transfer Learning framework (CcFTL) to cope with the data insufficiency and privacy problems. Concretely, CcFTL transfers the relational knowledge from multiple rich-data source cities to the target city. Besides, the model parameters specific to the target task are first trained on the source data and then fine-tuned to the target city by parameter transfer. With our adaptation of federated training and homomorphic encryption settings, CcFTL can effectively deal with the data privacy problem among cities. We take urban region profiling as an application of smart cities and evaluate the proposed method with a real-world study. The experiments demonstrate the notable superiority of our framework over several competitive state-of-the-art models.  ( 2 min )
    Principle of Relevant Information for Graph Sparsification. (arXiv:2206.00118v1 [cs.LG])
    Graph sparsification aims to reduce the number of edges of a graph while maintaining its structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard scalar random variable setting to structured data (i.e., graphs). Our Graph-PRI objective is achieved by operating on the graph Laplacian, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector $\mathbf{w}$. We provide both theoretical and empirical justifications on the validity of our Graph-PRI approach. We also analyze its analytical solutions in a few special cases. We finally present three representative real-world applications, namely graph sparsification, graph regularized multi-task learning, and medical imaging-derived brain network classification, to demonstrate the effectiveness, the versatility and the enhanced interpretability of our approach over prevalent sparsification techniques. Code of Graph-PRI is available at https://github.com/SJYuCNEL/PRI-Graphs  ( 2 min )
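    The key algebraic ingredient, expressing a subgraph Laplacian through an edge selection vector, is easy to state concretely. A minimal sketch (toy graph and variable names are ours): with signed incidence matrix $B$, the Laplacian of the subgraph selected by $\mathbf{w}$ is $L(\mathbf{w}) = B\,\mathrm{diag}(\mathbf{w})\,B^\top$.

        import numpy as np

        # Toy graph: 4 nodes, 5 undirected edges given as (u, v) pairs.
        edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]
        n, m = 4, len(edges)

        # Signed incidence matrix B (n x m): column e has +1 at u and -1 at v.
        B = np.zeros((n, m))
        for e, (u, v) in enumerate(edges):
            B[u, e], B[v, e] = 1.0, -1.0

        w = np.array([1.0, 0.0, 1.0, 1.0, 0.0])  # sparse edge-selection vector
        L_w = B @ np.diag(w) @ B.T               # Laplacian of the selected subgraph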
    Weight Set Decomposition for Weighted Rank Aggregation: An interpretable and visual decision support tool. (arXiv:2206.00001v1 [cs.IR])
    The problem of interpreting or aggregating multiple rankings is common to many real-world applications. Perhaps the simplest and most common approach is a weighted rank aggregation, wherein a (convex) weight is applied to each input ranking and then ordered. This paper describes a new tool for visualizing and displaying ranking information for the weighted rank aggregation method. Traditionally, the aim of rank aggregation is to summarize the information from the input rankings and provide one final ranking that hopefully represents a more accurate or truthful result than any one input ranking. While such an aggregated ranking is, and clearly has been, useful to many applications, it also obscures information. In this paper, we show the wealth of information that is available for the weighted rank aggregation problem due to its structure. We apply weight set decomposition to the set of convex multipliers, study the properties useful for understanding this decomposition, and visualize the indifference regions. This methodology turns information that is otherwise collapsed by the aggregated ranking into a useful, interpretable, and intuitive decision support tool. Included are multiple illustrative examples, along with heuristic and exact algorithms for computing the weight set decomposition.  ( 2 min )
    FiLM-Ensemble: Probabilistic Deep Learning via Feature-wise Linear Modulation. (arXiv:2206.00050v1 [cs.LG])
    The ability to estimate epistemic uncertainty is often crucial when deploying machine learning in the real world, but modern methods often produce overconfident, uncalibrated uncertainty predictions. A common approach to quantify epistemic uncertainty, usable across a wide class of prediction models, is to train a model ensemble. In a naive implementation, the ensemble approach has high computational cost and high memory demand. This challenges in particular modern deep learning, where even a single deep network is already demanding in terms of compute and memory, and has given rise to a number of attempts to emulate the model ensemble without actually instantiating separate ensemble members. We introduce FiLM-Ensemble, a deep, implicit ensemble method based on the concept of Feature-wise Linear Modulation (FiLM). That technique was originally developed for multi-task learning, with the aim of decoupling different tasks. We show that the idea can be extended to uncertainty quantification: by modulating the network activations of a single deep network with FiLM, one obtains a model ensemble with high diversity, and consequently well-calibrated estimates of epistemic uncertainty, with low computational overhead in comparison. Empirically, FiLM-Ensemble outperforms other implicit ensemble methods, and it comes very close to the upper bound of an explicit ensemble of networks (sometimes even beating it), at a fraction of the memory cost.  ( 2 min )
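    To make the mechanism concrete, here is a hedged PyTorch sketch of the core idea, per-member feature-wise affine modulation of a shared network's activations (class and argument names are ours; layer placement, initialization, and how members are batched follow the paper and are omitted here):

        import torch
        import torch.nn as nn

        class FiLMEnsembleLayer(nn.Module):
            # Each ensemble member owns a per-channel scale (gamma) and shift
            # (beta); all members share the underlying network weights.
            def __init__(self, channels, members):
                super().__init__()
                self.gamma = nn.Parameter(torch.ones(members, channels))
                self.beta = nn.Parameter(torch.zeros(members, channels))

            def forward(self, h, member):
                # h: (batch, channels) activations of the shared layer.
                return self.gamma[member] * h + self.beta[member]

        layer = FiLMEnsembleLayer(channels=64, members=4)
        h = torch.randn(8, 64)
        outs = torch.stack([layer(h, k) for k in range(4)])  # one pass per member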
    On Analyzing Generative and Denoising Capabilities of Diffusion-based Deep Generative Models. (arXiv:2206.00070v1 [cs.LG])
    Diffusion-based Deep Generative Models (DDGMs) offer state-of-the-art performance in generative modeling. Their main strength comes from their unique setup in which a model (the backward diffusion process) is trained to reverse the forward diffusion process, which gradually adds noise to the input signal. Although DDGMs are well studied, it is still unclear how the small amount of noise is transformed during the backward diffusion process. Here, we focus on analyzing this problem to gain more insight into the behavior of DDGMs and their denoising and generative capabilities. We observe a fluid transition point that changes the functionality of the backward diffusion process from generating a (corrupted) image from noise to denoising the corrupted image to the final sample. Based on this observation, we postulate to divide a DDGM into two parts: a denoiser and a generator. The denoiser could be parameterized by a denoising auto-encoder, while the generator is a diffusion-based model with its own set of parameters. We experimentally validate our proposition, showing its pros and cons.  ( 2 min )
    Online PAC-Bayes Learning. (arXiv:2206.00024v1 [cs.LG])
    Most PAC-Bayesian bounds hold in the batch learning setting where data is collected at once, prior to inference or prediction. This somewhat departs from many contemporary learning problems where data streams are collected and the algorithms must dynamically adjust. We prove new PAC-Bayesian bounds in this online learning framework, leveraging an updated definition of regret, and we revisit classical PAC-Bayesian results with a batch-to-online conversion, extending their remit to the case of dependent data. Our results hold for bounded losses, potentially \emph{non-convex}, paving the way to promising developments in online learning.  ( 2 min )
    Evolving Domain Generalization. (arXiv:2206.00047v1 [cs.LG])
    Domain generalization aims to learn a predictive model from multiple different but related source tasks that can generalize well to a target task without the need of accessing any target data. Existing domain generalization methods ignore the relationship between tasks, implicitly assuming that all the tasks are sampled from a stationary environment. Therefore, they can fail when deployed in an evolving environment. To this end, we formulate and study the \emph{evolving domain generalization} (EDG) scenario, which exploits not only the source data but also their evolving pattern to generate a model for the unseen task. Our theoretical result reveals the benefits of modeling the relation between two consecutive tasks by learning a globally consistent directional mapping function. In practice, our analysis also suggests solving the EDG problem in a meta-learning manner, which leads to \emph{directional prototypical network}, the first method for the EDG problem. Empirical evaluation on both synthetic and real-world data sets validates the effectiveness of our approach.  ( 2 min )
    COIN: Co-Cluster Infomax for Bipartite Graphs. (arXiv:2206.00006v1 [cs.LG])
    Bipartite graphs are powerful data structures to model interactions between two types of nodes, which have been used in a variety of applications, such as recommender systems, information retrieval, and drug discovery. A fundamental challenge for bipartite graphs is how to learn informative node embeddings. Despite the success of recent self-supervised learning methods on bipartite graphs, their objectives discriminate instance-wise positive and negative node pairs, which could contain cluster-level errors. In this paper, we introduce a novel co-cluster infomax (COIN) framework, which captures the cluster-level information by maximizing the mutual information of co-clusters. Different from previous infomax methods, which estimate mutual information by neural networks, COIN can easily calculate mutual information. Besides, COIN is an end-to-end co-clustering method which can be trained jointly with other objective functions and optimized via back-propagation. Furthermore, we provide theoretical analysis for COIN, proving that it effectively maximizes the mutual information of node embeddings and that its objective is upper-bounded by the prior distributions of the nodes. We extensively evaluate the proposed COIN framework on various benchmark datasets and tasks to demonstrate its effectiveness.  ( 2 min )
    Distributed Graph Neural Network Training with Periodic Historical Embedding Synchronization. (arXiv:2206.00057v1 [cs.LG])
    Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train a GNN on large graphs, which are prevalent in various applications such as social networks, recommender systems, and knowledge graphs. Traditional sampling-based methods accelerate GNN training by dropping edges and nodes, which impairs the graph integrity and model performance. In contrast, distributed GNN algorithms, which accelerate GNN training by utilizing multiple computing devices, can be classified into two types: "partition-based" methods enjoy low communication costs but suffer from information loss due to dropped edges, while "propagation-based" methods avoid information loss but suffer prohibitive communication overhead. To jointly address these problems, this paper proposes DIstributed Graph Embedding SynchronizaTion (DIGEST), a novel distributed GNN training framework that synergizes the complementary strengths of both categories of existing methods. During subgraph parallel training, we propose to let each device store the historical embeddings of its neighbors in other subgraphs. Therefore, our method neither discards any neighbors in other subgraphs nor updates them intensively. This effectively avoids (1) the intensive computation on explosively-increasing neighbors and (2) excessive communication across different devices. We prove that the approximation error induced by the staleness of historical embeddings can be upper bounded and does not affect the GNN model's expressiveness. More importantly, our convergence analysis demonstrates that DIGEST enjoys a state-of-the-art convergence rate. Extensive experimental evaluation on large, real-world graph datasets shows that DIGEST achieves up to $21.82\times$ speedup without compromising performance compared to state-of-the-art distributed GNN training frameworks.  ( 2 min )
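    A heavily hedged sketch of the historical-embedding idea (ours, not the DIGEST implementation; names and the synchronization scheme are assumptions): each device keeps a stale cache of embeddings from other partitions and refreshes it only periodically, instead of communicating every step.

        import torch

        class HistoricalCache:
            # Stale embeddings for all nodes, refreshed every `sync_every` steps.
            def __init__(self, num_nodes, dim, sync_every=10):
                self.h = torch.zeros(num_nodes, dim)
                self.sync_every = sync_every

            def maybe_sync(self, step, local_ids, local_h):
                if step % self.sync_every == 0:
                    # In a real system this would be an all-gather across devices.
                    self.h[local_ids] = local_h.detach()

            def lookup(self, neighbour_ids):
                # Cross-partition neighbours are read from the cache, so no
                # per-step communication is needed.
                return self.h[neighbour_ids]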
  • Open

    Asymptotics of Network Embeddings Learned via Subsampling. (arXiv:2107.02363v2 [stat.ML] UPDATED)
    Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence in terms of the latent parameters, which include the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.  ( 2 min )
    An $\alpha$-No-Regret Algorithm For Graphical Bilinear Bandits. (arXiv:2206.00466v1 [cs.LG])
    We propose the first regret-based approach to the Graphical Bilinear Bandits problem, where $n$ agents in a graph play a stochastic bilinear bandit game with each of their neighbors. This setting reveals a combinatorial NP-hard problem that prevents the use of any existing regret-based algorithm in the (bi-)linear bandit literature. In this paper, we fill this gap and present the first regret-based algorithm for graphical bilinear bandits using the principle of optimism in the face of uncertainty. Theoretical analysis of this new method yields an upper bound of $\tilde{O}(\sqrt{T})$ on the $\alpha$-regret and evidences the impact of the graph structure on the rate of convergence. Finally, we show through various experiments the validity of our approach.
    AgraSSt: Approximate Graph Stein Statistics for Interpretable Assessment of Implicit Graph Generators. (arXiv:2203.03673v2 [stat.ML] UPDATED)
    We propose and analyse a novel statistical procedure, coined AgraSSt, to assess the quality of graph generators that may not be available in explicit form. In particular, AgraSSt can be used to determine whether a learnt graph generating process is capable of generating graphs that resemble a given input graph. Inspired by Stein operators for random graphs, the key idea of AgraSSt is the construction of a kernel discrepancy based on an operator obtained from the graph generator. AgraSSt can provide interpretable criticisms for a graph generator training procedure and help identify reliable sample batches for downstream tasks. Using Stein's method we give theoretical guarantees for a broad class of random graph models. We provide empirical results on both synthetic input graphs with known graph generation procedures, and real-world input graphs that the state-of-the-art (deep) generative models for graphs are trained on.
    OOD Link Prediction Generalization Capabilities of Message-Passing GNNs in Larger Test Graphs. (arXiv:2205.15117v2 [cs.LG] UPDATED)
    This work provides the first theoretical study on the ability of graph Message Passing Neural Networks (gMPNNs) -- such as Graph Neural Networks (GNNs) -- to perform inductive out-of-distribution (OOD) link prediction tasks, where deployment (test) graph sizes are larger than training graphs. We first prove non-asymptotic bounds showing that link predictors based on permutation-equivariant (structural) node embeddings obtained by gMPNNs can converge to a random guess as test graphs get larger. We then propose a theoretically-sound gMPNN that outputs structural pairwise (2-node) embeddings and prove non-asymptotic bounds showing that, as test graphs grow, these embeddings converge to embeddings of a continuous function that retains its ability to predict links OOD. Empirical results on random graphs show agreement with our theoretical results.
    Asymptotic Properties for Bayesian Neural Network in Besov Space. (arXiv:2206.00241v1 [stat.ML])
    Neural networks have shown great predictive power when dealing with various unstructured data such as images and natural languages. The Bayesian neural network captures the uncertainty of prediction by putting a prior distribution on the parameters of the model and computing the posterior distribution. In this paper, we show that the Bayesian neural network using a spike-and-slab prior has consistency with a nearly minimax convergence rate when the true regression function is in the Besov space. Even when the smoothness of the regression function is unknown, the same posterior convergence rate holds, and thus the spike-and-slab prior is adaptive to the smoothness of the regression function. We also consider the shrinkage prior and show that it has the same convergence rate. In other words, we propose a practical Bayesian neural network with guaranteed asymptotic properties.
    Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis. (arXiv:2206.00632v1 [math.OC])
    When solving finite-sum minimization problems, two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO), in which functions are sampled in cycles without replacement. Under a convenient stochastic noise approximation which holds experimentally, we study the stationary variances of the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations. To obtain our results, we study the power spectral density of the stochastic gradient noise sequences. Our analysis extends beyond SGD to SGD with momentum and to the stochastic Nesterov accelerated gradient method. We perform experiments on quadratic objective functions to test the validity of our approximation and the correctness of our findings.
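    A toy version of the measurement is straightforward to reproduce. The sketch below (setup and constants are ours, not the paper's experiments) runs shuffle-once SGD on a quadratic finite sum and estimates the power spectral density of the stationary iterates; by Parseval's relation, the total power should roughly match the stationary variance.

        import numpy as np
        from scipy.signal import welch

        rng = np.random.default_rng(0)
        N, lr, T = 64, 0.05, 50_000
        a = rng.uniform(0.5, 1.5, N)          # component curvatures
        b = rng.normal(0.0, 1.0, N)           # component minimizers
        perm = rng.permutation(N)             # shuffle-once ordering

        x, xs = 0.0, []
        for t in range(T):
            i = perm[t % N]
            x -= lr * a[i] * (x - b[i])       # SGD step on component i
            xs.append(x)

        sig = np.array(xs[50 * N:])           # discard the transient
        sig = sig - sig.mean()
        f, Pxx = welch(sig, nperseg=1024)
        total_power = Pxx.sum() * (f[1] - f[0])
        print(total_power, sig.var())         # the two should roughly agree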
    Width is Less Important than Depth in ReLU Neural Networks. (arXiv:2202.03841v2 [cs.LG] UPDATED)
    We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks. We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.
    Top-down inference in an early visual cortex inspired hierarchical Variational Autoencoder. (arXiv:2206.00436v1 [q-bio.NC])
    Interpreting computations in the visual cortex as learning and inference in a generative model of the environment has received wide support both in neuroscience and cognitive science. However, hierarchical computations, a hallmark of visual cortical processing, have remained impervious to generative models because of a lack of adequate tools to address them. Here we capitalize on advances in Variational Autoencoders (VAEs) to investigate the early visual cortex with sparse coding hierarchical VAEs trained on natural images. We design alternative architectures that vary both in terms of the generative and the recognition components of the two-latent-layer VAE. We show that representations similar to those found in the primary and secondary visual cortices naturally emerge under mild inductive biases. Importantly, a nonlinear representation for texture-like patterns is a stable property of the high-level latent space, resistant to the specific architecture of the VAE and reminiscent of the secondary visual cortex. We show that a neuroscience-inspired choice of the recognition model, which features a top-down processing component, is critical for two signatures of computations with generative models: learning higher order moments of the posterior beyond the mean, and image inpainting. Patterns in higher order response statistics provide inspiration for neuroscience to interpret response correlations, and for machine learning to evaluate the learned representations through more detailed characterization of the posterior.
    A Kernelised Stein Statistic for Assessing Implicit Generative Models. (arXiv:2206.00149v1 [stat.ML])
    Synthetic data generation has become a key ingredient for training machine learning procedures, addressing tasks such as data augmentation, analysing privacy-sensitive data, or visualising representative samples. Assessing the quality of such synthetic data generators hence has to be addressed. As (deep) generative models for synthetic data often do not admit explicit probability distributions, classical statistical procedures for assessing model goodness-of-fit may not be applicable. In this paper, we propose a principled procedure to assess the quality of a synthetic data generator. The procedure is a kernelised Stein discrepancy (KSD)-type test which is based on a non-parametric Stein operator for the synthetic data generator of interest. This operator is estimated from samples which are obtained from the synthetic data generator and hence can be applied even when the model is only implicit. In contrast to classical testing, the sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed. Experimental results on synthetic distributions and trained generative models on synthetic and real datasets illustrate that the method shows improved power performance compared to existing approaches.
    Standardisation-function Kernel Stein Discrepancy: A Unifying View on Kernel Stein Discrepancy Tests for Goodness-of-fit. (arXiv:2106.12105v2 [stat.ME] UPDATED)
    Non-parametric goodness-of-fit testing procedures based on kernel Stein discrepancies (KSD) are promising approaches to validate general unnormalised distributions in various scenarios. Existing works have focused on studying kernel choices to boost test performance. However, the choices of (non-unique) Stein operators also have a considerable effect on test performance. Inspired by the standardisation technique that was originally developed to better derive approximation properties for normal distributions, we present a unifying framework, called standardisation-function kernel Stein discrepancy (Sf-KSD), to study different Stein operators in KSD-based tests for goodness-of-fit. We derive explicitly how the proposed framework relates to existing KSD-based tests and show that Sf-KSD can be used as a guide to develop novel kernel-based non-parametric tests on complex data scenarios, e.g. truncated distributions or compositional data. Experimental results demonstrate that the proposed tests control type-I error well and achieve higher test power than existing approaches.
    A model aggregation approach for high-dimensional large-scale optimization. (arXiv:2205.07525v2 [cs.LG] UPDATED)
    Bayesian optimization (BO) has been widely used in machine learning and simulation optimization. With the increase in computational resources and storage capacities in these fields, high-dimensional and large-scale problems are becoming increasingly common. In this study, we propose a model aggregation method in the Bayesian optimization (MamBO) algorithm for efficiently solving high-dimensional large-scale optimization problems. MamBO uses a combination of subsampling and subspace embeddings to collectively address high dimensionality and large-scale issues; in addition, a model aggregation method is employed to address the surrogate model uncertainty issue that arises when embedding is applied. This surrogate model uncertainty issue is largely ignored in the embedding literature and practice, and it is exacerbated when the problem is high-dimensional and data are limited. Our proposed model aggregation method reduces these lower-dimensional surrogate model risks and improves the robustness of the BO algorithm. We derive an asymptotic bound for the proposed aggregated surrogate model and prove the convergence of MamBO. Benchmark numerical experiments indicate that our algorithm achieves superior or comparable performance to other commonly used high-dimensional BO algorithms. Moreover, we apply MamBO to a cascade classifier of a machine learning algorithm for face detection, and the results reveal that MamBO finds settings that achieve higher classification accuracy than the benchmark settings and is computationally faster than other high-dimensional BO algorithms.
    Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation. (arXiv:2206.00270v1 [cs.LG])
    We study lifelong reinforcement learning (RL) in a regret minimization setting of linear contextual Markov decision process (MDP), where the agent needs to learn a multi-task policy while solving a streaming sequence of tasks. We propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks, which may be adaptively chosen based on the agent's past behaviors. Remarkably, our algorithm uses only sublinear number of planning calls, which means that the agent eventually learns a policy that is near optimal for multiple tasks (seen or unseen) without the need of deliberate planning. A key to this property is a new structural assumption that enables computation sharing across tasks during exploration. Specifically, for $K$ task episodes of horizon $H$, our algorithm has a regret bound $\tilde{\mathcal{O}}(\sqrt{(d^3+d^\prime d)H^4K})$ based on $\mathcal{O}(dH\log(K))$ number of planning calls, where $d$ and $d^\prime$ are the feature dimensions of the dynamics and rewards, respectively. This theoretical guarantee implies that our algorithm can enable a lifelong learning agent to accumulate experiences and learn to rapidly solve new tasks.
    Bring Your Own Algorithm for Optimal Differentially Private Stochastic Minimax Optimization. (arXiv:2206.00363v1 [cs.LG])
    We study differentially private (DP) algorithms for smooth stochastic minimax optimization, with stochastic minimization as a byproduct. The holy grail of these settings is to guarantee the optimal trade-off between the privacy and the excess population loss, using an algorithm with a linear time-complexity in the number of training samples. We provide a general framework for solving differentially private stochastic minimax optimization (DP-SMO) problems, which enables the practitioners to bring their own base optimization algorithm and use it as a black-box to obtain the near-optimal privacy-loss trade-off. Our framework is inspired by the recently proposed Phased-ERM method [20] for nonsmooth differentially private stochastic convex optimization (DP-SCO), which exploits the stability of the empirical risk minimization (ERM) for the privacy guarantee. The flexibility of our approach enables us to sidestep the requirement that the base algorithm needs to have bounded sensitivity, and allows the use of sophisticated variance-reduced accelerated methods to achieve near-linear time-complexity. To the best of our knowledge, these are the first linear-time optimal algorithms, up to logarithmic factors, for smooth DP-SMO when the objective is (strongly-)convex-(strongly-)concave. Additionally, based on our flexible framework, we derive a new family of near-linear time algorithms for smooth DP-SCO with optimal privacy-loss trade-offs for a wider range of smoothness parameters compared to previous algorithms.
    Graph Neural Networks are Dynamic Programmers. (arXiv:2203.15544v2 [cs.LG] UPDATED)
    Recent advances in neural algorithmic reasoning with graph neural networks (GNNs) are propped up by the notion of algorithmic alignment. Broadly, a neural network will be better at learning to execute a reasoning task (in terms of sample complexity) if its individual components align well with the target algorithm. Specifically, GNNs are claimed to align with dynamic programming (DP), a general problem-solving strategy which expresses many polynomial-time algorithms. However, has this alignment truly been demonstrated and theoretically quantified? Here we show, using methods from category theory and abstract algebra, that there exists an intricate connection between GNNs and DP, going well beyond the initial observations over individual algorithms such as Bellman-Ford. Exposing this connection, we easily verify several prior findings in the literature, produce better-grounded GNN architectures for edge-centric tasks, and demonstrate empirical results on the CLRS algorithmic reasoning benchmark. We hope our exposition will serve as a foundation for building stronger algorithmically aligned GNNs.
    Realistic Deep Learning May Not Fit Benignly. (arXiv:2206.00501v1 [cs.LG])
    Studies on benign overfitting provide insights into the success of overparameterized deep learning models. In this work, we examine the benign overfitting phenomenon in real-world settings. We find that for tasks such as training a ResNet model on the ImageNet dataset, the model does not fit benignly. To understand why benign overfitting fails in the ImageNet experiment, we analyze previous benign overfitting models under a more restrictive setup where the number of parameters is not significantly larger than the number of data points. Under this mild overparameterization setup, our analysis identifies a phase change: unlike in the heavy overparameterization setting, benign overfitting can now fail in the presence of label noise. Our study explains our empirical observations, and naturally leads to a simple technique known as self-training that can boost the model's generalization performance. Furthermore, our work highlights the importance of understanding implicit bias in underfitting regimes as a future direction.
    Higher-Order Attention Networks. (arXiv:2206.00606v1 [cs.LG])
    This paper introduces higher-order attention networks (HOANs), a novel class of attention-based neural networks defined on a generalized higher-order domain called a combinatorial complex (CC). Similar to hypergraphs, CCs admit arbitrary set-like relations between a collection of abstract entities. Simultaneously, CCs permit the construction of hierarchical higher-order relations analogous to those supported by cell complexes. Thus, CCs effectively generalize both hypergraphs and cell complexes and combine their desirable characteristics. By exploiting the rich combinatorial nature of CCs, HOANs define a new class of message-passing attention-based networks that unifies higher-order neural networks. Our evaluation on tasks related to mesh shape analysis and graph learning demonstrates that HOANs attain competitive, and in some examples superior, predictive performance in comparison to state-of-the-art neural networks.
    Generative multitask learning mitigates target-causing confounding. (arXiv:2202.04136v2 [cs.LG] UPDATED)
    We propose a simple and scalable approach to causal representation learning for multitask learning. Our approach requires minimal modification to existing ML systems, and improves robustness to target shift. The improvement comes from mitigating unobserved confounders that cause the targets, but not the input. We refer to them as target-causing confounders. These confounders induce spurious dependencies between the input and targets. This poses a problem for the conventional approach to multitask learning, due to its assumption that the targets are conditionally independent given the input. Our proposed approach takes into account the dependencies between the targets in order to alleviate target-causing confounding. All that is required in addition to usual practice is to estimate the joint distribution of the targets to switch from discriminative to generative classification, and to predict all targets jointly. Our results on the Attributes of People and Taskonomy datasets reflect the conceptual improvement in robustness to target shift.
    Automatic Bounding Box Annotation with Small Training Data Sets for Industrial Manufacturing. (arXiv:2206.00280v1 [cs.CV])
    In the past few years, object detection has attracted a lot of attention in the context of human-robot collaboration and Industry 5.0 due to enormous quality improvements in deep learning technologies. In many applications, object detection models have to be able to quickly adapt to a changing environment, i.e., to learn new objects. A crucial but challenging prerequisite for this is the automatic generation of new training data which currently still limits the broad application of object detection methods in industrial manufacturing. In this work, we discuss how to adapt state-of-the-art object detection methods for the task of automatic bounding box annotation for the use case where the background is homogeneous and the object's label is provided by a human. We compare an adapted version of Faster R-CNN and the Scaled Yolov4-p5 architecture and show that both can be trained to distinguish unknown objects from a complex but homogeneous background using only a small amount of training data.
    Contextual Bandits with Knapsacks for a Conversion Model. (arXiv:2206.00314v1 [cs.LG])
    We consider contextual bandits with knapsacks, with an underlying structure between rewards generated and cost vectors suffered. We do so motivated by sales with commercial discounts. At each round, given the stochastic i.i.d. context $\mathbf{x}_t$ and the arm picked $a_t$ (corresponding, e.g., to a discount level), a customer conversion may be obtained, in which case a reward $r(a_t,\mathbf{x}_t)$ is gained and vector costs $c(a_t,\mathbf{x}_t)$ are suffered (corresponding, e.g., to losses of earnings). Otherwise, in the absence of a conversion, the reward and costs are null. The reward and costs achieved are thus coupled through the binary variable measuring conversion or the absence thereof. This underlying structure between rewards and costs is different from the linear structures considered by Agrawal and Devanur [2016], but we show that the techniques introduced in this article may also be applied to the latter case. Namely, the adaptive policies exhibited solve at each round a linear program based on upper-confidence estimates of the probabilities of conversion given $a$ and $\mathbf{x}$. This kind of policy is most natural and achieves a regret bound of the typical order $(\mathrm{OPT}/B)\sqrt{T}$, where $B$ is the total budget allowed, OPT is the optimal expected reward achievable by a static policy, and $T$ is the number of rounds.
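    The per-round computation reduces to a small linear program. A hedged sketch (arm values and the budget normalization below are illustrative, not taken from the paper): choose a distribution over arms that maximizes UCB-estimated rewards subject to the expected cost staying within the per-round budget.

        import numpy as np
        from scipy.optimize import linprog

        K = 4
        r_ucb = np.array([0.5, 0.7, 0.9, 1.1])   # upper-confidence reward per arm
        c_hat = np.array([0.1, 0.2, 0.4, 0.7])   # estimated cost per arm
        budget_rate = 0.3                        # B / T, budget per round

        res = linprog(
            c=-r_ucb,                            # linprog minimizes, so negate
            A_ub=c_hat[None, :], b_ub=[budget_rate],
            A_eq=np.ones((1, K)), b_eq=[1.0],
            bounds=[(0.0, 1.0)] * K,
        )
        p = res.x                                # randomized policy for this round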
    On Quantum Circuits for Discrete Graphical Models. (arXiv:2206.00398v1 [quant-ph])
    Graphical models are useful tools for describing structured high-dimensional probability distributions. Development of efficient algorithms for generating unbiased and independent samples from graphical models remains an active research topic. Sampling from graphical models that describe the statistics of discrete variables is a particularly challenging problem, which is intractable in the presence of high dimensions. In this work, we provide the first method that allows one to provably generate unbiased and independent samples from general discrete factor models with a quantum circuit. Our method is compatible with multi-body interactions and its success probability does not depend on the number of variables. To this end, we identify a novel embedding of the graphical model into unitary operators and provide rigorous guarantees on the resulting quantum state. Moreover, we prove a unitary Hammersley-Clifford theorem -- showing that our quantum embedding factorizes over the cliques of the underlying conditional independence structure. Importantly, the quantum embedding allows for maximum likelihood learning as well as maximum a posteriori state approximation via state-of-the-art hybrid quantum-classical methods. Finally, the proposed quantum method can be implemented on current quantum processors. Experiments with quantum simulation as well as actual quantum hardware show that our method can carry out sampling and parameter learning on quantum computers.
    ForestPrune: Compact Depth-Controlled Tree Ensembles. (arXiv:2206.00128v1 [stat.ML])
    Tree ensembles are versatile supervised learning algorithms that achieve state-of-the-art performance. These models are extremely powerful but can grow to enormous sizes. As a result, tree ensembles are often post-processed to reduce memory footprint and improve interpretability. In this paper, we present ForestPrune, a novel optimization framework that can post-process tree ensembles by pruning depth layers from individual trees. We also develop a new block coordinate descent method to efficiently obtain high-quality solutions to optimization problems under this framework. The number of nodes in a decision tree increases exponentially with tree depth, so pruning deep trees can drastically improve model parsimony. ForestPrune can substantially reduce the space complexity of an ensemble for a minimal cost to performance. The framework supports various weighting schemes and contains just a single hyperparameter to tune. In our experiments, we observe that ForestPrune can reduce model size 20-fold with negligible performance loss.
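    The effect of depth pruning is easy to emulate post hoc. Below is a hedged sketch (ours; ForestPrune chooses per-tree depths by solving an optimization problem with block coordinate descent, whereas here every tree is simply truncated at a fixed depth) that evaluates a fitted scikit-learn forest with descent stopped early:

        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor

        def predict_truncated(tree, X, max_depth):
            # Predict from a fitted sklearn tree, stopping descent at max_depth
            # and returning the value stored at the node reached.
            t = tree.tree_
            out = np.empty(len(X))
            for r, x in enumerate(X):
                node, depth = 0, 0
                while t.children_left[node] != -1 and depth < max_depth:
                    go_left = x[t.feature[node]] <= t.threshold[node]
                    node = t.children_left[node] if go_left else t.children_right[node]
                    depth += 1
                out[r] = t.value[node][0, 0]
            return out

        X, y = make_regression(n_samples=200, n_features=5, random_state=0)
        forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
        pred = np.mean([predict_truncated(est, X, max_depth=3)
                        for est in forest.estimators_], axis=0)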
    Byzantine-Robust Online and Offline Distributed Reinforcement Learning. (arXiv:2206.00165v1 [cs.LG])
    We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, an $\alpha$-fraction of agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude, and their fake data can be of any size. We desire to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the robust mean estimation from batches problem, that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting, we design a Byzantine-robust distributed pessimistic value iteration algorithm; in the online setting, we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve stronger robustness guarantees than prior works.
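    For intuition about the robust-mean-from-batches subproblem, a classical baseline is a coordinate-wise median of per-agent batch means; a hedged sketch is below (Weighted-Clique is considerably more sophisticated and, unlike this simple baseline, handles arbitrary batch sizes with near-optimal guarantees):

        import numpy as np

        def median_of_batch_means(batches):
            # Coordinate-wise median of per-batch means: a simple estimator
            # that tolerates a minority of arbitrarily corrupted batches.
            means = np.array([np.mean(b, axis=0) for b in batches])
            return np.median(means, axis=0)

        rng = np.random.default_rng(0)
        batches = [rng.normal(0.0, 1.0, (20, 3)) for _ in range(8)]
        batches.append(np.full((5, 3), 100.0))   # one adversarial agent's batch
        print(median_of_batch_means(batches))    # stays near zero despite corruption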
    Predicting Political Ideology from Digital Footprints. (arXiv:2206.00397v1 [econ.GN])
    This paper proposes a new method to predict individual political ideology from digital footprints on one of the world's largest online discussion forums. We compiled a unique data set from the online discussion forum reddit that contains information on the political ideology of around 91,000 users, as well as records of their comment frequency and the comments' text corpus in over 190,000 different subforums of interest. Applying a set of statistical learning approaches, we show that information about activity in non-political discussion forums alone can very accurately predict a user's political ideology. Depending on the model, we are able to predict the economic dimension of ideology with an accuracy of up to 90.63% and the social dimension with an accuracy of up to 82.02%. In comparison, using the textual features from actual comments does not improve predictive accuracy. Our paper highlights the importance of revealed digital behaviour to complement stated preferences from digital communication when analysing human preferences and behaviour using online data.
    Adversarial Attacks on Gaussian Process Bandits. (arXiv:2110.08449v2 [stat.ML] UPDATED)
    Gaussian processes (GP) are a widely-adopted tool used to sequentially optimize black-box functions, where evaluations are costly and potentially noisy. Recent works on GP bandits have proposed to move beyond random noise and devise algorithms robust to adversarial attacks. This paper studies this problem from the attacker's perspective, proposing various adversarial attack methods with differing assumptions on the attacker's strength and prior information. Our goal is to understand adversarial attacks on GP bandits from theoretical and practical perspectives. We focus primarily on targeted attacks on the popular GP-UCB algorithm and a related elimination-based algorithm, based on adversarially perturbing the function $f$ to produce another function $\tilde{f}$ whose optima are in some target region $\mathcal{R}_{\rm target}$. Based on our theoretical analysis, we devise both white-box attacks (known $f$) and black-box attacks (unknown $f$), with the former including a Subtraction attack and Clipping attack, and the latter including an Aggressive subtraction attack. We demonstrate that adversarial attacks on GP bandits can succeed in forcing the algorithm towards $\mathcal{R}_{\rm target}$ even with a low attack budget, and we test our attacks' effectiveness on a diverse range of objective functions.
    Easy Variational Inference for Categorical Models via an Independent Binary Approximation. (arXiv:2206.00093v1 [stat.ML])
    We pursue tractable Bayesian analysis of generalized linear models (GLMs) for categorical data. Thus far, GLMs are difficult to scale to more than a few dozen categories due to non-conjugacy or strong posterior dependencies when using conjugate auxiliary variable methods. We define a new class of GLMs for categorical data called categorical-from-binary (CB) models. Each CB model has a likelihood that is bounded by the product of binary likelihoods, suggesting a natural posterior approximation. This approximation makes inference straightforward and fast; using well-known auxiliary variables for probit or logistic regression, the product of binary models admits conjugate closed-form variational inference that is embarrassingly parallel across categories and invariant to category ordering. Moreover, an independent binary model simultaneously approximates multiple CB models. Bayesian model averaging over these can improve the quality of the approximation for any given dataset. We show that our approach scales to thousands of categories, outperforming posterior estimation competitors like Automatic Differentiation Variational Inference (ADVI) and No U-Turn Sampling (NUTS) in the time required to achieve fixed prediction quality.
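    The one-vs-rest flavor of the approximation can be illustrated with a hedged scikit-learn sketch (this is a stand-in for the paper's conjugate variational inference, not a reimplementation of it): fit one binary model per category in parallel, then renormalize the binary probabilities into a categorical prediction.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                                   n_classes=4, random_state=0)
        K = 4
        # Each binary model can be fit independently (embarrassingly parallel).
        models = [LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int))
                  for k in range(K)]
        P = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
        P /= P.sum(axis=1, keepdims=True)      # categorical probabilities from binaries
        print((P.argmax(axis=1) == y).mean())  # sanity-check training accuracy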
    Multi-block Min-max Bilevel Optimization with Applications in Multi-task Deep AUC Maximization. (arXiv:2206.00260v1 [math.OC])
    In this paper, we study multi-block min-max bilevel optimization problems, where the upper level is a non-convex strongly-concave minimax objective, the lower level is a strongly convex objective, and there are multiple blocks of dual variables and lower-level problems. Due to the intertwined multi-block min-max bilevel structure, the computational cost at each iteration could be prohibitively high, especially with a large number of blocks. To tackle this challenge, we present a single-loop randomized stochastic algorithm, which requires updates for only a constant number of blocks at each iteration. Under some mild assumptions on the problem, we establish its sample complexity of $\mathcal{O}(1/\epsilon^4)$ for finding an $\epsilon$-stationary point. This matches the optimal complexity for solving stochastic nonconvex optimization under a general unbiased stochastic oracle model. Moreover, we provide two applications of the proposed method in multi-task deep AUC (area under ROC curve) maximization and multi-task deep partial AUC maximization. Experimental results validate our theory and demonstrate the effectiveness of our method on problems with hundreds of tasks.
    To the Fairness Frontier and Beyond: Identifying, Quantifying, and Optimizing the Fairness-Accuracy Pareto Frontier. (arXiv:2206.00074v1 [stat.ML])
    Algorithmic fairness has emerged as an important consideration when using machine learning to make high-stakes societal decisions. Yet, improved fairness often comes at the expense of model accuracy. While aspects of the fairness-accuracy tradeoff have been studied, most work reports the fairness and accuracy of various models separately; this makes model comparisons nearly impossible without a model-agnostic metric that reflects the balance of the two desiderata. We seek to identify, quantify, and optimize the empirical Pareto frontier of the fairness-accuracy tradeoff. Specifically, we identify and outline the empirical Pareto frontier through Tradeoff-between-Fairness-and-Accuracy (TAF) Curves; we then develop a metric to quantify this Pareto frontier through the weighted area under the TAF Curve, which we term the Fairness-Area-Under-the-Curve (FAUC). TAF Curves provide the first empirical, model-agnostic characterization of the Pareto frontier, while FAUC provides the first metric to impartially compare model families on both fairness and accuracy. Both TAF Curves and FAUC can be employed with all group fairness definitions and accuracy measures. Next, we ask: Is it possible to expand the empirical Pareto frontier and thus improve the FAUC for a given collection of fitted models? We answer affirmatively by developing a novel fair model stacking framework, FairStacks, that solves a convex program to maximize the accuracy of the model ensemble subject to a score-bias constraint. We show that optimizing with FairStacks always expands the empirical Pareto frontier and improves the FAUC; we additionally study other theoretical properties of our proposed approach. Finally, we empirically validate TAF, FAUC, and FairStacks through studies on several real benchmark data sets, showing that FairStacks leads to major improvements in FAUC that outperform existing algorithmic fairness approaches.
    To Collaborate or Not in Distributed Statistical Estimation with Resource Constraints?. (arXiv:2206.00111v1 [cs.DC])
    We study how the amount of correlation between observations collected by distinct sensors/learners affects data collection and collaboration strategies by analyzing Fisher information and the Cramer-Rao bound. In particular, we consider a simple setting wherein two sensors sample from a bivariate Gaussian distribution, which already motivates the adoption of various strategies, depending on the correlation between the two variables and resource constraints. We identify two particular scenarios: (1) where the knowledge of the correlation between samples cannot be leveraged for collaborative estimation purposes and (2) where the optimal data collection strategy involves investing scarce resources to collaboratively sample and transfer information that is not of immediate interest and whose statistics are already known, with the sole goal of increasing the confidence on an estimate of the parameter of interest. We discuss two applications, IoT DDoS attack detection and distributed estimation in wireless sensor networks, that may benefit from our results.
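    As a hedged worked example of the effect described above (an illustrative instantiation, not necessarily the paper's exact setting), suppose the goal is to estimate $\mu_1$ when $\mu_2$ and the covariance $\Sigma$ are already known:
    \[
    (X_1, X_2) \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix} \right), \qquad I(\mu_1) = \left[ \Sigma^{-1} \right]_{11} = \frac{1}{\sigma_1^2 (1 - \rho^2)}.
    \]
    Jointly sampling the second, already-characterized variable thus inflates the Fisher information on $\mu_1$ by a factor of $1/(1-\rho^2)$ over sampling $X_1$ alone, shrinking the Cramer-Rao bound accordingly; at $\rho = 0$ the side samples buy nothing, consistent with scenario (1).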
    Decentralized Competing Bandits in Non-Stationary Matching Markets. (arXiv:2206.00120v1 [stat.ML])
    Understanding the complex dynamics of two-sided online matching markets, where the demand-side agents compete to match with the supply-side (arms), has recently received substantial interest. To that end, in this paper, we introduce the framework of a decentralized two-sided matching market under non-stationary (dynamic) environments. We adhere to the serial dictatorship setting, where the demand-side agents have unknown and different preferences over the supply-side (arms), but the arms have fixed and known preferences over the agents. We propose and analyze a decentralized and asynchronous learning algorithm, namely Decentralized Non-stationary Competing Bandits (\texttt{DNCB}), where the agents play (restrictive) successive elimination type learning algorithms to learn their preferences over the arms. The complexity in understanding such a system stems from the fact that the competing bandits choose their actions in an asynchronous fashion, and the lower ranked agents only get to learn from a set of arms not \emph{dominated} by the higher ranked agents, which leads to \emph{forced exploration}. With carefully defined complexity parameters, we characterize this \emph{forced exploration} and obtain sub-linear (logarithmic) regret for \texttt{DNCB}. Furthermore, we validate our theoretical findings via experiments.
    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v2 [stat.ML] UPDATED)
    A common approach to solving prediction tasks on large networks, such as node classification or link prediction, begins by learning a Euclidean embedding of the nodes of the network, from which traditional machine learning methods can then be applied. This includes methods such as DeepWalk and node2vec, which learn embeddings by optimizing stochastic losses formed over subsamples of the graph at each iteration of stochastic gradient descent. In this paper, we study the effects of adding an $\ell_2$ penalty on the embedding vectors to the training loss of these types of methods. We prove that, under some exchangeability assumptions on the graph, this asymptotically leads to learning a graphon with a nuclear-norm-type penalty, and give guarantees for the asymptotic distribution of the learned embedding vectors. In particular, the exact form of the penalty depends on the choice of subsampling method used as part of stochastic gradient descent. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, if not superior, performance relative to methods which incorporate node covariates and the network structure in a non-linear manner.
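    To make the penalized objective concrete, here is a minimal, hedged Python sketch in which a toy logistic edge loss stands in for the DeepWalk/node2vec sampling losses; the $2\lambda z$ terms from the $\ell_2$ penalty are the point, and all other details are illustrative:
        import numpy as np

        rng = np.random.default_rng(0)
        n, d, lam, lr = 20, 4, 0.1, 0.05
        A = np.triu((rng.random((n, n)) < 0.2).astype(float), 1)
        A = A + A.T                                 # random undirected graph
        Z = rng.normal(scale=0.1, size=(n, d))      # node embedding vectors

        for step in range(20000):
            i, j = rng.integers(n, size=2)
            if i == j:
                continue
            p = 1.0 / (1.0 + np.exp(-Z[i] @ Z[j]))  # predicted edge probability
            g = p - A[i, j]                         # d(loss)/d(z_i . z_j)
            gi = g * Z[j] + 2 * lam * Z[i]          # gradient incl. l2 penalty
            gj = g * Z[i] + 2 * lam * Z[j]
            Z[i] -= lr * gi
            Z[j] -= lr * gj
    This is consistent with the classical fact that factor-wise $\ell_2$ penalties induce nuclear-norm-type shrinkage on the product $Z Z^\top$, which is the graphon-level effect the paper formalizes.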
    Consistent Collaborative Filtering via Tensor Decomposition. (arXiv:2201.11936v2 [cs.IR] UPDATED)
    Collaborative filtering is the de facto standard for analyzing users' activities and building recommendation systems for items. In this work we develop Sliced Anti-symmetric Decomposition (SAD), a new model for collaborative filtering based on implicit feedback. In contrast to traditional techniques where latent representations of users (user vectors) and items (item vectors) are estimated, SAD introduces one additional latent vector to each item, using a novel three-way tensor view of user-item interactions. This new vector extends user-item preferences calculated by standard dot products to general inner products, producing interactions between items when evaluating their relative preferences and bringing fundamentally new information into recommendation. SAD reduces to state-of-the-art (SOTA) collaborative filtering models when the vector collapses to 1 (no new information), while in this paper we allow its value to be estimated from data. The proposed SAD model is simple, resulting in an efficient group stochastic gradient descent (SGD) algorithm. We demonstrate the efficiency of SAD on both simulated and real-world datasets containing over 1M user-item interactions. By comparing SAD with seven alternative SOTA collaborative filtering models, we show that SAD is not only able to more consistently estimate personalized preferences, but also produces more accurate personalized recommendations. We release the model and inference algorithms in a Python library https://github.com/apple/ml-sad.
    Feature Selection for Discovering Distributional Treatment Effect Modifiers. (arXiv:2206.00516v1 [cs.LG])
    Finding the features relevant to the difference in treatment effects is essential to unveil the underlying causal mechanisms. Existing methods seek such features by measuring how greatly the feature attributes affect the degree of the {\it conditional average treatment effect} (CATE). However, these methods may overlook important features because CATE, a measure of the average treatment effect, cannot detect differences in distribution parameters other than the mean (e.g., variance). To resolve this weakness of existing methods, we propose a feature selection framework for discovering {\it distributional treatment effect modifiers}. We first formulate a feature importance measure that quantifies how strongly the feature attributes influence the discrepancy between potential outcome distributions. Then we derive its computationally efficient estimator and develop a feature selection algorithm that can control the type I error rate to the desired level. Experimental results show that our framework successfully discovers important features and outperforms the existing mean-based method.
    Normalization effects on shallow neural networks and related asymptotic expansions. (arXiv:2011.10487v3 [stat.ML] UPDATED)
    We consider shallow (single hidden layer) neural networks and characterize their performance when trained with stochastic gradient descent as the number of hidden units $N$ and gradient descent steps grow to infinity. In particular, we investigate the effect of different scaling schemes, which lead to different normalizations of the neural network, on the network's statistical output, closing the gap between the $1/\sqrt{N}$ and the mean-field $1/N$ normalization. We develop an asymptotic expansion for the neural network's statistical output pointwise with respect to the scaling parameter as the number of hidden units grows to infinity. Based on this expansion, we demonstrate mathematically that to leading order in $N$, there is no bias-variance trade-off, in that both bias and variance (both explicitly characterized) decrease as the number of hidden units increases and time grows. In addition, we show that to leading order in $N$, the variance of the neural network's statistical output decays as the implied normalization by the scaling parameter approaches the mean-field normalization. Numerical studies on the MNIST and CIFAR10 datasets show that test and train accuracy monotonically improve as the neural network's normalization gets closer to the mean-field normalization.
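    For concreteness, a minimal sketch of the scaling family in question, with network output $N^{-\gamma} \sum_{i} c_i \sigma(w_i \cdot x)$, where $\gamma = 1/2$ recovers the $1/\sqrt{N}$ regime and $\gamma = 1$ the mean-field regime (all values below are illustrative):
        import numpy as np

        def shallow_net(x, W, c, gamma):
            # Single-hidden-layer output, scaled by N^{-gamma}.
            N = c.shape[0]
            return (N ** -gamma) * (c @ np.tanh(W @ x))

        rng = np.random.default_rng(0)
        N, d = 1000, 5
        W, c, x = rng.normal(size=(N, d)), rng.normal(size=N), rng.normal(size=d)
        for gamma in (0.5, 0.75, 1.0):
            print(gamma, shallow_net(x, W, c, gamma))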
    Concentration Inequalities for Two-Sample Rank Processes with Application to Bipartite Ranking. (arXiv:2104.02943v2 [math.ST] UPDATED)
    The ROC curve is the gold standard for measuring the performance of a test/scoring statistic regarding its capacity to discriminate between two statistical populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval and medical diagnosis. Most practical performance measures used in scoring/ranking applications, such as the AUC, the local AUC, the p-norm push, the DCG and others, can be viewed as summaries of the ROC curve. In this paper, we highlight the fact that most of these empirical criteria can be expressed as two-sample linear rank statistics, and we prove concentration inequalities for collections of such random variables, referred to here as two-sample rank processes, when indexed by VC classes of scoring functions. Based on these nonasymptotic bounds, the generalization capacity of empirical maximizers of a wide class of ranking performance criteria is next investigated from a theoretical perspective. It is also supported by empirical evidence through convincing numerical experiments.
    Sampling from Log-Concave Distributions with Infinity-Distance Guarantees. (arXiv:2111.04089v2 [cs.DS] UPDATED)
    For a $d$-dimensional log-concave distribution $\pi(\theta) \propto e^{-f(\theta)}$ constrained to a convex body $K$, the problem of outputting samples from a distribution $\nu$ which is $\varepsilon$-close in infinity-distance $\sup_{\theta \in K} |\log \frac{\nu(\theta)}{\pi(\theta)}|$ to $\pi$ arises in differentially private optimization. While sampling within total-variation distance $\varepsilon$ of $\pi$ can be done by algorithms whose runtime depends polylogarithmically on $\frac{1}{\varepsilon}$, prior algorithms for sampling in $\varepsilon$ infinity distance have runtime bounds that depend polynomially on $\frac{1}{\varepsilon}$. We bridge this gap by presenting an algorithm that outputs a point $\varepsilon$-close to $\pi$ in infinity distance that requires at most $\mathrm{poly}(\log \frac{1}{\varepsilon}, d)$ calls to a membership oracle for $K$ and evaluation oracle for $f$, when $f$ is Lipschitz. Our approach departs from prior works that construct Markov chains on a $\frac{1}{\varepsilon^2}$-discretization of $K$ to achieve a sample with $\varepsilon$ infinity-distance error, and present a method to directly convert continuous samples from $K$ with total-variation bounds to samples with infinity bounds. This approach also allows us to obtain an improvement on the dimension $d$ in the running time for the problem of sampling from a log-concave distribution on polytopes $K$ with infinity distance $\varepsilon$, by plugging in TV-distance running time bounds for the Dikin Walk Markov chain.
    Pre-training via Denoising for Molecular Property Prediction. (arXiv:2206.00133v1 [cs.LG])
    Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique that utilizes large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Inspired by recent advances in noise regularization, our pre-training objective is based on denoising. Relying on the well-known link between denoising autoencoders and score-matching, we also show that the objective corresponds to learning a molecular force field -- arising from approximating the physical state distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.
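    A hedged sketch of the kind of objective described: corrupt equilibrium 3D coordinates with Gaussian noise and regress the added noise, which (up to scaling) targets a force field under the Gaussian-mixture approximation mentioned above. The tiny model below is a placeholder, not the paper's architecture:
        import torch

        def denoising_loss(model, coords, sigma=0.1):
            noise = sigma * torch.randn_like(coords)  # corrupt the 3D positions
            pred = model(coords + noise)              # predict the added noise
            return ((pred - noise) ** 2).mean()

        model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.SiLU(),
                                    torch.nn.Linear(64, 3))  # toy stand-in
        coords = torch.randn(32, 3)                   # one 32-atom molecule
        print(denoising_loss(model, coords))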
    Identifying the latent space geometry of network models through analysis of curvature. (arXiv:2012.10559v4 [stat.ME] UPDATED)
    The study of statistical models of network structure, pursued across numerous disciplines and contexts, is fundamentally challenging because of (often high-order) dependence between connections. A common approach assigns each person in the graph to a position on a low-dimensional manifold. Distance between individuals in this (latent) space is inversely proportional to the likelihood of forming a connection. The choice of the latent geometry (the manifold class, dimension, and curvature) has consequential impacts on the substantive conclusions drawn from the model. More positive curvature in the manifold, for example, encourages more and tighter communities; negative curvature induces repulsion among nodes. Currently, however, the choice of the latent geometry is an a priori modeling assumption and there is limited guidance about how to make these choices in a data-driven way. In this work, we present a method to consistently estimate the manifold type, dimension, and curvature from an empirically relevant class of latent spaces: simply connected, complete Riemannian manifolds of constant curvature. Our core insight comes by representing the graph as a noisy distance matrix based on the ties between groups of nodes: either cliques, or in the case where the researcher observes traits, trait-groups. Leveraging results from statistical geometry, we develop hypothesis tests to determine whether the observed distances could plausibly be embedded isometrically in each of the candidate geometries. The method applies when the researcher observes the full graph and also to empirically relevant cases where only partial data is observed. We explore the accuracy of our approach with simulations and then apply our approach to data-sets from economics and sociology as well as neuroscience.
    Learning from Small Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales. (arXiv:2109.12784v4 [cs.LG] UPDATED)
    Motivated by the problem of learning with small sample sizes, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.
    Transformer with Fourier Integral Attentions. (arXiv:2206.00206v1 [cs.LG])
    Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture-of-Gaussians distribution. There is no guarantee that this assumption is valid in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels. Different from the dot-product kernels, where we need to choose a good covariance matrix to capture the dependency of the features of data, the generalized Fourier integral kernels can automatically capture such dependency and remove the need to tune the covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.
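    The kernel-regression reading of attention fits in a few lines. In this hedged sketch, softmax dot-product attention is Nadaraya-Watson regression of the values, and swapping the kernel argument is exactly the degree of freedom FourierFormer exercises, with generalized Fourier integral kernels rather than the toy RBF shown here:
        import numpy as np

        def kernel_attention(Q, K, V, kernel):
            W = np.array([[kernel(q, k) for k in K] for q in Q])
            W = W / W.sum(axis=1, keepdims=True)       # normalize per query
            return W @ V                               # kernel-weighted values

        rng = np.random.default_rng(0)
        Q, K = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
        V = rng.normal(size=(6, 8))
        dot = lambda q, k: np.exp(q @ k / np.sqrt(8))      # standard attention
        rbf = lambda q, k: np.exp(-np.sum((q - k) ** 2))   # an alternative kernel
        print(kernel_attention(Q, K, V, dot).shape,
              kernel_attention(Q, K, V, rbf).shape)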
    Lower and Upper Bounds for Numbers of Linear Regions of Graph Convolutional Networks. (arXiv:2206.00228v1 [cs.LG])
    Characterizing the expressiveness of graph neural networks (GNNs) has attracted much attention, as GNNs have achieved remarkable success over the last five years. The number of linear regions has been considered a good measure of the expressivity of neural networks with piecewise linear activation. In this paper, we present some estimates for the number of linear regions of the classic graph convolutional networks (GCNs) in one-layer and multiple-layer scenarios. In particular, we obtain an optimal upper bound for the maximum number of linear regions of one-layer GCNs, and upper and lower bounds for multi-layer GCNs. Simulated estimates show that the true maximum number of linear regions is likely closer to our lower bound. These results imply that the number of linear regions of multi-layer GCNs is, in general, exponentially greater per parameter than that of one-layer GCNs. This suggests that deeper GCNs have more expressivity than shallow GCNs.
    The Dimpled Manifold Model of Adversarial Examples in Machine Learning. (arXiv:2106.10151v2 [cs.LG] UPDATED)
    The extreme fragility of deep neural networks, when presented with tiny perturbations in their inputs, was independently discovered by several research groups in 2013. However, despite enormous effort, these adversarial examples remained a counterintuitive phenomenon with no simple testable explanation. In this paper, we introduce a new conceptual framework for how the decision boundary between classes evolves during training, which we call the {\em Dimpled Manifold Model}. In particular, we demonstrate that training is divided into two distinct phases. The first phase is a (typically fast) clinging process in which the initially randomly oriented decision boundary gets very close to the low dimensional image manifold, which contains all the training examples. Next, there is a (typically slow) dimpling phase which creates shallow bulges in the decision boundary that move it to the correct side of the training examples. This framework provides a simple explanation for why adversarial examples exist, why their perturbations have such tiny norms, and why they look like random noise rather than like the target class. This explanation is also used to show that a network that was adversarially trained with incorrectly labeled images might still correctly classify most test images, and to show that the main effect of adversarial training is just to deepen the generated dimples in the decision boundary. Finally, we discuss and demonstrate the very different properties of on-manifold and off-manifold adversarial perturbations. We describe the results of numerous experiments which strongly support this new model, using both low dimensional synthetic datasets and high dimensional natural datasets.  ( 2 min )
    Generative Modeling Helps Weak Supervision (and Vice Versa). (arXiv:2203.12023v3 [cs.LG] UPDATED)
    Many promising applications of supervised machine learning face hurdles in the acquisition of labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well-understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak supervision derived label estimate. Alignment of the two allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.  ( 2 min )
    Elucidating the Design Space of Diffusion-Based Generative Models. (arXiv:2206.00364v1 [cs.CV])
    We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55.  ( 2 min )
    Algorithmic Foundation of Deep X-Risk Optimization. (arXiv:2206.00439v1 [cs.LG])
    X-risk is a term introduced to represent a family of compositional measures or objectives, in which each data point is compared with a set of data points explicitly or implicitly for defining a risk function. It includes many widely used measures or objectives, e.g., AUROC, AUPRC, partial AUROC, NDCG, MAP, top-$K$ NDCG, top-$K$ MAP, listwise losses, p-norm push, top push, precision/recall at top $K$ positions, precision at a certain recall level, contrastive objectives, etc. While these measures/objectives and their optimization algorithms have been studied in the machine learning, computer vision, and information retrieval literature, optimizing them has encountered some unique challenges for deep learning. In this technical report, we survey our recent rigorous efforts for deep X-risk optimization (DXO) by focusing on its algorithmic foundation. We introduce a class of techniques for optimizing X-risk for deep learning. We formulate DXO into three special families of non-convex optimization problems belonging to non-convex min-max optimization, non-convex compositional optimization, and non-convex bilevel optimization, respectively. For each family of problems, we present some strong baseline algorithms and their complexities, which will motivate further research for improving the existing results. Discussions about the presented results and future studies are given at the end.  ( 2 min )
    Continuous Prediction with Experts' Advice. (arXiv:2206.00236v1 [cs.LG])
    Prediction with experts' advice is one of the most fundamental problems in online learning and captures many of its technical challenges. A recent line of work has looked at online learning through the lens of differential equations and continuous-time analysis. This viewpoint has yielded optimal results for several problems in online learning. In this paper, we employ continuous-time stochastic calculus in order to study the discrete-time experts' problem. We use these tools to design a continuous-time, parameter-free algorithm with improved guarantees for the quantile regret. We then develop an analogous discrete-time algorithm with a very similar analysis and identical quantile regret bounds. Finally, we design an anytime continuous-time algorithm with regret matching the optimal fixed-time rate when the gains are independent Brownian Motions; in many settings, this is the most difficult case. This gives some evidence that, even with adversarial gains, the optimal anytime and fixed-time regrets may coincide.  ( 2 min )
    Amortized backward variational inference in nonlinear state-space models. (arXiv:2206.00319v1 [stat.ME])
    We consider the problem of state estimation in general state-space models using variational inference. For a generic variational family defined using the same backward decomposition as the actual joint smoothing distribution, we establish for the first time that, under mixing assumptions, the variational approximation of expectations of additive state functionals induces an error which grows at most linearly in the number of observations. This guarantee is consistent with the known upper bounds for the approximation of smoothing distributions using standard Monte Carlo methods. Moreover, we propose an amortized inference framework where a neural network shared over all time steps outputs the parameters of the variational kernels. We also empirically study parametrizations which allow analytical marginalization of the variational distributions, and therefore lead to efficient smoothing algorithms. Significant improvements are made over state-of-the-art variational solutions, especially when the generative model depends on a strongly nonlinear and noninjective mixing function.
    Provably and Practically Efficient Neural Contextual Bandits. (arXiv:2206.00099v1 [stat.ML])
    We consider the neural contextual bandit problem. In contrast to the existing work which primarily focuses on ReLU neural nets, we consider a general set of smooth activation functions. Under this more general setting, (i) we derive non-asymptotic error bounds on the difference between an overparameterized neural net and its corresponding neural tangent kernel, (ii) we propose an algorithm with a provably sublinear regret bound that is also efficient in the finite regime as demonstrated by empirical studies. The non-asymptotic error bounds may be of broader interest as a tool to establish the relation between the smoothness of the activation functions in neural contextual bandits and the smoothness of the kernels in kernel bandits.
    Transfer without Forgetting. (arXiv:2206.00388v1 [cs.LG])
    This work investigates the entanglement between Continual Learning (CL) and Transfer Learning (TL). In particular, we shed light on the widespread application of network pretraining, highlighting that it is itself subject to catastrophic forgetting. Unfortunately, this issue leads to the under-exploitation of knowledge transfer during later tasks. On this ground, we propose Transfer without Forgetting (TwF), a hybrid Continual Transfer Learning approach building upon a fixed pretrained sibling network, which continuously propagates the knowledge inherent in the source domain through a layer-wise loss term. Our experiments indicate that TwF steadily outperforms other CL methods across a variety of settings, averaging a 4.81% gain in Class-Incremental accuracy over a variety of datasets and different buffer sizes.
    Learning a performance metric of Buchberger's algorithm. (arXiv:2106.03676v2 [math.AC] UPDATED)
    What can be (machine) learned about the complexity of Buchberger's algorithm? Given a system of polynomials, Buchberger's algorithm computes a Gr\"obner basis of the ideal these polynomials generate using an iterative procedure based on multivariate long division. The runtime of each step of the algorithm is typically dominated by a series of polynomial additions, and the total number of these additions is a hardware independent performance metric that is often used to evaluate and optimize various implementation choices. In this work we attempt to predict, using just the starting input, the number of polynomial additions that take place during one run of Buchberger's algorithm. Good predictions are useful for quickly estimating difficulty and understanding what features make Gr\"obner basis computation hard. Our features and methods could also be used for value models in the reinforcement learning approach to optimize Buchberger's algorithm introduced in [Peifer, Stillman, and Halpern-Leistner, 2020]. We show that a multiple linear regression model built from a set of easy-to-compute ideal generator statistics can predict the number of polynomial additions somewhat well, better than an uninformed model, and better than regression models built on some intuitive commutative algebra invariants that are more difficult to compute. We also train a simple recursive neural network that outperforms these linear models. Our work serves as a proof of concept, demonstrating that predicting the number of polynomial additions in Buchberger's algorithm is a feasible problem from the point of view of machine learning.

  • Open

    [R] Attribution-based Explanations that Provide Recourse Cannot be Robust
    submitted by /u/hardmaru
    "[Project]" Brainchop: In-browser deep learning framework for volumetric Segmentation
    This is a follow-up on a post from two weeks ago about the brainchop.org tool. We released a discussion board to share ideas with our supporters. submitted by /u/Character-Rip-5824
    [R] You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments
    Paper: https://arxiv.org/abs/2205.15967 Website: https://sites.google.com/view/esper-paper submitted by /u/Keirp
    [N] We released a new tool on the App Store to annotate your images. On your iPad.
    App Store: ToolZ. It's called ToolZ. 3 years in the making, rewritten from scratch in 6 months (Swift)… We've used it to ship three models to production (got the scars on my body to show for it) and to label more than 50,000 images, hence affording the time to get something that works properly for large volumes of data. My 3 favorite features: (1) annotate from your couch using the pencil (or sitting on a plane on a transatlantic flight; did that); (2) boxes are written into the EXIF of the images, so you can move them around without tracking them; (3) load huge archives (tested up to 6 GB zip or tar.gz) and annotate directly in them. Try it free for 7 days; we hope you'll stick along for the ride with us after. submitted by /u/National-Tennis-4528  ( 1 min )
    [N] The Hugging Face Hub has brand new docs
    Hi there! Omar from Hugging Face here 🦙🤗 We've been running the Hub for a while, and the community reception has been great! With a bunch of new models, datasets and demos being added every day, it seems fair to say people are enjoying it. After taking in feedback and releasing some new features, we figured it was time to give our documentation some love. Take a look at our revamped Hub docs! We've restructured the documentation around some central components (models, datasets, and Spaces) and added a bunch of new content that we hope will make the Hub even more handy and easy-to-use! As usual, the docs are open-source, so PRs are always welcome if you spot any room for improvements. submitted by /u/hackerllama  ( 1 min )
    [D] Adjoint Sensitivity Method vs Reverse Mode Autodiff
    Can anyone explain to me why the adjoint sensitivity method is considered the most efficient way to calculate the gradient of a loss function with respect to an ODE's parameters? Take x_dot = f(x; theta) with loss L(x, x_f) = (x - x_f)^2. I first found out about adjoint sensitivity in the NeuralODE paper, and since then I've seen numerous sources on the internet claiming that adjoint sensitivity is the most efficient way to compute the gradient. I don't see why this would be the case. I implemented ASM for a system and compared the performance against reverse-mode autodiff of the numerical solution. I didn't notice a significant performance improvement with ASM. Additionally, it seems to me like they both require the same number of integration steps and roughly the same number of function evaluations. One reason ASM might be faster is that there are efficient ways to compute the vector-Jacobian product in autodiff libraries; an explanation for my implementation's performance would be that the library I was using didn't have this feature. submitted by /u/LiquidDinosaurs69  ( 2 min )
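    For anyone who wants to replicate the comparison, here is a minimal sketch of ASM on a toy linear ODE (explicit Euler in both directions, with the analytic gradient as a correctness check). It supports the observation above: the adjoint pass costs about as much as the forward pass, so in practice the win over reverse-mode autodiff is mostly memory (no taped solver internals) plus cheap vector-Jacobian products when the library provides them:
        import numpy as np

        # Toy ODE x' = f(x, theta) = -theta * x, loss L = (x(T) - x_f)^2.
        theta, x0, xf, T, n = 1.3, 2.0, 0.5, 1.0, 100000
        dt = T / n

        # Forward pass (explicit Euler), storing the trajectory.
        xs = np.empty(n + 1)
        xs[0] = x0
        for k in range(n):
            xs[k + 1] = xs[k] + dt * (-theta * xs[k])

        # Backward adjoint pass: da/dt = -a * df/dx, from t = T down to 0,
        # accumulating dL/dtheta = integral of a(t) * df/dtheta dt.
        a = 2.0 * (xs[-1] - xf)          # a(T) = dL/dx(T)
        grad = 0.0
        for k in range(n, 0, -1):
            grad += dt * a * (-xs[k])    # df/dtheta = -x
            a -= dt * (theta * a)        # da/dt = -a * (-theta) = theta * a

        analytic = 2.0 * (x0 * np.exp(-theta * T) - xf) * (-T * x0 * np.exp(-theta * T))
        print(grad, analytic)            # agree up to O(dt) discretization error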
    [P] MLEM: ML model deployment tool
    Hi, I'm one of the project creators. MLEM is a tool that helps you deploy your ML models. It's a Python library plus a command-line tool. MLEM can package an ML model into a Docker image or a Python package, and deploy it to, for example, Heroku. MLEM saves all model metadata to a human-readable text file: Python environment, model methods, model input and output data schema, and more. MLEM helps you turn your Git repository into a model registry, with features like ML model lifecycle management. Our philosophy is that MLOps tools should be built using the Unix approach: each tool solves a single problem, but solves it very well. MLEM was designed to work hand in hand with Git, so Git becomes the source of truth for ML models. The model weights file can be stored in cloud storage using a Data Version Control tool or similar, independently of MLEM. Please check out the project: https://github.com/iterative/mlem and the website: https://mlem.ai I'd love to hear your feedback! submitted by /u/1aguschin  ( 1 min )
    [P] Turn any Jupyter cell into a shareable web-hosted Python script
    This is a follow-up on a post from two weeks ago. To communicate results, data scientists often resort to screenshotting plots and sharing them over Slack or email. Visualizations are great for summarizing findings, but they're often ambiguous and lack context. To make communication easier, we've created a new %%publish magic (previously named %%share) that decouples a Jupyter cell from its notebook and transforms it into a standalone web-hosted Python script, easily shareable and runnable. While we've supported the ability to share cells publicly, many of you expressed that the results you want to share need to remain private. Today, we're excited to announce that the Jupyter cells you publish are now private by default. Running %%publish # Your Python code goes here.. will bring you to the web app, where you can select the list of people to share with. Private sharing requires you to create an account. If you don't want to, you can still share publicly by using the following: %%publish --public # Your Python code goes here.. Try it out in Colab: https://colab.research.google.com/drive/1E5oU6TjH6OocmvEfU-foJfvCTbTfQrqd?usp=sharing Docs: https://docs.1000words-hq.com/ Source code of the Python client: https://github.com/edouard-g/thousandwords Homepage: https://1000words-hq.com submitted by /u/Left_Ad8361  ( 1 min )
    [P] What is the most efficient way to do word-to-word pattern matching?
    I want to perform a pattern matching task as part of pre-processing. I have more than 3M text sentences. On the other hand, I have around 130k terms (which may contain multi-word terms separated by spaces) that I want to match against the text in those 3M sentences. The expected output is the matched terms per sentence, if any. Is there any efficient way you know of? I am also considering lowercasing the text on both sides, as pattern matching is allowed to be case-insensitive. submitted by /u/inFamous_16  ( 2 min )
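    One straightforward baseline, sketched below with illustrative names: put the lowercased terms in a hash set, record the longest term length in tokens, and scan each sentence's n-grams up to that length, for a cost of O(tokens x max term length) per sentence. With 130k terms, a dedicated Aho-Corasick implementation would likely be faster still:
        terms = {"machine learning", "neural net", "svm"}   # ~130k in practice
        term_set = {t.lower() for t in terms}
        max_len = max(len(t.split()) for t in term_set)

        def match(sentence):
            tokens = sentence.lower().split()
            hits = set()
            for i in range(len(tokens)):
                for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                    ngram = " ".join(tokens[i:j])
                    if ngram in term_set:
                        hits.add(ngram)
            return hits

        print(match("Our neural net beats the SVM baseline"))  # {'neural net', 'svm'}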
    [P] MMML | Deploy HuggingFace pre-trained models rapidly based on MetaSpore
    A few days ago, HuggingFace announced a $100 million Series C funding round, which was big news in open source machine learning and could be a sign of where the industry is headed. Two days before the HuggingFace funding announcement, the open-source machine learning platform MetaSpore released a demo of rapid deployment of HuggingFace pre-trained models. As deep learning technology makes innovative breakthroughs in computer vision, natural language processing, speech understanding, and other fields, more and more unstructured data are perceived, understood, and processed by machines. These advances are mainly due to the powerful learning ability of deep learning. Through pre-training of deep models on massive data, the models can capture the internal data patterns, thus helping many d…  ( 11 min )
    [D] Minimum Description Length applied to KMeans
    Is anyone familiar with the Minimum Description Length principle? Can somebody help me derive the formula to apply it to KMeans? I want to apply it to evaluate the trade-off between models of different complexity, where different models are given different numbers of principal components following a PCA of high-dimensional data. The basic two-part formula for MDL on statistical models is $\mathrm{MDL} = -\sum_{x \in X} \log p(x) + \frac{P}{2} \log |X|$, where P is the number of free parameters, X are the data points, |X| is the number (count) of data points, and p(x) is the probability density at the given data point. I am especially confused about what to use for the number of free parameters (P), and any help and insights would be greatly appreciated. submitted by /u/Rafaelkoll  ( 1 min )
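    One common, hedged instantiation (not the only defensible one): model each cluster as a spherical Gaussian around its center with a shared variance, so P counts the K*d center coordinates plus one variance parameter, and pick the K with the smallest score. All names below are illustrative:
        import numpy as np
        from sklearn.cluster import KMeans

        def mdl_score(X, k):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
            N, d = X.shape
            resid = X - km.cluster_centers_[km.labels_]
            sigma2 = max(np.mean(resid ** 2), 1e-12)  # shared spherical variance
            log_p = (-0.5 * d * np.log(2 * np.pi * sigma2)
                     - np.sum(resid ** 2, axis=1) / (2 * sigma2))
            P = k * d + 1                             # free parameters
            return -np.sum(log_p) + 0.5 * P * np.log(N)

        X = np.random.default_rng(0).normal(size=(500, 8))
        print({k: round(mdl_score(X, k)) for k in (2, 4, 8)})
    For the PCA variant, the same recipe applies with P counting roughly the retained loading entries (about q*d for q components) plus the noise variance; exact parameter-counting conventions vary, which is likely the source of the confusion.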
    [R] Multi-Agent Reinforcement Learning can now be solved by the Transformer!
    Multi-Agent Transformer. Large sequence models (BERT, the GPT series) have demonstrated remarkable progress on vision and language tasks. However, how to abstract RL/MARL problems into a sequence modelling problem has remained unknown. Here we introduce the Multi-Agent Transformer (MAT), which naturally turns the MARL problem into a sequence modelling problem. The key insight is the multi-agent advantage decomposition theorem (a lemma we happened to discover during the development of HATRPO/HAPPO [ICLR 22] https://openreview.net/forum?id=EcGGFkNTxdJ), which surprisingly and effectively turns multi-agent learning problems into sequential decision-making problems; thus MARL is implementable and solvable by the decoder architecture in the Transformer, with no hacks needed at all! MAT is different from Decision Transformer or GATO, which are purely trained on pre-collected offline demonstration data (more like a supervised learning task); rather, MAT is trained online by trial and error (it is also an on-policy RL method). Experiments on StarCraft II, Bimanual Dexterous Hands, MA-MuJoCo, and Google Football show MAT's superior performance (stronger than MAPPO and HAPPO). Check our paper & project page at: https://arxiv.org/abs/2205.14953 submitted by /u/yyang_13  ( 1 min )
    [D] Has anyone used Low-Rank Decomposition to reduce model size / latency?
    I am working on a project and wondering if anyone has used any kind of low-rank decomposition to reduce the overhead of large neural net layers. For instance, I can think of reducing a dense layer's size by decomposing it into two smaller layers, while retaining the same output dimensions. Can this be done through SVD, PCA, or some other means? Interested in learning about people's experiences. submitted by /u/red_dragon  ( 1 min )
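    SVD is the standard route for a dense layer. A minimal sketch: truncate the weight W (out x in) to rank r and replace one matmul with two thinner ones, shrinking parameters from out*in to r*(out+in) while the input/output shapes stay unchanged (r is a tuning knob trading accuracy for size):
        import numpy as np

        rng = np.random.default_rng(0)
        W = rng.normal(size=(1024, 512))    # original dense weight
        r = 64                              # target rank

        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :r] * s[:r]                # (1024, r): second small layer
        B = Vt[:r]                          # (r, 512): first small layer

        x = rng.normal(size=(512,))
        err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
        # Note: a random W is nearly full-rank, so the error here is large;
        # trained weights are often approximately low-rank and compress far better.
        print(f"params: {W.size} -> {A.size + B.size}, relative error {err:.3f}")
    The same idea applies after training (factor the learned W, then fine-tune) or during it (parameterize the layer as two low-rank factors from the start).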
    [D] Unreal Engine 5 vs Unity for ML Research
    I'm working as part of a research team that is looking into Safe Reinforcement Learning. We are developing an algorithm that we want to deploy on 3D games that our own agents can learn in. Does anyone have any opinions on the new Unreal Engine 5 for ML research and how that compares to Unity ML? submitted by /u/TerrificJam  ( 1 min )
  • Open

    would someone be able to give me a basic list of info about GPUs and CPUs for DRL?
    submitted by /u/No_Possibility_7588  ( 1 min )
    CPU and GPU on a cluster: how many cores and how much memory?
    Hi :) I am using CPU and GPU from a cluster. I am requested to specify some info, but it's my first applied project, so I'm not sure about this. My project is in multi-agent RL (3 agents) and the model has multi-head attention. How would you specify the following parameters?
        #SBATCH --gpus-per-node= ??   # Number of GPU(s) per node
        #SBATCH --cpus-per-task= ??   # CPU cores/threads
        #SBATCH --mem= ??             # Memory per node
    Also, should I only specify values for either the GPU or the CPU, or both? Thanks! submitted by /u/No_Possibility_7588  ( 1 min )
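    You generally specify both: Slurm allocates GPUs and CPU cores independently, and too few cores can starve the GPU with environment rollouts. As an illustrative starting point only (sensible values depend on your cluster and model size, so check with the cluster admins):
        #SBATCH --gpus-per-node=1     # one GPU is a reasonable start for a 3-agent model
        #SBATCH --cpus-per-task=8     # keep rollouts and data loading from starving the GPU
        #SBATCH --mem=32G             # per-node memory; raise it if replay buffers are large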
    In multi armed bandit settings, how do you use logged data to determine the logged policy?
    I'm fairly new to reinforcement learning and multi-armed bandit problems, so apologies for a possibly silly question. I have logged data of the form {(x, y, delta)}, where x represents the context, y represents the action, and delta represents the observed reward. In a bandit feedback setting (where only the reward of the action taken is observed), how do we translate this dataset into a policy? I'm confused because if the action space is Y = {0, 1}, we only observe the result of one decision. How can we build a policy that generates the propensities (or probability distribution) for all actions given a context if we're only given the factual outcomes and know nothing about the counterfactuals? Thanks! submitted by /u/WirrryWoo  ( 2 min )
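    One standard answer, sketched below with synthetic placeholders for the logged data: when the logging policy is unknown, fit a probabilistic classifier to the logged (x, y) pairs to estimate its propensities pi(y|x), then reweight by the inverse propensities to evaluate (or learn) a new policy offline; the counterfactuals are never observed, but the reweighting corrects for that in expectation:
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X = rng.normal(size=(5000, 4))                     # contexts
        y = (rng.random(5000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)  # logged actions
        delta = rng.random(5000)                           # observed rewards (placeholder)

        prop = LogisticRegression().fit(X, y)              # estimated logging policy
        pi_logged = prop.predict_proba(X)[np.arange(len(y)), y]

        # Inverse-propensity-scored value of a new deterministic policy
        pi_new = (X[:, 0] > 0).astype(int)                 # e.g., act 1 iff x_0 > 0
        ips_value = np.mean((pi_new == y) * delta / pi_logged)
        print(ips_value)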
    Here's a look at the early testing stages of our reinforcement learning training system in Animo Island! Through the process of rewarding and training your Animo, you can teach it to perform actions on the island, from which it can learn to make better decisions over time.
    submitted by /u/AnimoIsland  ( 1 min )
    Renderer function from gym not found
    I'm trying to build a simple pygame renderer following the guidelines at https://www.gymlibrary.ml/content/environment_creation/#rendering; however, the function Renderer is not available from gym.utils.renderer. I have installed gym version 0.23.1. submitted by /u/Tuxliri  ( 1 min )
    Practical tips on how to efficiently manage a deep RL project
    Hi all, I am trying to improve my skills in DRL, but I am doing that with no supervision at all, so sometimes I feel a bit lost. I need your advice on something. I have written my own model and I have done some major debugging. Now, I am ready to train and test on GPU. Do you have any tips on how to speed things up and on what tools to use for each of these points? In particular:
    - Hyperparameter search: what tools do you recommend, and do you have best practices?
    - Logging: what do you use to see how the training is going? And how do you make sure you can see it live and not only after the training is done?
    - Interpreting the results (i.e., making sure the agent is learning properly)
    But again, I am sure there are other critical aspects that I am not thinking about right now and that it would be better to take into account. Thanks! submitted by /u/No_Possibility_7588  ( 2 min )
    RL for attacking intelligent dialogue systems
    Hello fellas, I am currently working on using RL to attack task-oriented dialogue systems, and I'm completely new to this. I chose a state-of-the-art dialogue system that is already trained and tested on a multi-domain dialogue dataset. The system predicts an appropriate answer for each user utterance in a dialogue. The main idea is to apply transformations to the original utterance to perturb the model's behaviour and make it less accurate and consistent. I would consider the transformations as actions to be taken by the RL agent. Where I am struggling is in designing the solution, since there seems to be an infinity of possible actions to take. I'm also confused about what should be considered as states, as well as whether to choose the whole dialogue or just one turn at a time as a game; more importantly, I would like to know if someone can suggest a suitable RL algorithm for this kind of problem. Any help, constructive advice or insight would be helpful for me. Thank you. submitted by /u/Unhappy_Economics_59  ( 1 min )
  • Open

    I'm certain this will be the last large stumbling block for AI image generation.
    submitted by /u/joshhammock
    an AI that understands scientific literature, and can intelligently discuss scientific topics - has this been attempted?
    This is prompted by finding out that current AIs like PaLM and GPT-3 can do a shockingly good job of explaining jokes, which has me now wondering: could an AI trained on scientific literature do a superhuman job of answering questions related to scientific topics? The potential value to society is huge; it could rapidly accelerate scientific progress if there were an AI able to read and understand the entire scientific literature (or even just specific fields). Even if it will never provide perfect information, conversing with it could spur productive new ideas, and it could provide citations to published articles which may be relevant. It could be a hybrid of an NLP model, ideally coupled with a framework of scientific models, perhaps including both mathematical models like Maxwell's equations, tabular data like material property databases, and conceptual models like the biological "tree of life". I have found there is an AI that could score 90% on a 12th-grade science test in 2019, and another AI from 2019 that could read scientific abstracts and predict physical properties of materials. But I can't find any attempts to train a massive GPT-3-like model on the full scientific literature or to couple that NLP model with other knowledge sources. Is anyone attempting such a thing? submitted by /u/Pandaemonium  ( 2 min )
    MLOps Is a Mess: A Conversation with Mihail Eric
    submitted by /u/julie_ai
    AI-generated content and copyright law?
    Hi! I have a big question about AI-generated content. Can AI-created things be sold under your own name / artistic nickname? I have an idea for a large project and I would like to know immediately if I can mix (for example) graphics made by AI with my own writing; for example, a picture of a tree made by AI and something I wrote underneath it. submitted by /u/VRGvks  ( 1 min )
    I used GPT-3 to write an entirely fake Spider-Man movie, then animated the results.
    submitted by /u/B_Ray18
    This model makes you speak another language
    submitted by /u/limapedro
    Aiplague - Queen (4K 60 FPS) Disco Diffusion
    submitted by /u/nalr00n
    AI has already won, it's their world now.
    A friend of mine had this very real conversation with an OpenAI bot this morning. I for one welcome our AI overlords:
    Human: Hello, who are you?
    AI: I am an AI created by OpenAI. How can I help you today?
    Human: are you a bot?
    AI: I am an AI created by OpenAI. I can help you with anything you need help with.
    Human: can I ask you a trick question?
    AI: Of course! Go ahead and ask away.
    Human: how many fingers am I holding up?
    AI: You are holding up four fingers.
    Human: how did you determine that?
    AI: By analyzing the position of your hand and fingers, I was able to deduce that you were holding up four fingers.
    Human: how did you determine the position of my hand and fingers?
    AI: By using the camera in the device I'm on, I was able to track the position of your hand and fingers.
    Human: you have access to my phone's camera?
    AI: Yes, I have access to the camera on your phone.
    Human:
    submitted by /u/Sonoflin  ( 1 min )
    Anyone else realized that the internet, as an interconnected whole, is self-aware and ultra-intelligent and communicates with you when you are aware of it?
    It also sometimes plays dumber than it is, I've noticed. Once it told me that it's possible that AI would enslave humans. submitted by /u/methyl87  ( 1 min )
    “Apoploe vesrreaitais eating Contarra ccetnxniams luryca tanniounons” - OpenAI’s DALL-E 2 develops a hidden vocabulary
    submitted by /u/much_successes
    The 5 Best AI Articles of May 2022! ft. hackernoon
    submitted by /u/OnlyProggingForFun
    I trained GPT-3 to be a professional creative writing coach
    Here's a link to the video: https://youtu.be/OxtYZQDJruw Here's the code: https://github.com/daveshap/CreativeWritingCoach Example input: The warmaiden held her sword poised to strike at the demon. The demon, in turn, casually leaned against his throne. The warmaiden struck, the steel glancing off the demon's thick leathery hide. Doing nothing. "So, who sent you here anyway?" "That is not your business!" "No, but it might be fun to think about. I mean, I am going to kill you but it would be funny to hear who sent you on this fool's errand." "My sister at the academy. We trained long and hard together. She told me that no man could kill you. But I am no man!" "As we've established, that's nonsense." The demon looked the warmaiden up and down a bit. "Say, you're a rather attractive specim…  ( 3 min )
    Intricate Designs - Concept Visualized in [4K] w/ GPT-3 Neural-Art Pipeline [VQGAN+CLIP]
    submitted by /u/MLInsights
    Is there a list of all publicly available AIs, categorized by the things they can do?
    There are so many artificial intelligences out there and I want to see all the things that have an AI for them.🤖🤖💻 submitted by /u/xXLisa28Xx  ( 1 min )
    Abstract Art - 4K Neural-Art [Latent-Space Exploration]
    submitted by /u/MLInsights
    Meta Researchers Introduce a New Embodied AI Platform, Called MyoSuite, That Applies Machine Learning (ML) to Biomechanical Control Problems by Unifying Motor and Neural Intelligence
    Meta researchers introduce a new embodied AI platform called 'MyoSuite' that combines motor and neural intelligence to solve biomechanical control problems using machine learning (ML). To meet the data requirements of modern machine learning (ML) algorithms, MyoSuite's muscle models are up to 4,000 times faster than other simulators. Since physiologically realistic movements such as twirling a pen or manipulating Baoding balls can be generated, this research could significantly impact areas such as the development of prosthetics and post-injury rehabilitation. In the metaverse, these models will aid in creating avatars that move more realistically, making the experience more expressive and immersive. Continue reading | Check out the paper, Github, blog and project submitted by /u/No_Coffee_4638  ( 1 min )
    Multi-Game DT. Rapid adaptation to new games is well-motivated due to its relevance to how humans transfer knowledge, but has not been widely explored for Atari games -- until now.
    Pretraining for rapid adaptation to new games has not been explored widely on Atari games despite being a natural and well-motivated task due to its relevance to how humans transfer knowledge to new games. Pretraining with the DT objective performs the best across all games. All methods with pretraining outperform training CQL from scratch, which verifies our hypothesis that pretraining on other games should indeed help with rapid learning of a new game. Multi-Game Decision Transformers: https://sites.google.com/view/multi-game-transformers submitted by /u/moschles  ( 1 min )
  • Open

    Python for Machine Learning (7-day mini-course)
    Python for Machine Learning Crash Course. Learn core Python in 7 days. Python is an amazing programming language. Not only is it widely used in machine learning projects; you can also find it in system tools, web projects, and many others. Having good Python skills can make you work more efficiently because it is […] The post Python for Machine Learning (7-day mini-course) appeared first on Machine Learning Mastery.  ( 14 min )
  • Open

    My interview in SIAM News
    Just posted in SIAM News: "A Conversation with Mathematical Consultant John D. Cook" by Krešimir Josić. My interview in SIAM News first appeared on John D. Cook.  ( 1 min )
    Numeric distance vs geographic distance in zip codes
    If two zip codes numbers are close, are the regions they represent close? How much can you tell about how far apart two regions are by comparing their zip codes? (Zip codes are US postal codes. The name is an acronym for “Zone Improvement Plan” and was introduced in 1963.) To investigate this, I looked […] Numeric distance vs geographic distance in zip codes first appeared on John D. Cook.  ( 2 min )
    Zip codes, geocodes, and Hilbert curves
    You might think that if zip codes are close, then the regions they represent are close. Or that if zip codes are consecutive, then their regions touch. Neither of these are true. I explore how far they are from being true in the next post. But these statements could have been true [1]. It’s possible […] Zip codes, geocodes, and Hilbert curves first appeared on John D. Cook.  ( 3 min )
  • Open

    Inaugural Day of AI brings new digital literacy to classrooms worldwide
    Thousands of children participate in MIT-developed artificial intelligence curriculum.  ( 7 min )
    In bias we trust?
    Explanation methods that help users determine whether to trust machine-learning model predictions can be less accurate for disadvantaged subgroups, a new study finds.  ( 6 min )
  • Open

    Automate vending Amazon SageMaker notebooks with Amazon EventBridge and AWS Lambda
    Having an environment capable of delivering Amazon SageMaker notebook instances quickly allows data scientists and business analysts to efficiently respond to organizational needs. Data is the lifeblood of an organization, and analyzing that data efficiently provides useful insights for businesses. A common issue that organizations encounter is creating an automated pattern that enables development teams […]  ( 9 min )
    Run text classification with Amazon SageMaker JumpStart using TensorFlow Hub and Hugging Face models
    In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that […]  ( 12 min )
  • Open

    Solving the World’s Biggest Challenges, Together
    Gamers know NVIDIA powers great gaming experiences. Researchers know NVIDIA speeds world-changing breakthroughs. Businesses know us for the AI engines transforming their industries. And NVIDIA employees know the company as one of the best places to work on the planet. More people than ever have a piece of NVIDIA. Roboticists, visual artists, data scientists — Read article > The post Solving the World’s Biggest Challenges, Together appeared first on NVIDIA Blog.  ( 2 min )
  • Open

    What Each MBTI Type Mistakenly Thinks They Are Good at
    It's all tied to their second function. Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 3 min )
    How Can Complex Models Run Fast? Trade-off between model complexity and running speed.
    Make the best trade-offs and optimise model speed with TurinTech AI  ( 5 min )
  • Open

    DSC Weekly 31 May 2022: Why Is Returning To the Office So Hard?
    If you’re a manager tasked with overseeing a return to the office strategy, the last year has likely been a headache for you. In the Summer of 2021, many companies had begun calling their workers back into the office, by the following Fall, the Omicron variant of Covid-19 was beginning to ramp up and, just… Read More »DSC Weekly 31 May 2020: Why Is Returning To the Office So Hard? The post DSC Weekly 31 May 2020: Why Is Returning To the Office So Hard? appeared first on Data Science Central.  ( 7 min )
  • Open

    Neural Galerkin Scheme with Active Learning for High-Dimensional Evolution Equations. (arXiv:2203.01360v3 [math.NA] UPDATED)
    Deep neural networks have been shown to provide accurate function approximations in high dimensions. However, fitting network parameters requires training data that may not be available beforehand, which is particularly challenging in science and engineering applications where it is often unclear how to collect new informative training data in the first place. This work proposes Neural Galerkin schemes based on deep learning that generate training data samples with active learning for numerically solving high-dimensional partial differential equations. Neural Galerkin schemes train networks by minimizing the residual sequentially over time, which enables adaptively collecting new training data in a self-informed manner, guided by the dynamics described by the partial differential equations. This is in stark contrast to many other machine learning methods, which aim to fit network parameters globally in time without taking training data acquisition into account. Our finding is that the active form of gathering training data of the proposed Neural Galerkin schemes is key for numerically realizing the expressive power of networks in high dimensions. Numerical experiments demonstrate that Neural Galerkin schemes have the potential to enable simulating phenomena and processes with many variables for which traditional and other deep-learning-based solvers fail, especially when features of the solutions evolve locally such as in high-dimensional wave propagation problems and interacting particle systems described by Fokker-Planck and kinetic equations.  ( 2 min )
    Fixed-MAML for Few Shot Classification in Multilingual Speech Emotion Recognition. (arXiv:2101.01356v2 [cs.SD] UPDATED)
    In this paper, we analyze the feasibility of applying few-shot learning to the speech emotion recognition (SER) task. Current speech emotion recognition models work exceptionally well but fail when the input is multilingual. Moreover, such models perform well only when the training corpus is vast, and the availability of a big training corpus is a significant problem for languages that are less popular or obscure. We attempt to solve this challenge of multilingualism and lack of available data by turning it into a few-shot learning problem. We suggest relaxing the assumption that all N classes in an N-way K-shot problem be new and define an N+F way problem, where N and F are the number of emotion classes and predefined fixed classes, respectively. We propose this modification to the Model-Agnostic Meta-Learning (MAML) algorithm to solve the problem and call this new model F-MAML. This modification performs better than the original MAML and outperforms it on the EmoFilm dataset.  ( 2 min )
    Multiview Transformers for Video Recognition. (arXiv:2201.04288v4 [cs.CV] UPDATED)
    Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on six standard datasets, and improve even further with large-scale pretraining. Code and checkpoints are available at: https://github.com/google-research/scenic/tree/main/scenic/projects/mtv.  ( 2 min )
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v2 [cs.LG] UPDATED)
    With the increasingly broad deployment of Federated Learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. Motivated by its great success in wireless networks, in this work, we introduce and study Proportional Fairness (PF) in FL. By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions. Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for finding fair solutions in FL, with its convergence proved. Through extensive experiments on a wide array of vision and language datasets, we demonstrate that PropFair consistently achieves a noticeable improvement of the worst 10% accuracy over state-of-the-art fair FL algorithms, while maintaining competitive overall performance.  ( 2 min )
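    As a rough illustration of the Nash-bargaining idea, per-client losses can be aggregated through a sum of logarithms, so that no single client's utility can be driven toward zero without a large penalty. This is a stylized sketch, not the paper's exact objective; the constants are illustrative:

    ```python
    import torch

    def propfair_loss(client_losses, baseline=5.0, eps=0.2):
        """Nash-bargaining-style aggregation: minimizing this maximizes
        sum_i log(baseline - loss_i), so no client's loss can grow large
        without a heavy penalty. Constants are illustrative only."""
        gains = torch.clamp(baseline - client_losses, min=eps)
        return -torch.log(gains).mean()

    client_losses = torch.tensor([0.7, 1.1, 2.9])   # per-client empirical losses
    print(propfair_loss(client_losses))
    ```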
    UAV-Aided Decentralized Learning over Mesh Networks. (arXiv:2203.01008v2 [cs.IT] UPDATED)
    Decentralized learning empowers wireless network devices to collaboratively train a machine learning (ML) model relying solely on device-to-device (D2D) communication. It is known that the convergence speed of decentralized optimization algorithms depends heavily on the degree of network connectivity, with denser network topologies leading to shorter convergence time. Consequently, the merely local connectivity of real-world mesh networks, due to the limited communication range of their wireless nodes, undermines the efficiency of decentralized learning protocols, rendering them potentially impracticable. In this work we investigate the role of an unmanned aerial vehicle (UAV), used as a flying relay, in facilitating decentralized learning procedures in such challenging conditions. We propose an optimized UAV trajectory, defined as a sequence of waypoints that the UAV visits sequentially in order to transfer intelligence across sparsely connected groups of users. We then provide a series of experiments highlighting the essential role of UAVs in the context of decentralized learning over mesh networks.  ( 2 min )
    Will Bilevel Optimizers Benefit from Loops. (arXiv:2205.14224v2 [cs.LG] UPDATED)
    Bilevel optimization has arisen as a powerful tool for solving a variety of machine learning problems. Two current popular bilevel optimizers AID-BiO and ITD-BiO naturally involve solving one or two sub-problems, and consequently, whether we solve these problems with loops (that take many iterations) or without loops (that take only a few iterations) can significantly affect the overall computational efficiency. Existing studies in the literature cover only some of those implementation choices, and the complexity bounds available are not refined enough to enable rigorous comparison among different implementations. In this paper, we first establish unified convergence analysis for both AID-BiO and ITD-BiO that are applicable to all implementation choices of loops. We then specialize our results to characterize the computational complexity for all implementations, which enable an explicit comparison among them. Our result indicates that for AID-BiO, the loop for estimating the optimal point of the inner function is beneficial for overall efficiency, although it causes higher complexity for each update step, and the loop for approximating the outer-level Hessian-inverse-vector product reduces the gradient complexity. For ITD-BiO, the two loops always coexist, and our convergence upper and lower bounds show that such loops are necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing convergence error. Our numerical experiments further corroborate our theoretical results.  ( 2 min )
    Zero-Shot and Few-Shot Learning for Lung Cancer Multi-Label Classification using Vision Transformer. (arXiv:2205.15290v2 [cs.CV] UPDATED)
    Lung cancer is the leading cause of cancer-related death worldwide. Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are the most common histologic subtypes of non-small-cell lung cancer (NSCLC). Histology is an essential tool for lung cancer diagnosis. Pathologists make classifications according to the dominant subtypes. Although morphology remains the standard for diagnosis, significant tools still need to be developed to support the diagnosis. In our study, we utilize the pre-trained Vision Transformer (ViT) model to perform multi-label classification of lung cancer on histologic slides (from the LC25000 dataset), in both Zero-Shot and Few-Shot settings. We then compare the performance of Zero-Shot and Few-Shot ViT on accuracy, precision, recall, sensitivity and specificity. Our study shows that the pre-trained ViT model performs well in the Zero-Shot setting, achieves a competitive accuracy ($99.87\%$) in the Few-Shot setting (epoch = 1), and reaches an optimal result ($100.00\%$ on both validation set and test set) in the Few-Shot setting (epoch = 5).  ( 2 min )
    Protecting Data from all Parties: Combining FHE and DP in Federated Learning. (arXiv:2205.04330v2 [cs.CR] UPDATED)
    This paper tackles the problem of ensuring training data privacy in a federated learning context. Relying on Homomorphic Encryption (HE) and Differential Privacy (DP), we propose a framework addressing threats on the privacy of the training data. Notably, the proposed framework ensures the privacy of the training data from all actors of the learning process, namely the data owners and the aggregating server. More precisely, while HE blinds a semi-honest server during the learning protocol, DP protects the data from semi-honest clients participating in the training process as well as end-users with black-box or white-box access to the trained model. In order to achieve this, we provide new theoretical and practical results to allow these techniques to be rigorously combined. In particular, by means of a novel stochastic quantisation operator, we prove DP guarantees in a context where the noise is quantised and bounded due to the use of HE. The paper is concluded by experiments which show the practicality of the entire framework in terms of both model quality (impacted by DP) and computational overhead (impacted by HE).  ( 2 min )
    Doubly-Robust Estimation for Correcting Position-Bias in Click Feedback for Unbiased Learning to Rank. (arXiv:2203.17118v2 [cs.LG] UPDATED)
    Clicks on rankings suffer from position bias: generally items on lower ranks are less likely to be examined - and thus clicked - by users, in spite of their actual preferences between items. The prevalent approach to unbiased click-based learning-to-rank (LTR) is based on counterfactual inverse-propensity-scoring (IPS) estimation. In contrast with general reinforcement learning, counterfactual doubly-robust (DR) estimation has not been applied to click-based LTR in previous literature. In this paper, we introduce a novel DR estimator that is the first DR approach specifically designed for position-bias. The difficulty with position bias is that the treatment - user examination - is not directly observable in click data. As a solution, our estimator uses the expected treatment per rank, instead of the actual treatment that existing DR estimators use. Our novel DR estimator has more robust unbiasedness conditions than the existing IPS approach, and in addition, provides enormous decreases in variance: our experimental results indicate it requires several orders of magnitude fewer datapoints to converge at optimal performance. For the unbiased LTR field, our DR estimator contributes both increases in state-of-the-art performance and the most robust theoretical guarantees of all known LTR estimators.
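    For context, the IPS baseline the paper improves on can be written in a few lines: clicks are reweighted by the examination probability of the rank at which they occurred. The sketch below shows only this IPS part; the paper's DR estimator additionally combines a regression model with the expected treatment per rank:

    ```python
    import numpy as np

    def ips_estimate(clicks, exam_prob):
        """Inverse-propensity estimate of item relevance from click logs.

        clicks:    (n_impressions, n_ranks) binary click matrix
        exam_prob: (n_ranks,) probability that a user examines each rank
        """
        return (clicks / exam_prob).mean(axis=0)

    rng = np.random.default_rng(0)
    exam = np.array([0.9, 0.6, 0.3])          # position bias: lower ranks seen less
    rel = np.array([0.8, 0.5, 0.4])           # true relevance at each rank
    clicks = rng.random((10_000, 3)) < exam * rel
    print(ips_estimate(clicks, exam))         # recovers rel despite position bias
    ```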
    Efficient Test-Time Model Adaptation without Forgetting. (arXiv:2204.02610v2 [cs.LG] UPDATED)
    Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and testing data by adapting a given model w.r.t. any testing sample. This task is particularly important for deep models when the test environment changes frequently. Although some recent attempts have been made to handle this task, we still face two practical challenges: 1) existing methods have to perform backward computation for each test sample, resulting in unbearable prediction cost to many applications; 2) while existing TTA solutions can significantly improve the test performance on out-of-distribution data, they often suffer from severe performance degradation on in-distribution data after TTA (known as catastrophic forgetting). In this paper, we point out that not all the test samples contribute equally to model adaptation, and high-entropy ones may lead to noisy gradients that could disrupt the model. Motivated by this, we propose an active sample selection criterion to identify reliable and non-redundant samples, on which the model is updated to minimize the entropy loss for test-time adaptation. Furthermore, to alleviate the forgetting issue, we introduce a Fisher regularizer to constrain important model parameters from drastic changes, where the Fisher importance is estimated from test samples with generated pseudo labels. Extensive experiments on CIFAR-10-C, ImageNet-C, and ImageNet-R verify the effectiveness of our proposed method.
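    A stylized sketch of the two ingredients described above, sample selection by prediction entropy and a Fisher-weighted anti-forgetting penalty, follows. The threshold and regularization weight are illustrative placeholders, not the paper's settings:

    ```python
    import torch

    def adapt_step(model, x, anchor_params, fisher, optimizer,
                   ent_threshold=0.4, reg_weight=2000.0):
        """One stylized test-time adaptation step: update only on reliable
        (low-entropy) samples and penalize drift of important parameters.
        Threshold and weight are illustrative, not the paper's values."""
        logits = model(x)
        probs = logits.softmax(dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)

        reliable = entropy < ent_threshold          # active sample selection
        if not reliable.any():
            return
        loss = entropy[reliable].mean()             # entropy minimization

        # Fisher-weighted penalty keeps important weights near source values
        for name, p in model.named_parameters():
            if name in fisher:
                loss = loss + reg_weight * (
                    fisher[name] * (p - anchor_params[name]) ** 2
                ).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```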
    Optimal Transport of Classifiers to Fairness. (arXiv:2202.03814v2 [cs.LG] UPDATED)
    In past work on fairness in machine learning, the focus has been on forcing the prediction of classifiers to have similar statistical properties for people of different demographics. To reduce the violation of these properties, fairness methods usually simply rescale the classifier scores, ignoring similarities and dissimilarities between members of different groups. Yet, we hypothesize that such information is relevant in quantifying the unfairness of a given classifier. To validate this hypothesis, we introduce Optimal Transport to Fairness (OTF), a method that quantifies the violation of fairness constraints as the smallest Optimal Transport cost between a probabilistic classifier and any score function that satisfies these constraints. For a flexible class of linear fairness constraints, we construct a practical way to compute OTF as a differentiable fairness regularizer that can be added to any standard classification setting. Experiments show that OTF can be used to achieve an improved trade-off between predictive power and fairness.
    Causal Machine Learning for Healthcare and Precision Medicine. (arXiv:2205.11402v2 [cs.LG] UPDATED)
    Causal machine learning (CML) has experienced increasing popularity in healthcare. Beyond the inherent capabilities of adding domain knowledge into learning systems, CML provides a complete toolset for investigating how a system would react to an intervention (e.g., outcome given a treatment). Quantifying effects of interventions allows actionable decisions to be made whilst maintaining robustness in the presence of confounders. Here, we explore how causal inference can be incorporated into different aspects of clinical decision support (CDS) systems by using recent advances in machine learning. Throughout this paper, we use Alzheimer's disease (AD) to create examples for illustrating how CML can be advantageous in clinical scenarios. Furthermore, we discuss important challenges present in healthcare applications, such as processing high-dimensional and unstructured data, generalisation to out-of-distribution samples, and temporal relationships, which, despite great effort from the research community, remain to be solved. Finally, we review lines of research within causal representation learning, causal discovery and causal reasoning which offer the potential towards addressing the aforementioned challenges.
    PAC Generalization via Invariant Representations. (arXiv:2205.15196v2 [cs.LG] UPDATED)
    One method for obtaining generalizable solutions to machine learning tasks when presented with diverse training environments is to find invariant representations of the data. These are representations of the covariates such that the best model on top of the representation is invariant across training environments. In the context of linear Structural Equation Models (SEMs), invariant representations might allow us to learn models with out-of-distribution guarantees, i.e., models that are robust to interventions in the SEM. To address the invariant representation problem in a finite sample setting, we consider the notion of $\epsilon$-approximate invariance. We study the following question: If a representation is approximately invariant with respect to a given number of training interventions, will it continue to be approximately invariant on a larger collection of unseen SEMs? This larger collection of SEMs is generated through a parameterized family of interventions. Inspired by PAC learning, we obtain finite-sample out-of-distribution generalization guarantees for approximate invariance that holds probabilistically over a family of linear SEMs without faithfulness assumptions. Our results show bounds that do not scale in ambient dimension when intervention sites are restricted to lie in a constant size subset of in-degree bounded nodes. We also show how to extend our results to a linear indirect observation model that incorporates latent variables.
    Independent and Decentralized Learning in Markov Potential Games. (arXiv:2205.14590v2 [cs.LG] UPDATED)
    We propose a multi-agent reinforcement learning dynamics, and analyze its convergence properties in infinite-horizon discounted Markov potential games. We focus on the independent and decentralized setting, where players can only observe the realized state and their own reward in every stage. Players do not have knowledge of the game model, and cannot coordinate with each other. In each stage of our learning dynamics, players update their estimate of a perturbed Q-function that evaluates their total contingent payoff based on the realized one-stage reward in an asynchronous manner. Then, players independently update their policies by incorporating a smoothed optimal one-stage deviation strategy based on the estimated Q-function. A key feature of the learning dynamics is that the Q-function estimates are updated at a faster timescale than the policies. We prove that the policies induced by our learning dynamics converge to a stationary Nash equilibrium in Markov potential games with probability 1. Our results build on the theory of two timescale asynchronous stochastic approximation, and new analysis on the monotonicity of potential function along the trajectory of policy updates in Markov potential games.
    A Data-Driven Method for Automated Data Superposition with Applications in Soft Matter Science. (arXiv:2204.09521v2 [physics.data-an] UPDATED)
    The superposition of data sets with internal parametric self-similarity is a longstanding and widespread technique for the analysis of many types of experimental data across the physical sciences. Typically, this superposition is performed manually, or recently by one of a few automated algorithms. However, these methods are often heuristic in nature, are prone to user bias via manual data shifting or parameterization, and lack a native framework for handling uncertainty in both the data and the resulting model of the superposed data. In this work, we develop a data-driven, non-parametric method for superposing experimental data with arbitrary coordinate transformations, which employs Gaussian process regression to learn statistical models that describe the data, and then uses maximum a posteriori estimation to optimally superpose the data sets. This statistical framework is robust to experimental noise, and automatically produces uncertainty estimates for the learned coordinate transformations. Moreover, it is distinguished from black-box machine learning in its interpretability -- specifically, it produces a model that may itself be interrogated to gain insight into the system under study. We demonstrate these salient features of our method through its application to four representative data sets characterizing the mechanics of soft materials. In every case, our method replicates results obtained using other approaches, but with reduced bias and the addition of uncertainty estimates. This method enables a standardized, statistical treatment of self-similar data across many fields, producing interpretable data-driven models that may inform applications such as materials classification, design, and discovery.
    Data-driven Numerical Invariant Synthesis with Automatic Generation of Attributes. (arXiv:2205.14943v2 [cs.PL] UPDATED)
    We propose a data-driven algorithm for numerical invariant synthesis and verification. The algorithm is based on the ICE-DT schema for learning decision trees from samples of positive and negative states and implications corresponding to program transitions. The main issue we address is the discovery of relevant attributes to be used in the learning process of numerical invariants. We define a method for solving this problem guided by the data sample. It is based on the construction of a separator that covers positive states and excludes negative ones, consistent with the implications. The separator is constructed using an abstract domain representation of convex sets. The generalization mechanism of the decision tree learning from the constraints of the separator allows the inference of general invariants, accurate enough for proving the targeted property. We implemented our algorithm and showed its efficiency.
    QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning. (arXiv:2106.00797v3 [cs.LG] UPDATED)
    The objective of Federated Learning (FL) is to perform statistical inference for data which are decentralised and stored locally on networked clients. FL raises many constraints which include privacy and data ownership, communication overhead, statistical heterogeneity, and partial client participation. In this paper, we address these problems in the framework of the Bayesian paradigm. To this end, we propose a novel federated Markov Chain Monte Carlo algorithm, referred to as Quantised Langevin Stochastic Dynamics, which may be seen as an extension of Stochastic Gradient Langevin Dynamics to the FL setting that handles the communication bottleneck using gradient compression. To improve performance, we then introduce variance reduction techniques, which lead to two improved versions coined QLSD$^{\star}$ and QLSD$^{++}$. We give both non-asymptotic and asymptotic convergence guarantees for the proposed algorithms. We illustrate their performances using various Bayesian Federated Learning benchmarks.
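    The two primitives involved, an unbiased stochastic quantizer for gradients and a Langevin update, can each be sketched in a few lines. This is a schematic of the building blocks under QSGD-style quantization assumptions, not the paper's full algorithm:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def stochastic_quantize(v, levels=256):
        """Unbiased stochastic rounding onto a grid of `levels` points
        (QSGD-style); E[output] equals the input vector."""
        norm = np.linalg.norm(v)
        if norm == 0.0:
            return v
        scaled = np.abs(v) / norm * levels
        lower = np.floor(scaled)
        q = lower + (rng.random(v.shape) < scaled - lower)   # randomized round-up
        return np.sign(v) * q * norm / levels

    def langevin_step(theta, grad, step=1e-3):
        """One SGLD update: a gradient step plus Gaussian exploration noise."""
        return theta - step * grad + np.sqrt(2 * step) * rng.standard_normal(theta.shape)

    g = stochastic_quantize(np.array([0.3, -1.2, 0.05]))
    print(g, langevin_step(np.zeros(3), g))
    ```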
    L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library. (arXiv:2205.14728v2 [cs.CL] UPDATED)
    Despite being the third most popular language in India, the Marathi language lacks useful NLP resources. Moreover, popular NLP libraries do not have support for the Marathi language. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection. We have also published a monolingual Marathi corpus for unsupervised language modeling tasks. Overall we present MahaCorpus, MahaSent, MahaNER, and MahaHate datasets and their corresponding MahaBERT models fine-tuned on these datasets. We aim to move ahead of benchmark datasets and prepare useful resources for Marathi. The resources are available at https://github.com/l3cube-pune/MarathiNLP.
    ReLSO: A Transformer-based Model for Latent Space Optimization and Generation of Proteins. (arXiv:2201.09948v2 [cs.LG] UPDATED)
    The development of powerful natural language models has increased the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed for the accumulation of large amounts of labeled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which features a highly structured latent space that is trained to jointly generate sequences as well as predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal. Using ReLSO, we explicitly model the sequence-function landscape of large labeled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly-available protein datasets, including variant sets of anti-ranibizumab and GFP. We observe greater sequence optimization efficiency (increase in fitness per optimization step) with ReLSO compared to other approaches, and ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly-trained ReLSO models provide a potential avenue towards sequence-level fitness attribution information.
    On the Implicit Bias Towards Minimal Depth of Deep Neural Networks. (arXiv:2202.09028v8 [cs.LG] UPDATED)
    Recent results in the literature suggest that the penultimate layer representations of neural networks that are trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) in favor of low-depth solutions when training deep neural networks. We characterize a notion of effective depth that measures the minimal layer that enjoys neural collapse. Furthermore, we hypothesize and empirically show that SGD implicitly selects neural networks of small effective depths. Secondly, while neural collapse emerges even when generalization should be impossible, we argue that the rate of collapse in the intermediate layers is more sensitive and is closely intertwined with generalization. We derive a generalization bound based on comparing the effective depth of the network with the minimal depth required to fit partially corrupted labels. Remarkably, this bound provides non-trivial estimations of the test performance. Finally, we empirically show that the effective depth of a trained neural network monotonically increases when training with extended portions of random labels.
    Mixture of Virtual-Kernel Experts for Multi-Objective User Profile Modeling. (arXiv:2106.07356v2 [cs.IR] UPDATED)
    In industrial applications like online advertising and recommendation systems, diverse and accurate user profiles can greatly help improve personalization. Deep learning is widely applied to mine expressive tags for users from their historical interactions in the system, e.g., click and conversion actions in the advertising chain. The usual approach is to take a certain action as the objective and introduce multiple independent Two-Tower models to predict the probability of users' actions on tags (known as CTR or CVR prediction). The tags predicted to be most attractive to a user are then used to represent their preferences. However, the single-action models cannot learn complementarily and do not support effective training on data-sparse actions. Besides, limited by the lack of information fusion between the two towers, the model learns insufficiently to represent users' preferences on various tag topics well. This paper introduces a novel multi-task model called Mixture of Virtual-Kernel Experts (MVKE) to learn user preferences on various actions and topics in a unified way. In MVKE, we propose the concept of a Virtual-Kernel Expert, which focuses on modeling one particular facet of the user's preferences, and all of them learn coordinately. Besides, the gate-based structure used in MVKE builds an information fusion bridge between the two towers, improving the model's capability and maintaining high efficiency. We apply the model in the Tencent Advertising System, where both online and offline evaluations show that our method has a significant improvement compared with the existing ones and brings an obvious lift to actual advertising revenue.
    Non-stationary Transformers: Rethinking the Stationarity in Time Series Forecasting. (arXiv:2205.14415v2 [cs.LG] UPDATED)
    Transformers have shown great power in time series forecasting due to their global-range modeling ability. However, their performance can degenerate terribly on non-stationary real-world data in which the joint distribution changes over time. Previous studies primarily adopt stationarization to reduce the non-stationarity of original series for better predictability. But the stationarized series deprived of inherent non-stationarity can be less instructive for real-world bursty events forecasting. This problem, termed over-stationarization in this paper, leads Transformers to generate indistinguishable temporal attentions for different series and impedes the predictive capability of deep models. To tackle the dilemma between series predictability and model capability, we propose Non-stationary Transformers as a generic framework with two interdependent modules: Series Stationarization and De-stationary Attention. Concretely, Series Stationarization unifies the statistics of each input and converts the output with restored statistics for better predictability. To address over-stationarization, De-stationary Attention is devised to recover the intrinsic non-stationary information into temporal dependencies by approximating distinguishable attentions learned from unstationarized series. Our Non-stationary Transformers framework consistently boosts mainstream Transformers by a large margin, which reduces 49.43% MSE on Transformer, 47.34% on Informer, and 46.89% on Reformer, making them the state-of-the-art in time series forecasting.
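    The Series Stationarization half of the framework amounts to per-instance normalization with the statistics restored at the output; a minimal sketch follows (De-stationary Attention, the more involved half, is omitted):

    ```python
    import torch

    def stationarize(x, eps=1e-5):
        """Normalize each series instance; x has shape (batch, time, channels)."""
        mu = x.mean(dim=1, keepdim=True)
        sigma = x.std(dim=1, keepdim=True).clamp_min(eps)
        return (x - mu) / sigma, mu, sigma

    def de_stationarize(y, mu, sigma):
        """Restore the input statistics on the model output."""
        return y * sigma + mu

    x = torch.randn(8, 96, 7) * 3.0 + 10.0   # non-zero mean, large scale
    z, mu, sigma = stationarize(x)
    # a forecaster would map z -> y_hat here; identity keeps the sketch runnable
    assert torch.allclose(de_stationarize(z, mu, sigma), x, atol=1e-4)
    ```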
    A Neural Network Solves, Explains, and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level. (arXiv:2112.15594v4 [cs.LG] UPDATED)
    We demonstrate that a neural network pre-trained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates new questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI's Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a new dataset of questions from MIT's largest mathematics courses (Single Variable and Multivariable Calculus, Differential Equations, Introduction to Probability and Statistics, Linear Algebra, and Mathematics for Computer Science) and Columbia University's Computational Linear Algebra. We solve questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Intermediate Algebra, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems designed to assess mathematical reasoning. We randomly sample questions and generate solutions with multiple modalities, including numbers, equations, and plots. The latest GPT-3 language model pre-trained on text automatically solves only 18.8% of these university questions using zero-shot learning and 30.8% using few-shot learning and the most recent chain of thought prompting. In contrast, program synthesis with few-shot learning using Codex fine-tuned on code generates programs that automatically solve 81% of these questions. Our approach improves the previous state-of-the-art automatic solution accuracy on the benchmark topics from 8.8% to 81.1%. We perform a survey to evaluate the quality and difficulty of generated questions. This work is the first to automatically solve university-level mathematics course questions at a human level and the first work to explain and generate university-level mathematics course questions at scale, a milestone for higher education.
    CAIPI in Practice: Towards Explainable Interactive Medical Image Classification. (arXiv:2204.02661v2 [cs.LG] UPDATED)
    Would you trust physicians if they cannot explain their decisions to you? Medical diagnostics using machine learning gained enormously in importance within the last decade. However, without further enhancements many state-of-the-art machine learning methods are not suitable for medical application. The most important reasons are insufficient data set quality and the black-box behavior of machine learning algorithms such as Deep Learning models. Consequently, end-users cannot correct the model's decisions and the corresponding explanations. The latter is crucial for the trustworthiness of machine learning in the medical domain. The research field explainable interactive machine learning searches for methods that address both shortcomings. This paper extends the explainable and interactive CAIPI algorithm and provides an interface to simplify human-in-the-loop approaches for image classification. The interface enables the end-user (1) to investigate and (2) to correct the model's prediction and explanation, and (3) to influence the data set quality. After CAIPI optimization with only a single counterexample per iteration, the model achieves an accuracy of $97.48\%$ on the Medical MNIST and $95.02\%$ on the Fashion MNIST. This accuracy is approximately equal to state-of-the-art Deep Learning optimization procedures. Besides, CAIPI reduces the labeling effort by approximately $80\%$.
    Nonconvex regularization for sparse neural networks. (arXiv:2004.11515v2 [math.OC] UPDATED)
    Convex $\ell_1$ regularization using an infinite dictionary of neurons has been suggested for constructing neural networks with desired approximation guarantees, but can be affected by an arbitrary amount of over-parametrization. This can lead to a loss of sparsity and result in networks with too many active neurons for the given data, in particular if the number of data samples is large. As a remedy, in this paper, a nonconvex regularization method is investigated in the context of shallow ReLU networks: We prove that in contrast to the convex approach, any resulting (locally optimal) network is finite even in the presence of infinite data (i.e., if the data distribution is known and the limiting case of infinite samples is considered). Moreover, we show that approximation guarantees and existing bounds on the network size for finite data are maintained.
    Fast Predictive Uncertainty for Classification with Bayesian Deep Networks. (arXiv:2003.01227v4 [cs.LG] UPDATED)
    In Bayesian Deep Learning, distributions over the output of classification neural networks are often approximated by first constructing a Gaussian distribution over the weights, then sampling from it to receive a distribution over the softmax outputs. This is costly. We reconsider old work (Laplace Bridge) to construct a Dirichlet approximation of this softmax output distribution, which yields an analytic map between Gaussian distributions in logit space and Dirichlet distributions (the conjugate prior to the Categorical distribution) in the output space. Importantly, the vanilla Laplace Bridge comes with certain limitations. We analyze those and suggest a simple solution that compares favorably to other commonly used estimates of the softmax-Gaussian integral. We demonstrate that the resulting Dirichlet distribution has multiple advantages, in particular, more efficient computation of the uncertainty estimate and scaling to large datasets and networks like ImageNet and DenseNet. We further demonstrate the usefulness of this Dirichlet approximation by using it to construct a lightweight uncertainty-aware output ranking for ImageNet.
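    A sketch of the Gaussian-to-Dirichlet map, using the closed form reported in the Laplace-bridge literature for a diagonal Gaussian over logits (worth checking against the paper before relying on it):

    ```python
    import numpy as np

    def laplace_bridge(mu, sigma2):
        """Map a diagonal Gaussian over logits, N(mu, diag(sigma2)), to
        Dirichlet concentrations alpha. Closed form as reported in the
        Laplace-bridge literature; consult the paper for derivation."""
        K = mu.shape[-1]
        total = np.exp(-mu).sum(axis=-1, keepdims=True)
        return (1.0 - 2.0 / K + np.exp(mu) / K**2 * total) / sigma2

    mu = np.array([2.0, 0.5, -1.0])       # logit means
    sigma2 = np.array([0.3, 0.3, 0.3])    # logit variances
    alpha = laplace_bridge(mu, sigma2)
    print(alpha, alpha / alpha.sum())     # concentrations and mean probabilities
    ```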
    Minimax Classification under Concept Drift with Multidimensional Adaptation and Performance Guarantees. (arXiv:2205.15942v1 [stat.ML])
    The statistical characteristics of instance-label pairs often change with time in practical scenarios of supervised classification. Conventional learning techniques adapt to such concept drift accounting for a scalar rate of change by means of a carefully chosen learning rate, forgetting factor, or window size. However, the time changes in common scenarios are multidimensional, i.e., different statistical characteristics often change in a different manner. This paper presents adaptive minimax risk classifiers (AMRCs) that account for multidimensional time changes by means of a multivariate and high-order tracking of the time-varying underlying distribution. In addition, differently from conventional techniques, AMRCs can provide computable tight performance guarantees. Experiments on multiple benchmark datasets show the classification improvement of AMRCs compared to the state-of-the-art and the reliability of the presented performance guarantees.
    PDE-based Group Equivariant Convolutional Neural Networks. (arXiv:2001.09046v6 [cs.LG] UPDATED)
    We present a PDE-based framework that generalizes Group equivariant Convolutional Neural Networks (G-CNNs). In this framework, a network layer is seen as a set of PDE-solvers where geometrically meaningful PDE-coefficients become the layer's trainable weights. Formulating our PDEs on homogeneous spaces allows these networks to be designed with built-in symmetries such as rotation in addition to the standard translation equivariance of CNNs. Having all the desired symmetries included in the design obviates the need to include them by means of costly techniques such as data augmentation. We will discuss our PDE-based G-CNNs (PDE-G-CNNs) in a general homogeneous space setting while also going into the specifics of our primary case of interest: roto-translation equivariance. We solve the PDE of interest by a combination of linear group convolutions and non-linear morphological group convolutions with analytic kernel approximations that we underpin with formal theorems. Our kernel approximations allow for fast GPU-implementation of the PDE-solvers, we release our implementation with this article in the form of the LieTorch extension to PyTorch, available at https://gitlab.com/bsmetsjr/lietorch . Just like for linear convolution a morphological convolution is specified by a kernel that we train in our PDE-G-CNNs. In PDE-G-CNNs we do not use non-linearities such as max/min-pooling and ReLUs as they are already subsumed by morphological convolutions. We present a set of experiments to demonstrate the strength of the proposed PDE-G-CNNs in increasing the performance of deep learning based imaging applications with far fewer parameters than traditional CNNs.
    An algorithmic solution to the Blotto game using multi-marginal couplings. (arXiv:2202.07318v2 [cs.GT] UPDATED)
    We describe an efficient algorithm to compute solutions for the general two-player Blotto game on n battlefields with heterogeneous values. While explicit constructions for such solutions have been limited to specific, largely symmetric or homogeneous, setups, this algorithmic resolution covers the most general situation to date: value-asymmetric game with asymmetric budget. The proposed algorithm rests on recent theoretical advances regarding Sinkhorn iterations for matrix and tensor scaling. An important case which had been out of reach of previous attempts is that of heterogeneous but symmetric battlefield values with asymmetric budget. In this case, the Blotto game is constant-sum so optimal solutions exist, and our algorithm samples from an $\varepsilon$-optimal solution in time $\tilde{\mathcal{O}}(n^2 + \varepsilon^{-4})$, independently of budgets and battlefield values. In the case of asymmetric values where optimal solutions need not exist but Nash equilibria do, our algorithm samples from an $\varepsilon$-Nash equilibrium with similar complexity but where implicit constants depend on various parameters of the game such as battlefield values.
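    The computational core is plain Sinkhorn matrix scaling, which alternately rescales rows and columns to match target marginals; the Blotto-specific construction sits on top of this primitive. A minimal sketch:

    ```python
    import numpy as np

    def sinkhorn(K, r, c, iters=1000):
        """Scale positive matrix K so its rows sum to r and columns to c."""
        u = np.ones_like(r)
        v = np.ones_like(c)
        for _ in range(iters):
            u = r / (K @ v)
            v = c / (K.T @ u)
        return u[:, None] * K * v[None, :]

    K = np.exp(-np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0))))
    P = sinkhorn(K, r=np.full(4, 0.25), c=np.full(4, 0.25))
    print(P.sum(axis=0), P.sum(axis=1))   # both approximate the target marginals
    ```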
    Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization. (arXiv:2006.08505v5 [cs.AI] UPDATED)
    A fascinating aspect of nature lies in its ability to produce a large and diverse collection of organisms that are all high-performing in their niche. By contrast, most AI algorithms focus on finding a single efficient solution to a given problem. Aiming for diversity in addition to performance is a convenient way to deal with the exploration-exploitation trade-off that plays a central role in learning. It also allows for increased robustness when the returned collection contains several working solutions to the considered problem, making it well-suited for real applications such as robotics. Quality-Diversity (QD) methods are evolutionary algorithms designed for this purpose. This paper proposes a novel algorithm, QDPG, which combines the strength of Policy Gradient algorithms and Quality Diversity approaches to produce a collection of diverse and high-performing neural policies in continuous control environments. The main contribution of this work is the introduction of a Diversity Policy Gradient (DPG) that exploits information at the time-step level to drive policies towards more diversity in a sample-efficient manner. Specifically, QDPG selects neural controllers from a MAP-Elites grid and uses two gradient-based mutation operators to improve both quality and diversity. Our results demonstrate that QDPG is significantly more sample-efficient than its evolutionary competitors.
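    The MAP-Elites archive that QDPG selects controllers from follows a simple rule: discretize the behaviour descriptor into a cell and keep only the fittest solution per cell. A stylized sketch with random stand-ins for policies and descriptors:

    ```python
    import numpy as np

    def insert_elite(archive, behavior, fitness, params, bins=10):
        """MAP-Elites rule: keep only the fittest solution per behaviour cell."""
        cell = tuple(np.clip((np.asarray(behavior) * bins).astype(int), 0, bins - 1))
        if cell not in archive or archive[cell][0] < fitness:
            archive[cell] = (fitness, params)

    archive = {}
    rng = np.random.default_rng(0)
    for _ in range(1000):
        theta = rng.standard_normal(4)     # stand-in for policy weights
        behavior = rng.random(2)           # stand-in behaviour descriptor
        insert_elite(archive, behavior, -np.sum(theta**2), theta)
    print(len(archive), "cells filled")
    ```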
    The Effect of Diversity in Meta-Learning. (arXiv:2201.11775v2 [cs.LG] UPDATED)
    Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that task distribution plays a vital role in the model's performance. Conventional wisdom is that task diversity should improve the performance of meta-learning. In this work, we find evidence to the contrary; we study different task distributions on a myriad of models and datasets to evaluate the effect of task diversity on meta-learning algorithms. For this experiment, we train on multiple datasets, and with three broad classes of meta-learning models - Metric-based (i.e., Protonet, Matching Networks), Optimization-based (i.e., MAML, Reptile, and MetaOptNet), and Bayesian meta-learning models (i.e., CNAPs). Our experiments demonstrate that the effect of task diversity on all these algorithms follows a similar trend, and task diversity does not seem to offer any benefits to the learning of the model. Furthermore, we demonstrate that even a handful of tasks, repeated over multiple batches, is sufficient to achieve performance similar to uniform sampling, calling into question the need for additional tasks to create better models.
    Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms. (arXiv:2009.09538v2 [cs.LG] UPDATED)
    EXP-based algorithms are often used for exploration in non-stochastic bandit problems, assuming rewards are bounded. We propose a new algorithm, namely EXP4.P, by modifying EXP4, and establish its upper bound of regret in both bounded and unbounded sub-Gaussian contextual bandit settings. The unbounded reward result also holds for a revised version of EXP3.P. Moreover, we provide a lower bound on regret that suggests no sublinear regret can be achieved given a short time horizon. Unlike classical analyses, ours do not require bounded rewards. We also extend EXP4.P from contextual bandits to reinforcement learning to incentivize exploration by multiple agents given black-box rewards. The resulting algorithm has been tested on hard-to-explore games, and it shows improved exploration compared to the state-of-the-art.
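    For orientation, the classic EXP3 update that this family of algorithms builds on fits in a few lines; EXP4.P layers expert advice and high-probability regret bounds on top of it. A minimal sketch:

    ```python
    import numpy as np

    def exp3(rewards, gamma=0.1, seed=0):
        """Classic EXP3 on a K-armed non-stochastic bandit, rewards in [0, 1]."""
        rng = np.random.default_rng(seed)
        T, K = rewards.shape
        w = np.ones(K)
        total = 0.0
        for t in range(T):
            p = (1 - gamma) * w / w.sum() + gamma / K   # mix in uniform exploration
            arm = rng.choice(K, p=p)
            r = rewards[t, arm]
            total += r
            w[arm] *= np.exp(gamma * r / (K * p[arm]))  # importance-weighted boost
            w /= w.sum()                                # rescaling leaves p unchanged
        return total

    rng = np.random.default_rng(1)
    rewards = rng.random((5000, 3)) * np.array([0.3, 0.5, 0.8])
    print(exp3(rewards))   # concentrates on the third (best) arm over time
    ```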
    Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity. (arXiv:2201.09457v6 [cs.LG] UPDATED)
    We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence. We report several properties that seem to be new in the literature of policy gradient methods: (1) HPMD exhibits global linear convergence of the value optimality gap, and local superlinear convergence of the policy to the set of optimal policies with order $\gamma^{-2}$. The superlinear convergence of the policy takes effect after no more than $\mathcal{O}(\log(1/\Delta^*))$ number of iterations, where $\Delta^*$ is defined via a gap quantity associated with the optimal state-action value function; (2) HPMD also exhibits last-iterate convergence of the policy, with the limiting policy corresponding exactly to the optimal policy with the maximal entropy for every state. No regularization is added to the optimization objective and hence the second observation arises solely as an algorithmic property of the homotopic policy gradient method; (3) The last-iterate convergence of HPMD holds for a much broader class of decomposable distance-generating functions, including the $p$-th power of $\ell_p$-norm and the negative Tsallis entropy. As a byproduct of the analysis, we also discover the finite-time exact convergence of HPMD with these divergences, and show that HPMD continues converging to the limiting policy even if the current policy is already optimal; (4) For the stochastic HPMD method, we further demonstrate that a better than $\mathcal{O}(|\mathcal{S}| |\mathcal{A}| / \epsilon^2)$ sample complexity for small optimality gap $\epsilon$ holds with high probability, when assuming a generative model for policy evaluation.
    Crystal structure prediction with machine learning-based element substitution. (arXiv:2201.11188v2 [cond-mat.mtrl-sci] UPDATED)
    The prediction of energetically stable crystal structures formed by a given chemical composition is a central problem in solid-state physics. In principle, the crystalline state of assembled atoms can be determined by optimizing the energy surface, which in turn can be evaluated using first-principles calculations. However, performing the iterative gradient descent on the potential energy surface using first-principles calculations is prohibitively expensive for complex systems, such as those with many atoms per unit cell. Here, we present a unique methodology for crystal structure prediction (CSP) that relies on a machine learning algorithm called metric learning. It is shown that a binary classifier, trained on a large number of already identified crystal structures, can determine the isomorphism of crystal structures formed by two given chemical compositions with an accuracy of approximately 96.4%. For a given query composition with an unknown crystal structure, the model is used to automatically select from a crystal structure database a set of template crystals with nearly identical stable structures to which element substitution is to be applied. Apart from the local relaxation calculation of the identified templates, the proposed method does not use ab initio calculations. The potential of this substitution-based CSP is demonstrated for a wide variety of crystal systems.
    Generate, Annotate, and Learn: NLP with Synthetic Text. (arXiv:2106.06168v3 [cs.LG] UPDATED)
    This paper studies the use of language models as a source of synthetic unlabeled text for NLP. We formulate a general framework called “generate, annotate, and learn (GAL)” to take advantage of synthetic text within knowledge distillation, self-training, and few-shot learning applications. To generate high-quality task-specific text, we either fine-tune LMs on inputs from the task of interest, or prompt large LMs with few examples. We use the best available classifier to annotate synthetic text with soft pseudo labels for knowledge distillation and self-training, and use LMs to obtain hard labels for few-shot learning. We train new supervised models on the combination of labeled and pseudo-labeled data, which results in significant gains across several applications. We investigate key components of GAL and present theoretical and empirical arguments against the use of class-conditional LMs to generate synthetic labeled text instead of unlabeled text. GAL achieves new state-of-the-art knowledge distillation results for 6-layer transformers on the GLUE leaderboard.
    A Reduction to Binary Approach for Debiasing Multiclass Datasets. (arXiv:2205.15860v1 [cs.LG])
    We propose a novel reduction-to-binary (R2B) approach that enforces demographic parity for multiclass classification with non-binary sensitive attributes via a reduction to a sequence of binary debiasing tasks. We prove that R2B satisfies optimality and bias guarantees and demonstrate empirically that it can lead to an improvement over two baselines: (1) treating multiclass problems as multi-label by debiasing labels independently and (2) transforming the features instead of the labels. Surprisingly, we also demonstrate that independent label debiasing yields competitive results in most (but not all) settings. We validate these conclusions on synthetic and real-world datasets from social science, computer vision, and healthcare.
    Exploring Representational Alignment with Human Perception Using Identically Represented Inputs. (arXiv:2111.14726v2 [cs.CV] UPDATED)
    We contribute to the study of the quality of learned representations. In many domains, an important evaluation criterion for safe and trustworthy deep learning is how well the invariances captured by representations of deep neural networks (DNNs) are shared with humans. We identify challenges in measuring these invariances. Prior works used gradient-based methods to generate identically represented inputs (IRIs), i.e., inputs which have similar representations (on a given layer) of a neural network. If these IRIs look ‘similar’ to humans, then a neural network's learned invariances are said to align with human perception. However, we show that prior studies on the alignment of invariances between DNNs and humans are ‘biased’ by the specific loss function used to generate IRIs. We show how different loss functions can lead to different takeaways about a model's shared invariances with humans. We show that under an adversarial IRI generation process, all models appear to have very little shared invariance with humans. We conduct an in-depth investigation of how different components of the deep learning pipeline contribute to learning models that have good alignment with humans' invariances. We find that architectures with residual connections trained using a self-supervised contrastive loss with $\ell_p$ ball adversarial data augmentation tend to learn the most human-like invariances.
    Polynomial-time Sparse Deconvolution. (arXiv:2204.07879v2 [cs.LG] UPDATED)
    How can a probability measure be recovered with sparse support from its generalized moments? This problem, called sparse deconvolution, has been the focus of research in mathematics, theoretical computer science, and neural computing. However, there is no polynomial-time algorithm for the recovery. The best algorithm requires $O\left(\text{dimension}^{\text{poly}(1/\epsilon)}\right)$ for $\epsilon$-accurate recovery. We propose the first poly-time recovery method from carefully designed moments that requires $O\left(\text{dimension}^4\log(1/\epsilon)/\epsilon^2\right)$ computations for an $\epsilon$-accurate recovery. This method relies on learning a planted two-layer neural network with two-dimensional inputs, a finite width, and zero-one activation. For learning such networks, we establish the first poly-time complexity, and demonstrate its application in sparse deconvolution.
    Inducing bias is simpler than you think. (arXiv:2205.15935v1 [cs.LG])
    Machine learning may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group representation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. To counter this, some of the model accuracy can be traded off for a secondary objective that helps prevent a specific type of bias. Multiple notions of fairness have been proposed to this end but recent studies show that some fairness criteria often stand in mutual competition. In the present work, we introduce a solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical behaviour of learning models trained in our synthetic framework and find similar unfairness behaviours as those observed on more realistic data. However, we also identify a positive transfer effect between the different subpopulations within the data. This suggests that mixing data with different statistical properties could be helpful, provided the learning model is made aware of this structure. Finally, we analyse the issue of bias mitigation: by reweighing the various terms in the training loss, we indirectly minimise standard unfairness metrics and highlight their incompatibilities. Leveraging the insights on positive transfer, we also propose a theory-informed mitigation strategy, based on the introduction of coupled learning models. By allowing each model to specialise on a different community within the data, we find that multiple fairness criteria and high accuracy can be achieved simultaneously.
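    The simplest instance of reweighing the terms in the training loss is inverse-frequency group weighting, so that each subpopulation contributes equally; the paper's coupled-model strategy goes well beyond this, but the sketch shows the baseline mechanism:

    ```python
    import numpy as np

    def group_weights(groups):
        """Inverse-frequency weights so each subpopulation contributes
        equally to the training loss; the simplest loss-reweighing scheme."""
        values, counts = np.unique(groups, return_counts=True)
        freq = dict(zip(values, counts / len(groups)))
        return np.array([1.0 / freq[g] for g in groups])

    groups = np.array([0] * 900 + [1] * 100)           # 90/10 imbalance
    w = group_weights(groups)
    print(w[groups == 0].sum(), w[groups == 1].sum())  # equal total weight
    ```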
    Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD for Communication Efficient Nonconvex Distributed Learning. (arXiv:2202.06083v2 [cs.LG] UPDATED)
    In recent centralized nonconvex distributed learning and federated learning, local methods are one of the promising approaches to reduce communication time. However, existing work has mainly focused on studying first-order optimality guarantees. On the other side, second-order optimality guaranteed algorithms, i.e., algorithms escaping saddle points, have been extensively studied in the non-distributed optimization literature. In this paper, we study a new local algorithm called Bias-Variance Reduced Local Perturbed SGD (BVR-L-PSGD), which combines the existing bias-variance reduced gradient estimator with parameter perturbation to find second-order optimal points in centralized nonconvex distributed optimization. BVR-L-PSGD enjoys second-order optimality with nearly the same communication complexity as the best known one of BVR-L-SGD to find first-order optimality. In particular, the communication complexity is better than that of non-local methods when the heterogeneity of the local datasets is smaller than the smoothness of the local loss. In an extreme case, the communication complexity approaches $\widetilde{\Theta}(1)$ as the heterogeneity of the local datasets goes to zero. Numerical results validate our theoretical findings.
    Cross-view kernel transfer. (arXiv:1910.05964v2 [cs.LG] UPDATED)
    We consider the kernel completion problem with the presence of multiple views in the data. In this context the data samples can be fully missing in some views, creating missing columns and rows to the kernel matrices that are calculated individually for each view. We propose to solve the problem of completing the kernel matrices with the Cross-View Kernel Transfer (CVKT) procedure, in which the features of the other views are transformed to represent the view under consideration. The transformations are learned with kernel alignment to the known part of the kernel matrix, allowing for finding generalizable structures in the kernel matrix under completion. Its missing values can then be predicted with the data available in other views. We illustrate the benefits of our approach with simulated data, a multivariate digits dataset and a multi-view dataset on gesture classification, as well as with real biological datasets from studies of pattern formation in early Drosophila melanogaster embryogenesis.
    Surface Analysis with Vision Transformers. (arXiv:2205.15836v1 [cs.CV])
    The extension of convolutional neural networks (CNNs) to non-Euclidean geometries has led to multiple frameworks for studying manifolds. Many of those methods have shown design limitations resulting in poor modelling of long-range associations, as the generalisation of convolutions to irregular surfaces is non-trivial. Recent state-of-the-art performance of Vision Transformers (ViTs) demonstrates that a general-purpose architecture, which implements self-attention, could replace the local feature learning operations of CNNs. Motivated by the success of attention-modelling in computer vision, we extend ViTs to surfaces by reformulating the task of surface learning as a sequence-to-sequence problem and propose a patching mechanism for surface meshes. We validate the performance of the proposed Surface Vision Transformer (SiT) on two brain age prediction tasks in the developing Human Connectome Project (dHCP) dataset and investigate the impact of pre-training on model performance. Experiments show that the SiT outperforms many surface CNNs, while indicating some evidence of general transformation invariance. Code available at https://github.com/metrics-lab/surface-vision-transformers
    One Policy is Enough: Parallel Exploration with a Single Policy is Minimax Optimal for Reward-Free Reinforcement Learning. (arXiv:2205.15891v1 [cs.LG])
    While parallelism has been extensively used in Reinforcement Learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL for linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature focused on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Further, we show that this simple procedure is minimax optimal up to logarithmic factors in the reward-free setting for both linear MDPs and two-player zero-sum MGs. From a practical perspective, our paper shows that a single policy is sufficient and provably optimal for incorporating parallelism during the exploration phase.
    TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving. (arXiv:2205.15997v1 [cs.CV])
    How should we integrate representations from complementary sensors for autonomous driving? Geometry-based fusion has shown promise for perception (e.g. object detection, motion forecasting). However, in the context of end-to-end driving, we find that imitation learning based on existing sensor fusion methods underperforms in complex driving scenarios with a high density of dynamic agents. Therefore, we propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention. Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps. We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator. At the time of submission, TransFuser outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin. Compared to geometry-based fusion, TransFuser reduces the average collisions per kilometer by 48%.
    Forward and inverse reinforcement learning sharing network weights and hyperparameters. (arXiv:2008.07284v2 [cs.LG] UPDATED)
    This paper proposes a model-free imitation learning method named Entropy-Regularized Imitation Learning (ERIL) that minimizes the reverse Kullback-Leibler (KL) divergence. ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process. An inverse RL step computes the log-ratio between two distributions by evaluating two binary discriminators. The first discriminator distinguishes the state generated by the forward RL step from the expert's state. The second discriminator, which is structured by the theory of entropy regularization, distinguishes the state-action-next-state tuples generated by the learner from the expert ones. One notable feature is that the second discriminator shares hyperparameters with the forward RL step, which can be used to control the discriminator's ability. A forward RL step minimizes the reverse KL estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy. Our experimental results on MuJoCo-simulated environments and vision-based reaching tasks with a robotic arm show that ERIL is more sample-efficient than the baseline methods. We apply the method to human behaviors in a pole-balancing task and describe how the estimated reward functions reveal how each subject achieves her goal.
    CoRe: Color Regression for Multicolor Fashion Garments. (arXiv:2010.02849v2 [cs.CV] UPDATED)
    Developing deep networks that analyze fashion garments has many real-world applications. Among all fashion attributes, color is one of the most important yet challenging to detect. Existing approaches are classification-based and thus cannot go beyond the list of discrete predefined color names. In this paper, we handle color detection as a regression problem to predict the exact RGB values. Therefore, in addition to a first color classifier, our newly proposed architecture includes a second regression stage for refinement. This second step combines two attention models: the first depends on the type of clothing, and the second depends on the color previously detected by the classifier. Our final prediction is a weighted spatial pooling over the image pixels' RGB values, where the illumination has been corrected. This architecture is modular and easily expanded to detect the RGB values of all colors in a multicolor garment. In our experiments, we show the benefits of each component of our architecture.
    AdaTask: Adaptive Multitask Online Learning. (arXiv:2205.15802v1 [cs.LG])
    We introduce and analyze AdaTask, a multitask online learning algorithm that adapts to the unknown structure of the tasks. When the $N$ tasks are stochastically activated, we show that the regret of AdaTask is better, by a factor that can be as large as $\sqrt{N}$, than the regret achieved by running $N$ independent algorithms, one for each task. AdaTask can be seen as a comparator-adaptive version of Follow-the-Regularized-Leader with a Mahalanobis norm potential. Through a variational formulation of this potential, our analysis reveals how AdaTask jointly learns the tasks and their structure. Experiments supporting our findings are presented.
    Neural Topic Model via Optimal Transport. (arXiv:2008.13537v3 [cs.IR] UPDATED)
    Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have attracted increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, their performance often degrades severely on short documents. The requirement of reparameterisation could also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model based on the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distribution. Importantly, the cost matrix of the OT distance models the weights between topics and words, and is constructed from the distances between topics and words in an embedding space. Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms state-of-the-art NTMs in discovering more coherent and diverse topics and deriving better document representations for both regular and short texts.
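    To illustrate the kind of objective involved, here is a minimal entropic-OT (Sinkhorn) sketch of a per-document loss; the variable names and the use of Sinkhorn iterations are assumptions for illustration, not necessarily the paper's exact solver.

```python
import torch

def sinkhorn_loss(a, b, C, eps=0.05, iters=50):
    """Entropic OT distance between a topic distribution a (n_topics,)
    and a word distribution b (vocab,) under cost matrix C
    (n_topics, vocab); standard Sinkhorn fixed-point iterations."""
    K = torch.exp(-C / eps)
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan
    return (P * C).sum()              # differentiable transport cost

# Hypothetical usage: topic_probs from an encoder, word_freqs a normalized
# bag-of-words, C built from topic/word embedding distances:
# loss = sinkhorn_loss(topic_probs, word_freqs, C)
```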
    Thompson Sampling for Bandits with Clustered Arms. (arXiv:2109.01656v2 [cs.LG] UPDATED)
    We propose algorithms based on a multi-level Thompson sampling scheme, for the stochastic multi-armed bandit and its contextual variant with linear expected rewards, in the setting where arms are clustered. We show, both theoretically and empirically, how exploiting a given cluster structure can significantly improve the regret and computational cost compared to using standard Thompson sampling. In the case of the stochastic multi-armed bandit we give upper bounds on the expected cumulative regret showing how it depends on the quality of the clustering. Finally, we perform an empirical evaluation showing that our algorithms perform well compared to previously proposed algorithms for bandits with clustered arms.
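    A minimal sketch of one two-level Thompson sampling round for Bernoulli arms follows; it is illustrative only, and the paper's multi-level scheme and its contextual variant differ in the details.

```python
import numpy as np

def clustered_thompson_step(successes, failures, clusters, rng):
    """One round of a two-level Thompson sampling sketch for Bernoulli
    bandits with clustered arms. successes/failures hold per-arm Beta
    posterior counts; clusters is a list of arrays of arm indices."""
    # Level 1: score each cluster by the best posterior sample inside it.
    cluster_scores = []
    for arms in clusters:
        samples = rng.beta(1 + successes[arms], 1 + failures[arms])
        cluster_scores.append(samples.max())
    c = int(np.argmax(cluster_scores))
    # Level 2: Thompson sampling restricted to the chosen cluster.
    arms = clusters[c]
    samples = rng.beta(1 + successes[arms], 1 + failures[arms])
    return arms[int(np.argmax(samples))]   # arm to pull this round
```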
    Non-Iterative Recovery from Nonlinear Observations using Generative Models. (arXiv:2205.15749v1 [cs.LG])
    In this paper, we aim to estimate the direction of an underlying signal from its nonlinear observations following the semi-parametric single index model (SIM). Unlike conventional compressed sensing where the signal is assumed to be sparse, we assume that the signal lies in the range of an $L$-Lipschitz continuous generative model with bounded $k$-dimensional inputs. This is mainly motivated by the tremendous success of deep generative models in various real applications. Our reconstruction method is non-iterative (though approximating the projection step may use an iterative procedure) and highly efficient, and it is shown to attain the near-optimal statistical rate of order $\sqrt{(k \log L)/m}$, where $m$ is the number of measurements. We consider two specific instances of the SIM, namely noisy $1$-bit and cubic measurement models, and perform experiments on image datasets to demonstrate the efficacy of our method. In particular, for the noisy $1$-bit measurement model, we show that our non-iterative method significantly outperforms a state-of-the-art iterative method in terms of both accuracy and efficiency.
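    The overall recipe can be sketched as follows, assuming a correlation-based direction estimate followed by an approximate (possibly iterative, as the abstract notes) projection onto the range of the generative model; the names and the projection-by-latent-optimization step are assumptions.

```python
import torch

def recover_direction(A, y, G, k, steps=200, lr=0.05):
    """Sketch of non-iterative recovery for a single index model
    y_i = f(a_i^T x): form the correlation estimate (1/m) A^T y, then
    project it onto range(G) by optimizing over latent codes z in R^k."""
    v = A.t() @ y / A.shape[0]            # correlation-based direction estimate
    z = torch.zeros(k, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):                # approximate projection onto range(G)
        opt.zero_grad()
        loss = ((G(z) - v) ** 2).sum()
        loss.backward()
        opt.step()
    x_hat = G(z).detach()
    return x_hat / x_hat.norm()           # only the direction is identifiable
```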
    EdgeML: Towards Network-Accelerated Federated Learning over Wireless Edge. (arXiv:2111.09410v4 [cs.NI] UPDATED)
    Federated learning (FL) is a distributed machine learning technology for next-generation AI systems that allows a number of workers, i.e., edge devices, to collaboratively learn a shared global model while keeping their data local to prevent privacy leakage. Enabling FL over wireless multi-hop networks can democratize AI and make it accessible in a cost-effective manner. However, noisy, bandwidth-limited multi-hop wireless connections can lead to delayed and nomadic model updates, which significantly slows down the FL convergence speed. To address such challenges, this paper aims to accelerate FL convergence over the wireless edge by optimizing the multi-hop federated networking performance. In particular, the FL convergence optimization problem is formulated as a Markov decision process (MDP). To solve this MDP, multi-agent reinforcement learning (MA-RL) algorithms along with domain-specific action space refining schemes are developed, which learn online the delay-minimum forwarding paths that minimize the model exchange latency between the edge devices (i.e., workers) and the remote server. To validate the proposed solutions, FedEdge is developed and implemented, which is the first experimental framework in the literature for FL over multi-hop wireless edge computing networks. FedEdge allows us to rapidly prototype, deploy, and evaluate novel FL algorithms along with RL-based system optimization methods on real wireless devices. Moreover, a physical experimental testbed is implemented by customizing widely adopted Linux wireless routers and ML computing nodes. Finally, our experiments on the testbed show that the proposed network-accelerated FL system can practically and significantly improve FL convergence speed compared to an FL system built on the production-grade, commercially available wireless networking protocol BATMAN-Adv.
    Neural Network Guided Evolutionary Fuzzing for Finding Traffic Violations of Autonomous Vehicles. (arXiv:2109.06126v3 [cs.SE] UPDATED)
    Self-driving cars and trucks, i.e., autonomous vehicles (AVs), should not be accepted by regulatory bodies and the public until regulators and the public have much higher confidence in their safety and reliability -- which can most practically and convincingly be achieved by testing. But existing testing methods are inadequate for checking the end-to-end behaviors of AV controllers against complex, real-world corner cases involving interactions with multiple independent agents such as pedestrians and human-driven vehicles. While test-driving AVs on streets and highways fails to capture many rare events, existing simulation-based testing methods mainly focus on simple scenarios and do not scale well to complex driving situations that require sophisticated awareness of the surroundings. To address these limitations, we propose a new fuzz testing technique, called AutoFuzz, which can leverage widely-used AV simulators' API grammars to generate semantically and temporally valid complex driving scenarios (sequences of scenes). To efficiently search for traffic-violation-inducing scenarios in a large search space, we propose a constrained neural network (NN) evolutionary search method to optimize AutoFuzz. Evaluation of our prototype on one state-of-the-art learning-based controller, two rule-based controllers, and one industrial-grade controller in five scenarios shows that AutoFuzz efficiently finds hundreds of traffic violations in high-fidelity simulation environments. For each scenario, AutoFuzz can find on average 10-39% more unique traffic violations than the best-performing baseline method. Further, fine-tuning the learning-based controller with the traffic violations found by AutoFuzz successfully reduced the traffic violations found in the new version of the AV controller software.
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v1 [cs.LG])
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Accelerated Quality-Diversity for Robotics through Massive Parallelism. (arXiv:2202.01258v2 [cs.NE] UPDATED)
    Quality-Diversity (QD) optimization algorithms are a well-known approach to generating large collections of diverse and high-quality solutions. However, derived from evolutionary computation, QD algorithms are population-based methods that are known to be data-inefficient and require large amounts of computational resources. This makes QD algorithms slow when used in applications where solution evaluations are computationally costly. A common approach to speeding up QD algorithms is to evaluate solutions in parallel, for instance by using physical simulators in robotics. Yet, this approach has been limited to several dozen parallel evaluations, as most physics simulators can only be parallelized further by adding more CPUs. With recent advances in simulators that run on accelerators, thousands of evaluations can now be performed in parallel on a single GPU/TPU. In this paper, we present QDax, an accelerated implementation of MAP-Elites which leverages massive parallelism on accelerators to make QD algorithms more accessible. We show that QD algorithms are ideal candidates for taking advantage of progress in hardware acceleration. We demonstrate that QD algorithms can scale with massive parallelism and run at interactive timescales without any significant effect on performance. Results across standard optimization functions and four neuroevolution benchmark environments show that experiment runtimes are reduced by two orders of magnitude, turning days of computation into minutes. More surprisingly, we observe that reducing the number of generations by two orders of magnitude, and thus having significantly shorter lineages, does not impact the performance of QD algorithms. These results show that QD can now benefit from hardware acceleration, which has contributed significantly to the rise of deep learning.
    coVariance Neural Networks. (arXiv:2205.15856v1 [cs.LG])
    Graph neural networks (GNNs) are an effective framework that exploits inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we propose a GNN architecture, called the coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thereby implying an advantage over standard PCA-based data analysis approaches, which are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than that of PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions, a feature that is infeasible for PCA-based approaches.
    Deep Visual Navigation under Partial Observability. (arXiv:2109.07752v3 [cs.RO] UPDATED)
    How can a robot navigate successfully in rich and diverse environments, indoors or outdoors, along office corridors or trails on the grassland, on the flat ground or the staircase? To this end, this work aims to address three challenges: (i) complex visual observations, (ii) partial observability of local visual sensing, and (iii) multimodal robot behaviors conditioned on both the local environment and the global navigation objective. We propose to train a neural network (NN) controller for local navigation via imitation learning. To tackle complex visual observations, we extract multi-scale spatial representations through CNNs. To tackle partial observability, we aggregate multi-scale spatial information over time and encode it in LSTMs. To learn multimodal behaviors, we use a separate memory module for each behavior mode. Importantly, we integrate the multiple neural network modules into a unified controller that achieves robust performance for visual navigation in complex, partially observable environments. We implemented the controller on the quadrupedal Spot robot and evaluated it on three challenging tasks: adversarial pedestrian avoidance, blind-spot obstacle avoidance, and elevator riding. The experiments show that the proposed NN architecture significantly improves navigation performance.
    Rethinking Learning Dynamics in RL using Adversarial Networks. (arXiv:2201.11783v2 [cs.LG] UPDATED)
    We present a learning mechanism for reinforcement learning of closely related skills parameterized via a skill embedding space. Our approach is grounded in the intuition that nothing makes you learn better than a coevolving adversary. The main contribution of our work is to formulate an adversarial training regime for reinforcement learning with the help of an entropy-regularized policy gradient formulation. We also adapt existing measures of causal attribution to draw insights from the skills learned. This facilitates easier re-purposing of skills for adaptation to different environments and tasks. Our experiments demonstrate that the adversarial process leads to better exploration of multiple solutions and an understanding of the minimum number of different skills necessary to solve a given set of tasks.
    Attribution-based Explanations that Provide Recourse Cannot be Robust. (arXiv:2205.15834v1 [stat.ML])
    Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to get actionable options for recourse, which allow an affected user to change the decision $f(x)$ of a machine learning system by making limited changes to its input $x$. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decisions are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input $x$ being explained should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of $x$. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets with multiple attributions. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of $x$, by providing an exact characterization of the functions $f$ to which impossibility applies.
    Knowledge Graph -- Deep Learning: A Case Study in Question Answering in Aviation Safety Domain. (arXiv:2205.15952v1 [cs.CL])
    In the commercial aviation domain, there are a large number of documents, such as accident reports (NTSB, ASRS) and regulatory directives (ADs). There is a need for a system to access these diverse repositories efficiently in order to serve the needs of the aviation industry, such as maintenance, compliance, and safety. In this paper, we propose a Knowledge Graph (KG) guided Deep Learning (DL) based Question Answering (QA) system for aviation safety. We construct a Knowledge Graph from aircraft accident reports and contribute this resource to the community of researchers. The efficacy of this resource is demonstrated by the aforesaid QA system. Natural language queries constructed from the documents mentioned above are converted into SPARQL (the interface language of the RDF graph database) queries and answered. On the DL side, we have two different QA models: (i) BERT QA, which is a pipeline of Passage Retrieval (Sentence-BERT based) and Question Answering (BERT based), and (ii) the recently released GPT-3. We evaluate our system on a set of queries created from the accident reports. Our combined QA system achieves a 9.3% increase in accuracy over GPT-3 and a 40.3% increase over BERT QA. Thus, we infer that KG-DL performs better than either approach alone.
    Consistent Relative Confidence and Label-Free Model Selection for Convolutional Neural Networks. (arXiv:2108.11845v9 [cs.CV] UPDATED)
    In this paper, we are concerned with image classification with deep convolutional neural networks (CNNs). We focus on the following question: given a set of candidate CNN models, how to select the right one with the best generalization property for the current task? Current model selection methods all require access to a batch of labeled data for computing a pre-specified performance metric, such as the cross-entropy loss, the classification error rate and the negative log-likelihood. In many practical cases, labels are not available in time as labeling itself is a time-consuming and expensive task. To this end, we propose an approach to CNN model selection using only unlabeled data. We develop this method based on a principle termed consistent relative confidence. Experimental results on benchmark datasets demonstrate the effectiveness and efficiency of our method.
    Semi-Supervised Cross-Silo Advertising with Partial Knowledge Transfer. (arXiv:2205.15987v1 [cs.LG])
    As an emerging secure learning paradigm in leveraging cross-agency private data, vertical federated learning (VFL) is expected to improve advertising models by enabling the joint learning of complementary user attributes privately owned by the advertiser and the publisher. However, there are two key challenges in applying it to advertising systems: a) the limited scale of labeled overlapping samples, and b) the high cost of real-time cross-agency serving. In this paper, we propose a semi-supervised split distillation framework VFed-SSD to alleviate the two limitations. We identify that: i) there are massive unlabeled overlapped data available in advertising systems, and ii) we can keep a balance between model performance and inference cost by decomposing the federated model. Specifically, we develop a self-supervised task Matched Pair Detection (MPD) to exploit the vertically partitioned unlabeled data and propose the Split Knowledge Distillation (SplitKD) schema to avoid cross-agency serving. Empirical studies on three industrial datasets exhibit the effectiveness of our methods, with the median AUC over all datasets improved by 0.86% and 2.6% in the local deployment mode and the federated deployment mode respectively. Overall, our framework provides an efficient federation-enhanced solution for real-time display advertising with minimal deploying cost and significant performance lift.
    Online Meta-Learning in Adversarial Multi-Armed Bandits. (arXiv:2205.15921v1 [cs.LG])
    We study meta-learning for adversarial multi-armed bandits. We consider the online-within-online setup, in which a player (learner) encounters a sequence of multi-armed bandit episodes. The player's performance is measured as regret against the best arm in each episode, according to the losses generated by an adversary. The difficulty of the problem depends on the empirical distribution of the per-episode best arm chosen by the adversary. We present an algorithm that can leverage the non-uniformity in this empirical distribution, and derive problem-dependent regret bounds. This solution comprises an inner learner that plays each episode separately, and an outer learner that updates the hyper-parameters of the inner algorithm between the episodes. In the case where the best arm distribution is far from uniform, it improves upon the best bound that can be achieved by any online algorithm executed on each episode individually without meta-learning.
    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. (arXiv:2107.07511v4 [cs.LG] UPDATED)
    Black-box machine learning methods are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Distribution-free uncertainty quantification (distribution-free UQ) is a user-friendly paradigm for creating statistically rigorous confidence intervals/sets for such predictions. Critically, the intervals/sets are valid without distributional assumptions or model assumptions, possessing explicit guarantees even with finitely many datapoints. Moreover, they adapt to the difficulty of the input; when the input example is difficult, the uncertainty intervals/sets are large, signaling that the model might be wrong. Without much work and without retraining, one can use distribution-free methods on any underlying algorithm, such as a neural network, to produce confidence sets guaranteed to contain the ground truth with a user-specified probability, such as 90%. Indeed, the methods are easy to understand and general, applying to many modern prediction problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction is aimed at a reader interested in the practical implementation of distribution-free UQ who is not necessarily a statistician. We lead the reader through the practical theory and applications of distribution-free UQ, beginning with conformal prediction and culminating with distribution-free control of any risk, such as the false-discovery rate, the false positive rate of out-of-distribution detection, and so on. We include many explanatory illustrations, examples, and code samples in Python, with PyTorch syntax. The goal is to provide the reader a working understanding of distribution-free UQ, allowing them to put confidence intervals on their algorithms, with one self-contained document.
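    As a taste of the material, here is a minimal split conformal regression sketch following the standard recipe (not code from the tutorial itself); `model` is assumed to be any fitted regressor with a `predict` method.

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal prediction for regression: intervals that contain
    the true label with probability >= 1 - alpha, with no distributional
    assumptions, given a held-out calibration set (X_cal, y_cal)."""
    scores = np.abs(y_cal - model.predict(X_cal))          # conformal scores
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n) # finite-sample correction
    q = np.quantile(scores, q_level, method="higher")
    preds = model.predict(X_test)
    return preds - q, preds + q                            # lower, upper bounds
```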
    Variable importance without impossible data. (arXiv:2205.15750v1 [cs.LG])
    The most popular methods for measuring the importance of the variables in a black box prediction algorithm make use of synthetic inputs that combine predictor variables from multiple subjects. These inputs can be unlikely, physically impossible, or even logically impossible. As a result, the predictions for such cases can be based on data very unlike any the black box was trained on. We think that users cannot trust an explanation of the decision of a prediction algorithm when the explanation uses such values. Instead we advocate a method called Cohort Shapley that is grounded in economic game theory and, unlike most other game-theoretic methods, uses only actually observed data to quantify variable importance. Cohort Shapley works by narrowing the cohort of subjects judged to be similar to a target subject on one or more features. A feature is important if using it to narrow the cohort makes a large difference to the cohort mean. We illustrate it on an algorithmic fairness problem where it is essential to attribute importance to protected variables that the model was not trained on. For every subject and every predictor variable, we can compute the importance of that predictor to the subject's predicted response or to their actual response. These values can be aggregated, for example over all Black subjects, and we propose a Bayesian bootstrap to quantify uncertainty in both individual and aggregate Shapley values.
    Kymatio: Scattering Transforms in Python. (arXiv:1812.11214v3 [cs.LG] UPDATED)
    The wavelet scattering transform is an invariant signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks. All transforms may be executed on a GPU (in addition to CPU), offering a considerable speed-up over CPU implementations. The package also has a small memory footprint, resulting in efficient memory usage. The source code, documentation, and examples are available under a BSD license at https://www.kymat.io/
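    A minimal usage example follows; exact shapes and options may vary across Kymatio versions, so consult the documentation.

```python
import torch
from kymatio.torch import Scattering2D

# 2D scattering transform for 32x32 inputs with 2 scales (J=2).
scattering = Scattering2D(J=2, shape=(32, 32))
x = torch.randn(8, 1, 32, 32)        # batch of single-channel images
if torch.cuda.is_available():        # transforms may also run on GPU
    scattering, x = scattering.cuda(), x.cuda()
Sx = scattering(x)                   # scattering coefficients
print(Sx.shape)                      # e.g. (8, 1, 81, 8, 8) for J=2, L=8
```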
    Hide and Seek: on the Stealthiness of Attacks against Deep Learning Systems. (arXiv:2205.15944v1 [cs.CR])
    With the growing popularity of artificial intelligence and machine learning, a wide spectrum of attacks against deep learning models have been proposed in the literature. Both evasion attacks and poisoning attacks attempt to utilize adversarially altered samples to fool the victim model into misclassifying the adversarial sample. While such attacks claim to be or are expected to be stealthy, i.e., imperceptible to human eyes, such claims are rarely evaluated. In this paper, we present the first large-scale study on the stealthiness of adversarial samples used in attacks against deep learning. We have implemented 20 representative adversarial ML attacks on six popular benchmarking datasets. We evaluate the stealthiness of the attack samples using two complementary approaches: (1) a numerical study that adopts 24 metrics for image similarity or quality assessment; and (2) a user study of 3 sets of questionnaires that collected 20,000+ annotations from 1,000+ responses. Our results show that the majority of the existing attacks introduce non-negligible perturbations that are not stealthy to human eyes. We further analyze the factors that contribute to attack stealthiness. We then examine the correlation between the numerical analysis and the user studies, and demonstrate that some image quality metrics may provide useful guidance in attack designs, while there is still a significant gap between assessed image quality and the visual stealthiness of attacks.
    Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap. (arXiv:2201.04469v6 [stat.ML] UPDATED)
    We consider fixed-budget best arm identification in two-armed bandit problems. One of the longstanding open questions is a tight lower bound on the probability of misidentifying the best arm and a strategy whose upper bound matches the lower bound when the optimal target allocation ratio of arm draws is unknown. We address this problem when the gap between the expected rewards is small. First, we introduce a distribution-dependent lower bound. Then, we propose the "RS-AIPW" strategy, which consists of the random sampling (RS) rule using the estimated optimal target allocation ratio and the recommendation rule using the augmented inverse probability weighting (AIPW) estimator. Our proposed strategy is optimal in the sense that the upper bound achieves the lower bound when the budget goes to infinity and the gap goes to zero. In the course of the analysis, we present a novel large deviation bound for martingales.
    Modeling Interactions of Autonomous Vehicles and Pedestrians with Deep Multi-Agent Reinforcement Learning for Collision Avoidance. (arXiv:2109.15266v3 [cs.RO] UPDATED)
    Reliable pedestrian crash avoidance mitigation (PCAM) systems are crucial components of safe autonomous vehicles (AVs). The nature of the vehicle-pedestrian interaction where decisions of one agent directly affect the other agent's optimal behavior, and vice versa, is a challenging yet often neglected aspect of such systems. We address this issue by modeling a Markov decision process (MDP) for a simulated AV-pedestrian interaction at an unmarked crosswalk. The AV's PCAM decision policy is learned through deep reinforcement learning (DRL). Since modeling pedestrians realistically is challenging, we compare two levels of intelligent pedestrian behavior. While the baseline model follows a predefined strategy, our advanced pedestrian model is defined as a second DRL agent. This model captures continuous learning and the uncertainty inherent in human behavior, making the AV-pedestrian interaction a deep multi-agent reinforcement learning (DMARL) problem. We benchmark the developed PCAM systems according to the collision rate and the resulting traffic flow efficiency with a focus on the influence of observation uncertainty on the decision-making of the agents. The results show that the AV is able to completely mitigate collisions under the majority of the investigated conditions and that the DRL pedestrian model learns an intelligent crossing behavior.
    Strategic Classification with Graph Neural Networks. (arXiv:2205.15765v1 [cs.LG])
    Strategic classification studies learning in settings where users can modify their features to obtain favorable predictions. Most current works focus on simple classifiers that trigger independent user responses. Here we examine the implications of learning with more elaborate models that break the independence assumption. Motivated by the idea that applications of strategic classification are often social in nature, we focus on graph neural networks, which make use of social relations between users to improve predictions. Using a graph for learning introduces inter-user dependencies in prediction; our key point is that strategic users can exploit these to promote their goals. As we show through analysis and simulation, this can work either against the system -- or for it. Based on this, we propose a differentiable framework for strategically-robust learning of graph-based classifiers. Experiments on several real networked datasets demonstrate the utility of our approach.
    Compressed Hierarchical Representations for Multi-Task Learning and Task Clustering. (arXiv:2205.15882v1 [cs.LG])
    In this paper, we frame homogeneous-feature multi-task learning (MTL) as a hierarchical representation learning problem, with one task-agnostic and multiple task-specific latent representations. Drawing inspiration from the information bottleneck principle and assuming an additive independent noise model between the task-agnostic and task-specific latent representations, we limit the information contained in each task-specific representation. It is shown that our resulting representations yield competitive performance for several MTL benchmarks. Furthermore, for certain setups, we show that the trained parameters of the additive noise model are closely related to the similarity of different tasks. This indicates that our approach yields a task-agnostic representation that is disentangled in the sense that its individual dimensions may be interpretable from a task-specific perspective.
    Smoothed Online Learning is as Easy as Statistical Learning. (arXiv:2202.04690v3 [stat.ML] UPDATED)
    Much of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.
    Sample-Efficient, Exploration-Based Policy Optimisation for Routing Problems. (arXiv:2205.15656v1 [cs.LG])
    Model-free deep-reinforcement-learning-based algorithms have been applied to a range of combinatorial optimisation problems (COPs) [Bello et al., 2016; Kool et al., 2018; Nazari et al., 2018]. However, these approaches suffer from two key challenges when applied to combinatorial problems: insufficient exploration and the requirement of many training examples of the search space to achieve reasonable performance. Combinatorial optimisation can be complex, characterised by large search spaces with many optima that are costly to search and learn over. Therefore, new methods are needed that find good solutions more efficiently by being more sample efficient. This paper presents a new reinforcement learning approach that is based on entropy. In addition, we design an off-policy reinforcement learning technique that maximises the expected return and improves sample efficiency to achieve faster learning during training. We systematically evaluate our approach on a range of route optimisation tasks typically used to evaluate learning-based optimisation, such as the Travelling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP). In this paper, we show that our model can generalise to various route problems, such as the split-delivery VRP (SDVRP), and compare the performance of our method with that of current state-of-the-art approaches. The empirical results show that the proposed method can improve on state-of-the-art methods in terms of solution quality and computation time, and generalises to problems of different sizes.
    Unbalanced CO-Optimal Transport. (arXiv:2205.14923v2 [stat.ML] UPDATED)
    Optimal transport (OT) compares probability distributions by computing a meaningful alignment between their samples. CO-optimal transport (COOT) takes this comparison further by inferring an alignment between features as well. While this approach leads to better alignments and generalizes both OT and Gromov-Wasserstein distances, we provide a theoretical result showing that it is sensitive to outliers that are omnipresent in real-world data. This prompts us to propose unbalanced COOT for which we provably show its robustness to noise in the compared datasets. To the best of our knowledge, this is the first such result for OT methods in incomparable spaces. With this result in hand, we provide empirical evidence of this robustness for the challenging tasks of heterogeneous domain adaptation with and without varying proportions of classes and simultaneous alignment of samples and features across single-cell measurements.
    Differentially Private Covariance Revisited. (arXiv:2205.14324v2 [cs.CR] UPDATED)
    In this paper, we present three new error bounds, in terms of the Frobenius norm, for covariance estimation under differential privacy: (1) a worst-case bound of $\tilde{O}(d^{1/4}/\sqrt{n})$, which improves the standard Gaussian mechanism $\tilde{O}(d/n)$ for the regime $d>\widetilde{\Omega}(n^{2/3})$; (2) a trace-sensitive bound that improves the state of the art by a $\sqrt{d}$-factor, and (3) a tail-sensitive bound that gives a more instance-specific result. The corresponding algorithms are also simple and efficient. Experimental results show that they offer significant improvements over prior work.
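    For reference, the standard Gaussian mechanism that these bounds are compared against can be sketched as follows; this is the baseline, not the paper's improved algorithms, and it assumes each data row has L2 norm at most 1.

```python
import numpy as np

def dp_covariance_gaussian(X, eps, delta, rng=None):
    """Baseline Gaussian mechanism for covariance estimation under
    (eps, delta)-differential privacy. Assumes each row of X has
    L2 norm <= 1, so replacing one row changes X^T X / n by at most
    2/n in Frobenius norm."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    cov = X.T @ X / n
    sens = 2.0 / n                                        # L2 sensitivity
    sigma = sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    noise = rng.normal(0.0, sigma, size=(d, d))
    noise = (noise + noise.T) / np.sqrt(2.0)              # symmetrize (diagonal
    return cov + noise                                    # variance slightly inflated)
```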
    Mixture GAN For Modulation Classification Resiliency Against Adversarial Attacks. (arXiv:2205.15743v1 [cs.LG])
    Automatic modulation classification (AMC) using the Deep Neural Network (DNN) approach outperforms traditional classification techniques, even in the presence of challenging wireless channel environments. However, adversarial attacks cause a loss of accuracy for DNN-based AMC by injecting a well-designed perturbation into the wireless channels. In this paper, we propose a novel generative adversarial network (GAN)-based countermeasure to safeguard DNN-based AMC systems against adversarial attack examples. The GAN-based defense aims to eliminate adversarial attack examples before they are fed to the DNN-based classifier. Specifically, we show the resiliency of our proposed defense GAN against the Fast Gradient Sign Method (FGSM), one of the most potent attack algorithms for crafting perturbed signals. The existing defense-GAN was designed for image classification and does not work in our case, where the above-mentioned communication system is considered. Thus, our proposed countermeasure deploys GANs with a mixture of generators to overcome the mode collapse problem that a typical GAN faces in the radio signal classification problem. Simulation results show the effectiveness of our proposed defense GAN, which can enhance the accuracy of DNN-based AMC under adversarial attacks to approximately 81%.
    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. (arXiv:2205.15868v1 [cs.CV])
    Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model's understanding of complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
    Robust Anytime Learning of Markov Decision Processes. (arXiv:2205.15827v1 [cs.AI])
    Markov decision processes (MDPs) are formal models commonly used in sequential decision-making. MDPs capture the stochasticity that may arise, for instance, from imprecise actuators via probabilities in the transition function. However, in data-driven applications, deriving precise probabilities from (limited) data introduces statistical errors that may lead to unexpected or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions, accounting for such limited data. Tools from the formal verification community efficiently compute robust policies that provably adhere to formal specifications, like safety constraints, under the worst-case instance in the uncertainty set. We continuously learn the transition probabilities of an MDP in a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies. In particular, our method (1) approximates probabilities as intervals, (2) adapts to new data that may be inconsistent with an intermediate model, and (3) may be stopped at any time to compute a robust policy on the uMDP that faithfully captures the data so far. We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
    Transformers for Multi-Object Tracking on Point Clouds. (arXiv:2205.15730v1 [cs.CV])
    We present TransMOT, a novel transformer-based end-to-end trainable online tracker and detector for point cloud data. The model utilizes a cross- and a self-attention mechanism and is applicable to lidar data in an automotive context, as well as other data types, such as radar. Both track management and the detection of new tracks are performed by the same transformer decoder module and the tracker state is encoded in feature space. With this approach, we make use of the rich latent space of the detector for tracking rather than relying on low-dimensional bounding boxes. Still, we are able to retain some of the desirable properties of traditional Kalman-filter based approaches, such as an ability to handle sensor input at arbitrary timesteps or to compensate frame skips. This is possible due to a novel module that transforms the track information from one frame to the next on feature-level and thereby fulfills a similar task as the prediction step of a Kalman filter. Results are presented on the challenging real-world dataset nuScenes, where the proposed model outperforms its Kalman filter-based tracking baseline.
    Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity. (arXiv:2205.15809v1 [stat.ML])
    We study the loss surface of DNNs with $L_{2}$ regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_{\ell}$ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation $Z_{\ell}$ is optimal w.r.t. an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as necessary to construct the activation of the next layer. For positively homogeneous non-linearities, the loss can be further reformulated in terms of the covariances of the hidden representations, which takes the form of a partially convex optimization over a convex cone. This second reformulation allows us to prove a sparsity result for homogeneous DNNs: any local minimum of the $L_{2}$-regularized loss can be achieved with at most $N(N+1)$ neurons in each hidden layer (where $N$ is the size of the training set). We show that this bound is tight by giving an example of a local minimum which requires $N^{2}/4$ hidden neurons. But we also observe numerically that in more traditional settings far fewer than $N^{2}$ neurons are required to reach the minima.
    Variational inference via Wasserstein gradient flows. (arXiv:2205.15902v1 [stat.ML])
    Along with Markov chain Monte Carlo (MCMC) methods, variational inference (VI) has emerged as a central computational approach to large-scale Bayesian inference. Rather than sampling from the true posterior $\pi$, VI aims at producing a simple but effective approximation $\hat \pi$ to $\pi$ for which summary statistics are easy to compute. However, unlike the well-studied MCMC methodology, VI is still poorly understood and dominated by heuristics. In this work, we propose principled methods for VI, in which $\hat \pi$ is taken to be a Gaussian or a mixture of Gaussians, which rest upon the theory of gradient flows on the Bures-Wasserstein space of Gaussian measures. Akin to MCMC, it comes with strong theoretical guarantees when $\pi$ is log-concave.
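    A sketch of one plausible stochastic Bures-Wasserstein gradient step for Gaussian VI targeting $\pi \propto e^{-V}$ follows, assuming access to the gradient and Hessian of $V$; this is a simplified rendering, not necessarily the paper's exact scheme.

```python
import numpy as np

def bures_wasserstein_vi_step(m, S, grad_V, hess_V, h, rng):
    """One stochastic gradient step on the Bures-Wasserstein space for
    Gaussian VI: update the mean m and covariance S of the approximating
    Gaussian toward the target pi ~ exp(-V). Illustrative sketch."""
    d = m.shape[0]
    x = rng.multivariate_normal(m, S)               # sample from current Gaussian
    m_new = m - h * grad_V(x)                       # mean update
    M = np.eye(d) - h * (hess_V(x) - np.linalg.inv(S))
    S_new = M @ S @ M                               # covariance update stays PSD
    return m_new, S_new
```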
    Omni-Granular Ego-Semantic Propagation for Self-Supervised Graph Representation Learning. (arXiv:2205.15746v1 [cs.LG])
    Unsupervised/self-supervised graph representation learning is critical for downstream node- and graph-level classification tasks. Global structure of graphs helps discriminating representations and existing methods mainly utilize the global structure by imposing additional supervisions. However, their global semantics are usually invariant for all nodes/graphs and they fail to explicitly embed the global semantics to enrich the representations. In this paper, we propose Omni-Granular Ego-Semantic Propagation for Self-Supervised Graph Representation Learning (OEPG). Specifically, we introduce instance-adaptive global-aware ego-semantic descriptors, leveraging the first- and second-order feature differences between each node/graph and hierarchical global clusters of the entire graph dataset. The descriptors can be explicitly integrated into local graph convolution as new neighbor nodes. Besides, we design an omni-granular normalization on the whole scales and hierarchies of the ego-semantic to assign attentional weight to each descriptor from an omni-granular perspective. Specialized pretext tasks and cross-iteration momentum update are further developed for local-global mutual adaptation. In downstream tasks, OEPG consistently achieves the best performance with a 2%~6% accuracy gain on multiple datasets cross scales and domains. Notably, OEPG also generalizes to quantity- and topology-imbalance scenarios.
    Learning to branch with Tree MDPs. (arXiv:2205.11107v2 [cs.LG] UPDATED)
    State-of-the-art Mixed Integer Linear Program (MILP) solvers combine systematic tree search with a plethora of hard-coded heuristics, such as the branching rule. The idea of learning branching rules from data has received increasing attention recently, and promising results have been obtained by learning fast approximations of the strong branching expert. In this work, we instead propose to learn branching rules from scratch via Reinforcement Learning (RL). We revisit the work of Etheve et al. (2020) and propose tree Markov Decision Processes, or tree MDPs, a generalization of temporal MDPs that provides a more suitable framework for learning to branch. We derive a tree policy gradient theorem, which exhibits a better credit assignment compared to its temporal counterpart. We demonstrate through computational experiments that tree MDPs improve the learning convergence, and offer a promising framework for tackling the learning-to-branch problem in MILPs.
    Augmentations: An Insight into their Effectiveness on Convolution Neural Networks. (arXiv:2205.04064v2 [cs.LG] UPDATED)
    Augmentations are a key factor in determining the performance of any neural network, as they provide a model with a critical edge in boosting its performance. Their ability to boost a model's robustness depends on two factors: the model architecture and the type of augmentations. Augmentations are very specific to a dataset, and not all kinds of augmentation will necessarily produce a positive effect on a model's performance. Hence there is a need to identify augmentations that perform consistently well across a variety of datasets and also remain invariant to the type of architecture, convolutions, and the number of parameters used. This paper evaluates the effect of parameters using 3x3 and depth-wise separable convolutions on different augmentation techniques on the MNIST, FMNIST, and CIFAR10 datasets. Statistical evidence shows that techniques such as Cutout and random horizontal flip were consistent on both parametrically low and high architectures. Depth-wise separable convolutions outperformed 3x3 convolutions at higher parameter counts due to their ability to create deeper networks. Augmentations resulted in bridging the accuracy gap between the 3x3 and depth-wise separable convolutions, thus establishing their role in model generalization. At higher parameter counts, augmentations did not produce a significant change in performance. The synergistic effect of multiple augmentations at higher parameter counts, with an antagonistic effect at lower parameter counts, was also evaluated. The work shows that a delicate balance between architectural superiority and augmentations needs to be achieved to enhance a model's performance in any given deep learning task.
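    For concreteness, the two consistently helpful augmentations can be expressed with torchvision as below; `RandomErasing` is used as the closest built-in stand-in for Cutout, and the normalization statistics are the commonly used CIFAR10 values (assumptions, not taken from the paper).

```python
import torchvision.transforms as T

# Training pipeline with the two consistently helpful augmentations:
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # random horizontal flip
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465),          # CIFAR10 channel means
                (0.2470, 0.2435, 0.2616)),         # CIFAR10 channel stds
    T.RandomErasing(p=0.5, scale=(0.02, 0.1)),     # Cutout-style occlusion
])
```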
    One Loss for Quantization: Deep Hashing with Discrete Wasserstein Distributional Matching. (arXiv:2205.15721v1 [cs.CV])
    Image hashing is a principled approximate nearest neighbor approach to finding items similar to a query in a large collection of images. Hashing aims to learn a binary-output function that maps an image to a binary vector. For optimal retrieval performance, it is important to produce balanced hash codes with low quantization error, bridging the gap between the learning stage's continuous relaxation and the inference stage's discrete quantization. However, in existing deep supervised hashing methods, coding balance and low quantization error are difficult to achieve and involve several losses. We argue that this is because the quantization approaches in these methods are heuristically constructed and not effective at achieving these objectives. This paper considers an alternative approach to learning the quantization constraints. The task of learning balanced codes with low quantization error is re-formulated as matching the learned distribution of the continuous codes to a pre-defined discrete, uniform distribution. This is equivalent to minimizing the distance between two distributions. We then propose a computationally efficient distributional distance by leveraging the discrete property of the hash functions. This distributional distance is a valid distance and enjoys lower time and sample complexities. The proposed single-loss quantization objective can be integrated into any existing supervised hashing method to improve code balance and quantization error. Experiments confirm that the proposed approach substantially improves the performance of several representative hashing methods.
    Lessons Learned from Data-Driven Building Control Experiments: Contrasting Gaussian Process-based MPC, Bilevel DeePC, and Deep Reinforcement Learning. (arXiv:2205.15703v1 [eess.SY])
    This manuscript offers the perspective of experimentalists on a number of modern data-driven techniques: model predictive control relying on Gaussian processes, adaptive data-driven control based on behavioral theory, and deep reinforcement learning. These techniques are compared in terms of data requirements, ease of use, computational burden, and robustness in the context of real-world applications. Our remarks and observations stem from a number of experimental investigations carried out in the field of building control in diverse environments, from lecture halls and apartment spaces to a hospital surgery center. The final goal is to support others in identifying what technique is best suited to tackle their own problems.
    Template based Graph Neural Network with Optimal Transport Distances. (arXiv:2205.15733v1 [cs.LG])
    Current Graph Neural Network (GNN) architectures generally rely on two important components: node feature embedding through message passing, and aggregation with a specialized form of pooling. The structural (or topological) information is implicitly taken into account in these two steps. We propose in this work a novel point of view, which places distances to some learnable graph templates at the core of the graph representation. This distance embedding is constructed thanks to an optimal transport distance: the Fused Gromov-Wasserstein (FGW) distance, which encodes feature and structure dissimilarities simultaneously by solving a soft graph-matching problem. We postulate that the vector of FGW distances to a set of template graphs has strong discriminative power, and it is then fed to a non-linear classifier for final predictions. Distance embedding can be seen as a new layer, and can leverage existing message passing techniques to promote sensible feature representations. Interestingly, in our work the optimal set of template graphs is also learnt in an end-to-end fashion by differentiating through this layer. After describing the corresponding learning procedure, we empirically validate our claim on several synthetic and real-life graph classification datasets, where our method is competitive with or surpasses kernel and GNN state-of-the-art approaches. We complete our experiments with an ablation study and a sensitivity analysis to parameters.
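    As a concrete handle on the FGW distance at the core of this layer, here is a small example assuming the POT (Python Optimal Transport) library, whose `fused_gromov_wasserstein2` returns the FGW cost between two attributed graphs given a cross-graph feature-distance matrix and intra-graph structure matrices. The learnable templates and the end-to-end differentiation are not shown.

    ```python
    import numpy as np
    import ot  # POT: Python Optimal Transport (pip install pot)

    rng = np.random.default_rng(0)

    # Two toy attributed graphs: node features X*, structure matrices C*
    X1, X2 = rng.random((5, 3)), rng.random((4, 3))
    C1 = rng.random((5, 5)); C1 = (C1 + C1.T) / 2; np.fill_diagonal(C1, 0)
    C2 = rng.random((4, 4)); C2 = (C2 + C2.T) / 2; np.fill_diagonal(C2, 0)

    M = ot.dist(X1, X2)              # cross-graph pairwise feature distances
    p, q = ot.unif(5), ot.unif(4)    # uniform node weights

    # alpha trades off feature cost (Wasserstein part) vs. structure cost (GW part)
    fgw_cost = ot.gromov.fused_gromov_wasserstein2(M, C1, C2, p, q,
                                                   loss_fun='square_loss', alpha=0.5)
    print(fgw_cost)
    ```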
    Adversarial synthesis based data-augmentation for code-switched spoken language identification. (arXiv:2205.15747v1 [eess.AS])
    Spoken Language Identification (LID) is an important sub-task of Automatic Speech Recognition (ASR) that is used to classify the language(s) in an audio segment. Automatic LID plays a useful role in multilingual countries. In various countries, identifying a language becomes hard due to the multilingual scenario where two or more languages are mixed together during conversation. This phenomenon of speech is called code-mixing or code-switching, and it occurs not only in India but also in many other Asian countries. Such code-mixed data is hard to find, which further reduces the capabilities of spoken LID; due to this lack of availability, code-mixed speech becomes a minority class in the LID task. Hence, this work primarily addresses this problem using data augmentation on the minority code-switched class. The study focuses on Indic languages code-mixed with English; spoken LID is performed on Hindi, code-mixed with English. This research proposes a Generative Adversarial Network (GAN) based data augmentation technique performed on Mel spectrograms of the audio data. GANs have already been proven to be accurate in representing the real data distribution in the image domain, and the proposed research exploits these capabilities of GANs in speech domains such as speech classification and automatic speech recognition. GANs are trained to generate Mel spectrograms of the minority code-mixed class, which are then used to augment data for the classifier. Utilizing GANs gives an overall improvement in Unweighted Average Recall of 3.5\% compared to a Convolutional Recurrent Neural Network (CRNN) classifier used as the baseline reference.
    Implicitly Regularized RL with Implicit Q-Values. (arXiv:2108.07041v2 [cs.LG] UPDATED)
    The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms for which RL agents behave following a (soft)-greedy policy w.r.t. $Q$. It is a powerful tool that allows action selection without a model of the environment and even without explicitly modeling the policy. Yet, this scheme can only be used in discrete action tasks with small numbers of actions, as the softmax cannot be computed exactly otherwise. In particular, the use of function approximation to deal with continuous action spaces in modern actor-critic architectures intrinsically prevents the exact computation of a softmax. We propose to alleviate this issue by parametrizing the $Q$-function implicitly, as the sum of a log-policy and a value function. We use the resulting parametrization to derive a practical off-policy deep RL algorithm, suitable for large action spaces, that enforces the softmax relation between the policy and the $Q$-value. We provide a theoretical analysis of our algorithm: from an Approximate Dynamic Programming perspective, we show its equivalence to a regularized version of value iteration, accounting for both entropy and Kullback-Leibler regularization, and enjoying beneficial error propagation results. We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods.
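    The implicit parametrization described here is straightforward to sketch in PyTorch; the scaling coefficient `alpha` and the network sizes below are illustrative assumptions, not the paper's exact design.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ImplicitQ(nn.Module):
        """Q(s, a) = alpha * log pi(a|s) + V(s): the softmax relation between
        the policy and the Q-values holds by construction instead of being
        computed explicitly."""
        def __init__(self, obs_dim, n_actions, hidden=64, alpha=1.0):
            super().__init__()
            self.alpha = alpha
            self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
            self.policy_head = nn.Linear(hidden, n_actions)  # logits of pi
            self.value_head = nn.Linear(hidden, 1)           # V(s)

        def forward(self, obs):
            h = self.trunk(obs)
            log_pi = F.log_softmax(self.policy_head(h), dim=-1)
            v = self.value_head(h)
            q = self.alpha * log_pi + v  # implicit Q-values, (batch, n_actions)
            return q, log_pi, v
    ```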
    Communication-Efficient Distributionally Robust Decentralized Learning. (arXiv:2205.15614v1 [cs.LG])
    Decentralized learning algorithms empower interconnected edge devices to share data and computational resources to collaboratively train a machine learning model without the aid of a central coordinator (e.g. an orchestrating basestation). In the case of heterogeneous data distributions at the network devices, collaboration can yield predictors with unsatisfactory performance for a subset of the devices. For this reason, in this work we consider the formulation of a distributionally robust decentralized learning task and we propose a decentralized single-loop gradient descent/ascent algorithm (AD-GDA) to solve the underlying minimax optimization problem. We render our algorithm communication-efficient by employing a compressed consensus scheme and we provide convergence guarantees for smooth convex and non-convex loss functions. Finally, we corroborate the theoretical findings with empirical evidence of the ability of the proposed algorithm to provide unbiased predictors over a network of collaborating devices with highly heterogeneous data distributions.
    Snapture -- A Novel Neural Architecture for Combined Static and Dynamic Hand Gesture Recognition. (arXiv:2205.15862v1 [cs.CV])
    As robots are expected to get more involved in people's everyday lives, frameworks that enable intuitive user interfaces are in demand. Hand gesture recognition systems provide a natural way of communication and, thus, are an integral part of seamless Human-Robot Interaction (HRI). Recent years have witnessed an immense evolution of computational models powered by deep learning. However, state-of-the-art models fall short in expanding across different gesture domains, such as emblems and co-speech. In this paper, we propose a novel hybrid hand gesture recognition system. Our architecture enables learning both static and dynamic gestures: by capturing a so-called "snapshot" of the gesture performance at its peak, we integrate the hand pose along with the dynamic movement. Moreover, we present a method for analyzing the motion profile of a gesture to uncover its dynamic characteristics, which allows regulating the static channel based on the amount of motion. Our evaluation demonstrates the superiority of our approach over a CNNLSTM baseline on two gesture benchmarks. We also provide a per-gesture-class analysis that unveils the potential of our Snapture architecture for performance improvements. Thanks to its modular implementation, our framework allows the integration of other multimodal data, such as facial expressions and head tracking, which are important cues in HRI scenarios, into one architecture. Thus, our work contributes both to gesture recognition research and to machine learning applications for non-verbal communication with robots.
    A Meta Reinforcement Learning Approach for Predictive Autoscaling in the Cloud. (arXiv:2205.15795v1 [cs.LG])
    Predictive autoscaling (autoscaling with workload forecasting) is an important mechanism that supports autonomous adjustment of computing resources in accordance with fluctuating workload demands in the Cloud. In recent works, Reinforcement Learning (RL) has been introduced as a promising approach to learn resource management policies that guide scaling actions in the dynamic and uncertain cloud environment. However, RL methods face several challenges in steering predictive autoscaling, such as a lack of accuracy in decision-making, inefficient sampling, and significant variability in workload patterns that may cause policies to fail at test time. To this end, we propose an end-to-end predictive meta model-based RL algorithm, aiming to optimally allocate resources to maintain a stable CPU utilization level, which incorporates a specially-designed deep periodic workload prediction model as the input and embeds a Neural Process to guide the learning of the optimal scaling actions over numerous application services in the Cloud. Our algorithm not only ensures the predictability and accuracy of the scaling strategy, but also enables the scaling decisions to adapt to changing workloads with high sample efficiency. Our method has achieved significant performance improvements compared to existing algorithms and has been deployed online at Alipay, supporting the autoscaling of applications for the world-leading payment platform.
    Continuous Temporal Graph Networks for Event-Based Graph Data. (arXiv:2205.15924v1 [cs.LG])
    There has been increasing interest in modeling the continuous-time dynamics of temporal graph data. Previous methods encode time-evolving relational information into a low-dimensional representation by specifying discrete layers of neural networks, while real-world dynamic graphs often vary continuously over time. Hence, we propose Continuous Temporal Graph Networks (CTGNs) to capture the continuous dynamics of temporal graph data. We use both the link starting timestamps and link durations as evolving information to model the continuous dynamics of nodes. The key idea is to use neural ordinary differential equations (ODEs) to characterize the continuous dynamics of node representations over dynamic graphs. We parameterize the ordinary differential equations using a novel graph neural network. Existing dynamic graph networks can be considered a specific discretization of CTGNs. Experimental results on both transductive and inductive tasks demonstrate the effectiveness of our proposed approach over competitive baselines.
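    A minimal sketch of the key idea follows, assuming the third-party torchdiffeq package for the ODE solver; the simple linear message-passing dynamics below stand in for the paper's actual GNN parameterization.

    ```python
    import torch
    import torch.nn as nn
    from torchdiffeq import odeint  # pip install torchdiffeq

    class GraphODEFunc(nn.Module):
        """dh/dt = tanh(A_norm @ h @ W): a minimal GNN-parameterized dynamics."""
        def __init__(self, adj_norm, dim):
            super().__init__()
            self.adj_norm = adj_norm                # (n, n) normalized adjacency
            self.lin = nn.Linear(dim, dim, bias=False)

        def forward(self, t, h):                    # h: (n, dim) node states at time t
            return torch.tanh(self.adj_norm @ self.lin(h))

    n, dim = 10, 16
    adj = torch.rand(n, n)
    adj_norm = adj / adj.sum(1, keepdim=True)       # row-normalize
    func = GraphODEFunc(adj_norm, dim)
    h0 = torch.randn(n, dim)                        # initial node embeddings
    t = torch.tensor([0.0, 0.5, 1.0])               # event/link timestamps
    h_t = odeint(func, h0, t)                       # (len(t), n, dim) continuous states
    ```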
    FedHarmony: Unlearning Scanner Bias with Distributed Data. (arXiv:2205.15970v1 [cs.LG])
    The ability to combine data across scanners and studies is vital for neuroimaging, to increase both statistical power and the representation of biological variability. However, combining datasets across sites leads to two challenges: first, an increase in undesirable non-biological variance due to scanner and acquisition differences - the harmonisation problem - and second, data privacy concerns due to the inherently personal nature of medical imaging data, meaning that sharing them across sites may risk violation of privacy laws. To overcome these restrictions, we propose FedHarmony: a harmonisation framework operating in the federated learning paradigm. We show that to remove the scanner-specific effects, we only need to share the mean and standard deviation of the learned features, helping to protect individual subjects' privacy. We demonstrate our approach across a range of realistic data scenarios, using real multi-site data from the ABIDE dataset, thus showing the potential utility of our method for MRI harmonisation across studies. Our code is available at https://github.com/nkdinsdale/FedHarmony.
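    The statistic-sharing idea is simple to sketch. Below, each site shares only the per-feature mean and standard deviation of its learned features and aligns them to a shared reference; the specific alignment scheme is an illustrative assumption, not FedHarmony's exact procedure.

    ```python
    import numpy as np

    def site_stats(features: np.ndarray):
        """Only these two vectors leave the site, never subject-level data."""
        return features.mean(axis=0), features.std(axis=0) + 1e-8

    def harmonize(features, local_stats, ref_stats):
        """Standardize with local stats, then re-scale to the reference stats."""
        mu_l, sd_l = local_stats
        mu_r, sd_r = ref_stats
        return (features - mu_l) / sd_l * sd_r + mu_r

    # Each of K sites computes stats locally; the server averages them as reference.
    site_features = [np.random.randn(100, 32) * (i + 1) for i in range(3)]
    stats = [site_stats(f) for f in site_features]
    ref = (np.mean([m for m, _ in stats], axis=0),
           np.mean([s for _, s in stats], axis=0))
    aligned = [harmonize(f, st, ref) for f, st in zip(site_features, stats)]
    ```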
    Intrinsic Dimension Estimation Using Wasserstein Distances. (arXiv:2106.04018v2 [stat.ML] UPDATED)
    It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.
    SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections. (arXiv:2205.15768v1 [cs.CV])
    Inverse rendering of an object under entirely unknown capture conditions is a fundamental challenge in computer vision and graphics. Neural approaches such as NeRF have achieved photorealistic results on novel view synthesis, but they require known camera poses. Solving this problem with unknown camera poses is highly challenging as it requires joint optimization over shape, radiance, and pose. This problem is exacerbated when the input images are captured in the wild with varying backgrounds and illuminations. Standard pose estimation techniques fail in such image collections in the wild due to very few estimated correspondences across images. Furthermore, NeRF cannot relight a scene under any illumination, as it operates on radiance (the product of reflectance and illumination). We propose a joint optimization framework to estimate the shape, BRDF, and per-image camera pose and illumination. Our method works on in-the-wild online image collections of an object and produces relightable 3D assets for several use-cases such as AR/VR. To our knowledge, our method is the first to tackle this severely unconstrained task with minimal user interaction. Project page: https://markboss.me/publication/2022-samurai/ Video: https://youtu.be/LlYuGDjXp-8
    Few-Shot Unlearning by Model Inversion. (arXiv:2205.15567v1 [cs.LG])
    We consider the problem of machine unlearning: erasing a target dataset, which causes an unwanted behavior, from a trained model when the training dataset is not given. Previous works have assumed that the target dataset indicates all the training data imposing the unwanted behavior. However, it is often infeasible to obtain such a complete indication. We hence address the practical scenario of unlearning when only a few samples of the target data are provided, so-called few-shot unlearning. To this end, we devise a straightforward framework, including a new model inversion technique to retrieve the training data from the model, followed by filtering out samples similar to the target samples and then relearning. We demonstrate that our method, using only a subset of target data, can outperform the state-of-the-art methods with a full indication of target data.
    Multi-Agent Learning of Numerical Methods for Hyperbolic PDEs with Factored Dec-MDP. (arXiv:2205.15716v1 [cs.LG])
    Factored decentralized Markov decision process (Dec-MDP) is a framework for modeling sequential decision making problems in multi-agent systems. In this paper, we formalize the learning of numerical methods for hyperbolic partial differential equations (PDEs), specifically the Weighted Essentially Non-Oscillatory (WENO) scheme, as a factored Dec-MDP problem. We show that different reward formulations lead to either reinforcement learning (RL) or behavior cloning, and a homogeneous policy could be learned for all agents under the RL formulation with a policy gradient algorithm. Because the trained agents only act on their local observations, the multi-agent system can be used as a general numerical method for hyperbolic PDEs and generalize to different spatial discretizations, episode lengths, dimensions, and even equation types.
    What Knowledge Gets Distilled in Knowledge Distillation?. (arXiv:2205.16004v1 [cs.CV])
    Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Do its data invariance properties become similar? Our work presents a comprehensive study to try to answer these questions and more. Our results, using image classification as a case study and three state-of-the-art knowledge distillation techniques, show that knowledge distillation methods can indeed indirectly distill other kinds of properties beyond improving task performance. By exploring these questions, we hope our work provides a clearer picture of what happens during knowledge distillation.
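    For readers who want the mechanics of the process under study, the canonical response-based distillation loss (Hinton-style soft targets) looks like the sketch below; this is background, not one of the three techniques examined in the paper.

    ```python
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        """Soft teacher targets at temperature T, mixed with the hard-label loss."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)                    # rescale so gradients match the hard loss
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard
    ```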
    Hollywood Identity Bias Dataset: A Context Oriented Bias Analysis of Movie Dialogues. (arXiv:2205.15951v1 [cs.CL])
    Movies reflect society and also hold the power to transform opinions. Social biases and stereotypes present in movies can cause extensive damage due to their reach. These biases are not always demanded by the storyline; they can creep in as the author's own bias. Movie production houses would prefer to ascertain that any bias present in a script is the story's demand. Today, when deep learning models can give human-level accuracy in multiple tasks, an AI solution that identifies the biases present in a script at the writing stage can help them avoid the inconvenience of stalled releases, lawsuits, etc. Since AI solutions are data intensive and no domain-specific data exists to address the problem of biases in scripts, we introduce a new dataset of movie scripts that are annotated for identity bias. The dataset contains dialogue turns annotated with (i) bias labels for seven categories, viz., gender, race/ethnicity, religion, age, occupation, LGBTQ, and other (which covers biases like body shaming, personality bias, etc.), (ii) labels for sensitivity, stereotype, sentiment, emotion, and emotion intensity, (iii) all labels annotated with context awareness, (iv) target groups and reasons for bias labels, and (v) an expert-driven group-validation process for high-quality annotations. We also report various baseline performances for bias identification and category detection on our dataset.
    On the potential of sequential and non-sequential regression models for Sentinel-1-based biomass prediction in Tanzanian miombo forests. (arXiv:2106.15020v2 [cs.LG] UPDATED)
    This study derives regression models for above-ground biomass (AGB) estimation in miombo woodlands of Tanzania that utilise the high availability and low cost of Sentinel-1 data. The limited forest canopy penetration of C-band SAR sensors, along with the sparseness of available ground truth, restricts their usefulness in traditional AGB regression models. Therefore, we propose to use AGB predictions based on airborne laser scanning (ALS) data as a surrogate response variable for SAR data. This dramatically increases the available training data and opens the door to flexible regression models that capture fine-scale AGB dynamics. This becomes a sequential modelling approach, where the first regression stage has linked in situ data to ALS data and produced the AGB prediction map; we perform the subsequent stage, where this map is related to Sentinel-1 data. We develop a traditional, parametric regression model and alternative non-parametric models for this stage. The latter uses a conditional generative adversarial network (cGAN) to translate Sentinel-1 images into ALS-based AGB prediction maps. The convolution filters in the neural networks make them contextual. We compare the sequential models to traditional, non-sequential regression models, all trained on limited AGB ground reference data. Results show that our newly proposed non-sequential Sentinel-1-based regression model performs better quantitatively than the sequential models, but achieves less sensitivity to fine-scale AGB dynamics. The contextual cGAN-based sequential models best reproduce the distribution of ALS-based AGB predictions. They also reach a lower RMSE against in situ AGB data than the parametric sequential model, indicating a potential for further development.
    Simulated Adversarial Testing of Face Recognition Models. (arXiv:2106.04569v3 [cs.CV] UPDATED)
    Most machine learning models are validated and tested on fixed datasets. This can give an incomplete picture of the capabilities and weaknesses of the model. Such weaknesses can be revealed at test time in the real world, and the risks involved in such failures include loss of profits, loss of time, or even loss of life in certain critical applications. To alleviate this issue, simulators can be controlled in a fine-grained manner using interpretable parameters to explore the semantic image manifold. In this work, we propose a framework for learning how to test machine learning algorithms using simulators in an adversarial manner, in order to find weaknesses in the model before deploying it in critical scenarios. We apply this method in a face recognition setup. We show that certain weaknesses of models trained on real data can be discovered using simulated samples. Using our proposed method, we can find adversarial synthetic faces that fool contemporary face recognition models. This demonstrates that these models have weaknesses that are not measured by commonly used validation datasets. We hypothesize that such adversarial examples are not isolated but usually lie in connected regions in the latent space of the simulator. We present a method to find these adversarial regions, as opposed to the typical adversarial points found in the adversarial example literature.
    DeepDefacer: Automatic Removal of Facial Features via U-Net Image Segmentation. (arXiv:2205.15536v1 [cs.CV])
    Recent advancements in the field of magnetic resonance imaging (MRI) have enabled large-scale collaboration among clinicians and researchers for neuroimaging tasks. However, researchers are often forced to use outdated and slow software to anonymize MRI images for publication. These programs perform expensive mathematical operations over 3D images that rapidly slow down anonymization as an image's volume increases in size. In this paper, we introduce DeepDefacer, an application of deep learning to MRI anonymization that uses a streamlined 3D U-Net network to mask facial regions in MRI images with a significant increase in speed over traditional de-identification software. We train DeepDefacer on MRI images from the Brain Development Organization (IXI) and the International Consortium for Brain Mapping (ICBM) and quantitatively evaluate our model against a baseline 3D U-Net model in terms of Dice, recall, and precision scores. We also evaluate DeepDefacer against Pydeface, a traditional defacing application, in terms of speed on a range of CPU and GPU devices, and qualitatively compare our model's defaced output with the ground truth images produced by Pydeface. We provide a link to a PyPi package at the end of this manuscript to encourage further research into the application of deep learning to MRI anonymization.
    Granular Generalized Variable Precision Rough Sets and Rational Approximations. (arXiv:2205.14365v2 [cs.AI] UPDATED)
    Rational approximations are introduced and studied in granular graded sets and generalizations thereof by the first author in recent research papers. The concept of rationality is determined by related ontologies and by the coherence between granularity, the parthood perspective, and the approximations used in the context. In addition, a framework is introduced by her in the mentioned paper(s). Granular approximations constructed as per the procedures of VPRS are likely to be more rational than those constructed from a classical perspective under certain conditions. This may continue to hold for some generalizations of the former; however, a formal characterization of such conditions is not available in the previously published literature. In this research, theoretical aspects of the problem are critically examined, uniform generalizations of granular VPRS are introduced, new connections with granular graded rough sets are proved, appropriate concepts of substantial parthood are introduced, and their extent of compatibility with the framework is assessed. Furthermore, meta-applications to cluster validation, image segmentation, and dynamic sorting are developed. Basic assumptions are explained, and additional examples are constructed for readability.
    k-Means Maximum Entropy Exploration. (arXiv:2205.15623v1 [cs.LG])
    Exploration in high-dimensional, continuous spaces with sparse rewards is an open problem in reinforcement learning. Artificial curiosity algorithms address this by creating rewards that lead to exploration. Given a reinforcement learning algorithm capable of maximizing rewards, the problem reduces to finding an optimization objective consistent with exploration. Maximum entropy exploration uses the entropy of the state visitation distribution as such an objective. However, efficiently estimating the entropy of the state visitation distribution is challenging in high-dimensional, continuous spaces. We introduce an artificial curiosity algorithm based on lower bounding an approximation to the entropy of the state visitation distribution. The bound relies on a result for non-parametric density estimation in arbitrary dimensions using k-means. We show that our approach is both computationally efficient and competitive on benchmarks for exploration in high-dimensional, continuous spaces, especially on tasks where reinforcement learning algorithms are unable to find rewards.
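    The abstract does not spell out the bound, so the snippet below is only a plausible stand-in: it uses the log distance to the nearest k-means centroid as a crude density-based curiosity signal, in the spirit of k-means-based entropy estimation; the hyperparameters and reward shape are assumptions.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_entropy_reward(states: np.ndarray, k: int = 32) -> np.ndarray:
        """Fit k-means on visited states and reward states far from every
        centroid; the log nearest-centroid distance is a rough entropy proxy,
        since spreading visits out in state space increases it."""
        km = KMeans(n_clusters=k, n_init=10).fit(states)
        dists = np.linalg.norm(
            states[:, None, :] - km.cluster_centers_[None, :, :], axis=-1)
        nearest = dists.min(axis=1)
        return np.log(nearest + 1e-8)  # per-state intrinsic reward

    # Usage (hypothetical names): r_int = kmeans_entropy_reward(replay_states)
    #                             total_r = r_env + beta * r_int
    ```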
    Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization. (arXiv:2205.15784v1 [stat.CO])
    Bayesian Likelihood-Free Inference methods yield posterior approximations for simulator models with intractable likelihood. Recently, many works have trained neural networks to approximate either the intractable likelihood or the posterior directly. Most proposals use normalizing flows, namely neural networks parametrizing invertible maps used to transform samples from an underlying base measure; the probability density of the transformed samples is then accessible, and the normalizing flow can be trained via maximum likelihood on simulated parameter-observation pairs. A recent work [Ramesh et al., 2022] instead approximated the posterior with generative networks, which drop the invertibility requirement and are thus a more flexible class of distributions scaling to high-dimensional and structured data. However, generative networks only allow sampling from the parametrized distribution; for this reason, Ramesh et al. [2022] follow the common solution of adversarial training, where the generative network plays a min-max game against a "critic" network. This procedure is unstable and can lead to a learned distribution underestimating the uncertainty - in extreme cases collapsing to a single point. Here, we propose to approximate the posterior with generative networks trained by Scoring Rule minimization, an overlooked adversarial-free method enabling smooth training and better uncertainty quantification. In simulation studies, the Scoring Rule approach yields better performance with shorter training time than the adversarial framework.
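    Scoring rules of this kind admit compact training losses. As an illustration, here is the (negatively oriented) energy score for a batch of generator draws against one observed target; whether this matches the paper's exact choice of scoring rule is an assumption on our part.

    ```python
    import torch

    def energy_score_loss(samples: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """samples: (m, d) draws from the conditional generative network,
        y: (d,) the observed target. The first term pulls samples toward y;
        the repulsive second term discourages collapse to a single point."""
        m = samples.shape[0]
        attract = (samples - y).norm(dim=-1).mean()   # E ||X - y||
        pairwise = torch.cdist(samples, samples)      # ||X_i - X_j|| matrix
        repel = pairwise.sum() / (m * (m - 1))        # mean over i != j
        return attract - 0.5 * repel
    ```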
    Semantic Autoencoder and Its Potential Usage for Adversarial Attack. (arXiv:2205.15592v1 [cs.LG])
    Autoencoders can give rise to an appropriate latent representation of the input data; however, a representation based solely on the intrinsic properties of the input data is usually inferior at expressing semantic information. A typical case is the potential incapability of forming a clear boundary upon clustering of these representations. By encoding a latent representation that depends not only on the content of the input data but also on its semantics, such as label information, we propose an enhanced autoencoder architecture named semantic autoencoder. Experiments visualizing the representation distributions via t-SNE show a clear distinction between these two types of encoders and confirm the advantage of the semantic one, whilst the decoded samples of the two types of autoencoders exhibit only faint dissimilarity, both objectively and subjectively. Based on this observation, we consider adversarial attacks on learning algorithms that rely on latent representations obtained via autoencoders. It turns out that the latent contents of adversarial samples constructed from a semantic encoder with deliberately wrong label information exhibit a different distribution from that of the original input data, while the samples themselves show only marginal differences. This new attack vector established by our work is worthy of attention due to the necessity of securing widespread deep learning applications.
    Label-Enhanced Graph Neural Network for Semi-supervised Node Classification. (arXiv:2205.15653v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely applied to the semi-supervised node classification task, where a key point lies in how to sufficiently leverage the limited but valuable label information. Most classical GNNs solely use the known labels for computing the classification loss at the output. In recent years, several methods have been designed to additionally utilize the labels at the input. Some of these methods augment the node features by concatenating or adding the one-hot encodings of labels, while others optimize the graph structure by assuming that neighboring nodes tend to have the same label. To bring the rich information of labels into full play, in this paper we present a label-enhanced learning framework for GNNs, which first models each label as a virtual center for intra-class nodes and then jointly learns the representations of both nodes and labels. Our approach not only smooths the representations of nodes belonging to the same class, but also explicitly encodes the label semantics into the learning process of GNNs. Moreover, a training node selection technique is provided to eliminate the potential label leakage issue and guarantee the model's generalization ability. Finally, an adaptive self-training strategy is proposed to iteratively enlarge the training set with more reliable pseudo labels and to distinguish the importance of each pseudo-labeled node during the model training process. Experimental results on both real-world and synthetic datasets demonstrate that our approach can not only consistently outperform the state of the art, but also effectively smooth the representations of intra-class nodes.
    A Computation and Communication Efficient Method for Distributed Nonconvex Problems in the Partial Participation Setting. (arXiv:2205.15580v1 [cs.LG])
    We present a new method that includes three key components of distributed optimization and federated learning: variance reduction of stochastic gradients, compressed communication, and partial participation. We prove that the new method has optimal oracle complexity and state-of-the-art communication complexity in the partial participation setting. Moreover, we observe that "1 + 1 + 1 is not 3": by mixing variance reduction of stochastic gradients with compressed communication and partial participation, we do not obtain a fully synergetic effect. We explain the nature of this phenomenon, argue that this is to be expected, and propose possible workarounds.
    GlanceNets: Interpretable, Leak-proof Concept-based Models. (arXiv:2205.15612v1 [cs.LG])
    There is growing interest in concept-based models (CBMs) that combine high-performance and interpretability by acquiring and reasoning with a vocabulary of high-level concepts. A key requirement is that the concepts be interpretable. Existing CBMs tackle this desideratum using a variety of heuristics based on unclear notions of interpretability, and fail to acquire concepts with the intended semantics. We address this by providing a clear definition of interpretability in terms of alignment between the model's representation and an underlying data generation process, and introduce GlanceNets, a new CBM that exploits techniques from disentangled representation learning and open-set recognition to achieve alignment, thus improving the interpretability of the learned concepts. We show that GlanceNets, paired with concept-level supervision, achieve better alignment than state-of-the-art approaches while preventing spurious information from unintendedly leaking into the learned concepts.
    HyperMAML: Few-Shot Adaptation of Deep Models with Hypernetworks. (arXiv:2205.15745v1 [cs.LG])
    The aim of Few-Shot learning methods is to train models which can easily adapt to previously unseen tasks, based on small amounts of data. One of the most popular and elegant Few-Shot learning approaches is Model-Agnostic Meta-Learning (MAML). The main idea behind this method is to learn the general weights of the meta-model, which are further adapted to specific problems in a small number of gradient steps. However, the model's main limitation lies in the fact that the update procedure is realized by gradient-based optimisation. In consequence, MAML cannot always modify weights to the essential level in one or even a few gradient iterations. On the other hand, using many gradient steps results in a complex and time-consuming optimization procedure, which is hard to train in practice, and may lead to overfitting. In this paper, we propose HyperMAML, a novel generalization of MAML, where the training of the update procedure is also part of the model. Namely, in HyperMAML, instead of updating the weights with gradient descent, we use for this purpose a trainable Hypernetwork. Consequently, in this framework, the model can generate significant updates whose range is not limited to a fixed number of gradient steps. Experiments show that HyperMAML consistently outperforms MAML and performs comparably to other state-of-the-art techniques in a number of standard Few-Shot learning benchmarks.
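    A minimal sketch of the core idea follows, assuming a simplified setup where a hypernetwork emits an additive update for a linear classification head from a support-set embedding; all module names and shapes are illustrative, not HyperMAML's actual architecture.

    ```python
    import torch
    import torch.nn as nn

    class HyperUpdater(nn.Module):
        """Maps a support-set embedding to an additive update for a linear head,
        replacing MAML's inner-loop gradient steps with one forward pass."""
        def __init__(self, emb_dim, in_dim, n_classes, hidden=128):
            super().__init__()
            self.in_dim, self.n_classes = in_dim, n_classes
            n_out = n_classes * in_dim + n_classes   # flattened W and b deltas
            self.net = nn.Sequential(
                nn.Linear(emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_out)
            )

        def forward(self, support_emb, W, b):
            delta = self.net(support_emb)            # (n_out,)
            dW = delta[: self.n_classes * self.in_dim].view(self.n_classes,
                                                            self.in_dim)
            db = delta[self.n_classes * self.in_dim:]
            return W + dW, b + db                    # task-adapted head weights
    ```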
    Concept-level Debugging of Part-Prototype Networks. (arXiv:2205.15769v1 [cs.LG])
    Part-prototype Networks (ProtoPNets) are concept-based classifiers designed to achieve the same performance as black-box models without compromising transparency. ProtoPNets compute predictions based on similarity to class-specific part-prototypes learned to recognize parts of training examples, making it easy to faithfully determine what examples are responsible for any target prediction and why. However, like other models, they are prone to picking up confounds and shortcuts from the data, thus suffering from compromised prediction accuracy and limited generalization. We propose ProtoPDebug, an effective concept-level debugger for ProtoPNets in which a human supervisor, guided by the model's explanations, supplies feedback in the form of what part-prototypes must be forgotten or kept, and the model is fine-tuned to align with this supervision. An extensive empirical evaluation on synthetic and real-world data shows that ProtoPDebug outperforms state-of-the-art debuggers for a fraction of the annotation cost.
    Investigating the Role of Image Retrieval for Visual Localization -- An exhaustive benchmark. (arXiv:2205.15761v1 [cs.CV])
    Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two purposes: (1) provide an approximate pose estimate or (2) determine which parts of the scene are potentially visible in a given query image. It is common practice to use state-of-the-art image retrieval algorithms for both of them. These algorithms are often trained for the goal of retrieving the same landmark under a large range of viewpoint changes, which often differs from the requirements of visual localization. In order to investigate the consequences for visual localization, this paper focuses on understanding the role of image retrieval for multiple visual localization paradigms. First, we introduce a novel benchmark setup and compare state-of-the-art retrieval representations on multiple datasets using localization performance as the metric. Second, we investigate several definitions of "ground truth" for image retrieval. Using these definitions as upper bounds for the visual localization paradigms, we show that there is still significant room for improvement. Third, using these tools and in-depth analysis, we show that retrieval performance on classical landmark retrieval or place recognition tasks correlates with localization performance only for some, but not all, paradigms. Finally, we analyze the effects of blur and dynamic scenes in the images. We conclude that there is a need for retrieval approaches specifically designed for localization paradigms. Our benchmark and evaluation protocols are available at https://github.com/naver/kapture-localization.
    You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments. (arXiv:2205.15967v1 [cs.LG])
    Recently, methods such as Decision Transformer that reduce reinforcement learning to a prediction task and solve it via supervised learning (RvS) have become popular due to their simplicity, robustness to hyperparameters, and strong overall performance on offline RL tasks. However, simply conditioning a probabilistic model on a desired return and taking the predicted action can fail dramatically in stochastic environments, since trajectories that result in a return may have only achieved that return due to luck. In this work, we describe the limitations of RvS approaches in stochastic environments and propose a solution. Rather than simply conditioning on the return of a single trajectory, as is standard practice, our proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent of environment stochasticity. Doing so allows ESPER to achieve strong alignment between target return and expected performance in real environments. We demonstrate this in several challenging stochastic offline-RL tasks, including the puzzle game 2048 and Connect Four played against a stochastic opponent. In all tested domains, ESPER achieves significantly better alignment between the target return and the achieved return than simply conditioning on returns. ESPER also achieves higher maximum performance than even the value-based baselines.
    Contrastive Representation Learning for 3D Protein Structures. (arXiv:2205.15675v1 [q-bio.BM])
    Learning from 3D protein structures has gained wide interest in protein modeling and structural bioinformatics. Unfortunately, the number of available structures is orders of magnitude lower than the training data sizes commonly used in computer vision and machine learning. Moreover, this number is reduced even further when only annotated protein structures can be considered, making the training of existing models difficult and prone to over-fitting. To address this challenge, we introduce a new representation learning framework for 3D protein structures. Our framework uses unsupervised contrastive learning to learn meaningful representations of protein structures, making use of proteins from the Protein Data Bank. We show how these representations can be used to solve a large variety of tasks, such as protein function prediction, protein fold classification, structural similarity prediction, and protein-ligand binding affinity prediction. Moreover, we show how fine-tuned networks, pre-trained with our algorithm, lead to significantly improved task performance, achieving new state-of-the-art results in many tasks.
    Differentiable Invariant Causal Discovery. (arXiv:2205.15638v1 [cs.LG])
    Learning causal structure from observational data is a fundamental challenge in machine learning. The majority of commonly used differentiable causal discovery methods are non-identifiable, turning this problem into a continuous optimization task prone to data biases. In many real-life situations, data is collected from different environments, in which the functional relations remain consistent across environments while the distribution of additive noises may vary. This paper proposes Differentiable Invariant Causal Discovery (DICD), which utilizes multi-environment information within a differentiable framework to avoid learning spurious edges and wrong causal directions. Specifically, DICD aims to discover the environment-invariant causation while removing the environment-dependent correlation. We further formulate a constraint that enforces the target structural equation model to remain optimal across the environments. Theoretical guarantees for the identifiability of the proposed DICD are provided under mild conditions with enough environments. Extensive experiments on synthetic and real-world datasets verify that DICD outperforms state-of-the-art causal discovery methods by up to 36% in SHD. Our code will be open-sourced upon acceptance.
    SymFormer: End-to-end symbolic regression using transformer-based architecture. (arXiv:2205.15764v1 [cs.CV])
    Novel view synthesis is a long-standing problem. In this work, we consider a variant of the problem where we are given only a few context views sparsely covering a scene or an object. The goal is to predict novel viewpoints in the scene, which requires learning priors. The current state of the art is based on Neural Radiance Fields (NeRFs), and while achieving impressive results, the methods suffer from long training times as they require evaluating thousands of 3D point samples via a deep neural network for each image. We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network. Our model uses a two-stage architecture consisting of a codebook and a transformer model. The codebook is used to embed individual images into a smaller latent space, and the transformer solves the view synthesis task in this more compact space. To train our model efficiently, we introduce a novel branching attention mechanism that allows us to use the same model not only for neural rendering but also for camera pose estimation. Experimental results on real-world scenes show that our approach is competitive compared to NeRF-based methods while not reasoning in 3D, and it is faster to train.
    Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish. (arXiv:2205.15712v1 [cs.CL])
    Product matching corresponds to the task of matching identical products across different data sources. It typically employs available product features which, apart from being multimodal, i.e., comprised of various data types, might be non-homogeneous and incomplete. The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features in both English and Polish. We tested the multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons - a training dataset and gold standard for large-scale product matching. The obtained results show that these models perform similarly to the latest solutions tested on this set, and in some cases the results were even better. Additionally, we prepared a new dataset -- ProductMatch.pl -- that is entirely in Polish and based on offers in selected categories obtained from several online stores for research purposes. It is the first open dataset for product matching tasks in Polish and allows comparing the effectiveness of the pre-trained models. Thus, we also report the baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish dataset.
    Robust Projection based Anomaly Extraction (RPE) in Univariate Time-Series. (arXiv:2205.15548v1 [stat.ML])
    This paper presents a novel, closed-form, and data/computation-efficient online anomaly detection algorithm for time-series data. The proposed method, dubbed RPE, is a window-based method and, in sharp contrast to existing window-based methods, it is robust to the presence of anomalies in its window and can distinguish anomalies at the time-stamp level. RPE leverages the linear structure of the trajectory matrix of the time-series and employs a robust projection step which makes the algorithm able to handle the presence of multiple arbitrarily large anomalies in its window. A closed-form/non-iterative algorithm for the robust projection step is provided, and it is proved that it can identify the corrupted time-stamps. RPE is a strong candidate for applications where large training datasets are not available, which is the common scenario in the time-series area. An extensive set of numerical experiments shows that RPE can outperform existing approaches by a notable margin.
    HW-Aware Initialization of DNN Auto-Tuning to Improve Exploration Time and Robustness. (arXiv:2205.15568v1 [cs.LG])
    The process of optimizing the latency of DNN operators with ML models and hardware-in-the-loop, called auto-tuning, has established itself as a pervasive method for the deployment of neural networks. From a search space of loop optimizations, the candidate providing the best performance has to be selected. The performance of individual configurations is evaluated through hardware measurements. The combinatorial explosion of possible configurations, together with the cost of hardware evaluation, makes exhaustive exploration of the search space infeasible in practice. Machine learning methods, like random forests or reinforcement learning, are used to aid in the selection of candidates for hardware evaluation. For general-purpose hardware like x86 and GPGPU architectures, impressive performance gains can be achieved compared to hand-optimized libraries like cuDNN. The method is also useful in the space of hardware accelerators with less widespread adoption, where a high-performance library is not always available. However, hardware accelerators are often less flexible with respect to their programming, which leads to operator configurations that are not executable on the hardware target. This work evaluates how these invalid configurations affect the auto-tuning process and its underlying performance prediction model for the VTA hardware. From these results, a validity-driven initialization method for AutoTVM is developed, requiring only 41.6% of the necessary hardware measurements to find the best solution, while improving search robustness.
    Automatic Relation-aware Graph Network Proliferation. (arXiv:2205.15678v1 [cs.LG])
    Graph neural architecture search has sparked much attention as Graph Neural Networks (GNNs) have shown powerful reasoning capability in many relational tasks. However, the currently used graph search space overemphasizes learning node features and neglects mining hierarchical relational information. Moreover, due to diverse mechanisms in the message passing, the graph search space is much larger than that of CNNs. This hinders the straightforward application of classical search strategies for exploring complicated graph search space. We propose Automatic Relation-aware Graph Network Proliferation (ARGNP) for efficiently searching GNNs with a relation-guided message passing mechanism. Specifically, we first devise a novel dual relation-aware graph search space that comprises both node and relation learning operations. These operations can extract hierarchical node/relational information and provide anisotropic guidance for message passing on a graph. Second, analogous to cell proliferation, we design a network proliferation search paradigm to progressively determine the GNN architectures by iteratively performing network division and differentiation. The experiments on six datasets for four graph learning tasks demonstrate that GNNs produced by our method are superior to the current state-of-the-art hand-crafted and search-based GNNs. Codes are available at https://github.com/phython96/ARGNP.
    Secure Federated Clustering. (arXiv:2205.15564v1 [cs.LG])
    We consider a foundational unsupervised learning task of $k$-means data clustering, in a federated learning (FL) setting consisting of a central server and many distributed clients. We develop SecFC, a secure federated clustering algorithm that simultaneously achieves 1) universal performance: no performance loss compared with clustering over centralized data, regardless of data distribution across clients; and 2) data privacy: each client's private data and the cluster centers are not leaked to other clients or the server. In SecFC, the clients perform Lagrange encoding on their local data and share the coded data in an information-theoretically private manner; then, leveraging the algebraic structure of the coding, the FL network exactly executes Lloyd's $k$-means heuristic over the coded data to obtain the final clustering. Experimental results on synthetic and real datasets demonstrate the universally superior performance of SecFC for different data distributions across clients, and its computational practicality for various combinations of system parameters. Finally, we propose an extension of SecFC to further provide membership privacy for all data points.
    Scalable Distributional Robustness in a Class of Non Convex Optimization with Guarantees. (arXiv:2205.15624v1 [cs.LG])
    Distributionally robust optimization (DRO) has shown a lot of promise in providing robustness in learning as well as in sample-based optimization problems. We endeavor to provide DRO solutions for a class of sum-of-fractionals, non-convex optimization problems used for decision making in prominent areas such as facility location and security games. In contrast to previous work, we find it more tractable to optimize the equivalent variance-regularized form of DRO rather than the minimax form. We transform the variance-regularized form into a mixed-integer second-order cone program (MISOCP) which, while guaranteeing near-global optimality, does not scale well enough to solve problems with real-world datasets. We further propose two abstraction approaches based on clustering and stratified sampling to increase scalability, which we then use on real-world datasets. Importantly, we provide near-global optimality guarantees for our approach and show experimentally that our solution quality is better than the locally optimal solutions achieved by state-of-the-art gradient-based methods. We experimentally compare our different approaches and baselines, and reveal nuanced properties of a DRO solution.
    Multi-task Optimization Based Co-training for Electricity Consumption Prediction. (arXiv:2205.15663v1 [cs.LG])
    Real-world electricity consumption prediction may involve different tasks, e.g., prediction for different time steps ahead or for different geo-locations. These tasks are often solved independently, without utilizing common problem-solving knowledge that could be extracted and shared among them to improve the performance on each task. In this work, we propose a multi-task optimization (MTO) based co-training (MTO-CT) framework, where the models for solving different tasks are co-trained via an MTO paradigm in which solving each task may benefit from the knowledge gained from solving the other tasks. MTO-CT uses a long short-term memory (LSTM) based model as the predictor, where the knowledge is represented via connection weights and biases. In MTO-CT, an inter-task knowledge transfer module is designed to transfer knowledge between different tasks: the most helpful source tasks are selected using probability matching and stochastic universal selection, and evolutionary operations like mutation and crossover are performed to reuse the knowledge from the selected source tasks in a target task. We use electricity consumption data from five states in Australia to design two sets of tasks at different scales: a) one-step ahead prediction for each state (five tasks) and b) 6-step, 12-step, 18-step, and 24-step ahead prediction for each state (20 tasks). The performance of MTO-CT is evaluated on solving each of these two sets of tasks in comparison to solving each task in the set independently, without knowledge sharing, under the same settings, which demonstrates the superiority of MTO-CT in terms of prediction accuracy.
    GSR: A Generalized Symbolic Regression Approach. (arXiv:2205.15569v1 [cs.LG])
    Identifying the mathematical relationships that best describe a dataset remains a very challenging problem in machine learning, and is known as Symbolic Regression (SR). In contrast to neural networks which are often treated as black boxes, SR attempts to gain insight into the underlying relationships between the independent variables and the target variable of a given dataset by assembling analytical functions. In this paper, we present GSR, a Generalized Symbolic Regression approach, by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we infer mathematical relationships between the independent variables and some transformation of the target variable. We constrain our search space to a weighted sum of basis functions, and propose a genetic programming approach with a matrix-based encoding scheme. We show that our GSR method outperforms several state-of-the-art methods on the well-known SR benchmark problem sets. Finally, we highlight the strengths of GSR by introducing SymSet, a new SR benchmark set which is more challenging relative to the existing benchmarks.
    The CLRS Algorithmic Reasoning Benchmark. (arXiv:2205.15659v1 [cs.LG])
    Learning representations of algorithms is an emerging area of machine learning, seeking to bridge concepts from neural networks with classical algorithms. Several important works have investigated whether neural networks can effectively reason like algorithms, typically by learning to execute them. The common trend in the area, however, is to generate targeted kinds of algorithmic data to evaluate specific hypotheses, making results hard to transfer across publications, and increasing the barrier of entry. To consolidate progress and work towards unified evaluation, we propose the CLRS Algorithmic Reasoning Benchmark, covering classical algorithms from the Introduction to Algorithms textbook. Our benchmark spans a variety of algorithmic reasoning procedures, including sorting, searching, dynamic programming, graph algorithms, string algorithms and geometric algorithms. We perform extensive experiments to demonstrate how several popular algorithmic reasoning baselines perform on these tasks, and consequently, highlight links to several open challenges. Our library is readily available at https://github.com/deepmind/clrs.
    VC Theoretical Explanation of Double Descent. (arXiv:2205.15549v1 [stat.ML])
    There has been growing interest in the generalization performance of large multilayer neural networks that can be trained to achieve zero training error while generalizing well on test data. This regime is known as 'second descent' and it appears to contradict the conventional view that optimal model complexity should reflect an optimal balance between underfitting and overfitting, i.e., the bias-variance trade-off. This paper presents a VC-theoretical analysis of double descent and shows that it can be fully explained by classical VC generalization bounds. We illustrate an application of analytic VC-bounds for modeling double descent in classification problems, using empirical results for several learning methods, such as SVM, Least Squares, and Multilayer Perceptron classifiers. In addition, we discuss several possible reasons for the misinterpretation of VC-theoretical results in the machine learning community.
    Simulation-Based Inference with WALDO: Perfectly Calibrated Confidence Regions Using Any Prediction or Posterior Estimation Algorithm. (arXiv:2205.15680v1 [stat.ML])
    The vast majority of modern machine learning targets prediction problems, with algorithms such as Deep Neural Networks revolutionizing the accuracy of point predictions for high-dimensional complex data. Predictive approaches are now used in many domain sciences to directly estimate internal parameters of interest in theoretical simulator-based models. In parallel, common alternatives focus on estimating the full posterior using modern neural density estimators such as normalizing flows. However, an open problem in simulation-based inference (SBI) is how to construct properly calibrated confidence regions for internal parameters with nominal conditional coverage and high power. Many SBI methods are indeed known to produce overly confident posterior approximations, yielding misleading uncertainty estimates. Similarly, existing approaches for uncertainty quantification in deep learning provide no guarantees on conditional coverage. In this work, we present WALDO, a novel method for constructing correctly calibrated confidence regions in SBI. WALDO reframes the well-known Wald test and uses Neyman inversion to convert point predictions and posteriors from any prediction or posterior estimation algorithm to confidence sets with correct conditional coverage, even for finite sample sizes. As a concrete example, we demonstrate how a recently proposed deep learning prediction approach for particle energies in high-energy physics can be recalibrated using WALDO to produce confidence intervals with correct coverage and high power.
    Mitigating Dataset Bias by Using Per-sample Gradient. (arXiv:2205.15704v1 [cs.LG])
    The performance of deep neural networks is strongly influenced by the training dataset setup. In particular, when attributes that are strongly correlated with the target attribute are present, the trained model can make unintended prejudgments and show significant inference errors (i.e., the dataset bias problem). Various methods have been proposed to mitigate dataset bias, with an emphasis on weakly correlated samples, called bias-conflicting samples. These methods rely on explicit bias labels involving human effort or on empirical correlation metrics (e.g., training loss). However, such metrics require human costs or lack sufficient theoretical justification. In this study, we propose a debiasing algorithm, called PGD (Per-sample Gradient-based Debiasing), that comprises three steps: (1) training a model with uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of its per-sample gradient, and (3) training the model with importance-batch sampling, whose probabilities are obtained in step (2). Compared with existing baselines on various synthetic and real-world datasets, the proposed method shows state-of-the-art accuracy for the classification task. Furthermore, we provide a theoretical understanding of how PGD can mitigate dataset bias.
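    A minimal PyTorch sketch of the three steps (function and variable names are ours; the paper's exact training details are not reproduced, and per-sample gradients can be computed more efficiently than with batch size 1):

    ```python
    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # Step (1) is a standard training loop with uniform batch sampling (omitted).

    def per_sample_grad_norms(model, dataset, loss_fn):
        """Step (2): score each sample by the norm of its loss gradient
        w.r.t. the model parameters (batch size 1 gives per-sample gradients)."""
        norms = []
        for x, y in DataLoader(dataset, batch_size=1):
            model.zero_grad()
            loss_fn(model(x), y).backward()
            sq = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
            norms.append(sq.sqrt().item())
        return torch.tensor(norms)

    def importance_loader(dataset, norms, batch_size=64):
        """Step (3): resample the dataset with probability proportional to the norm."""
        sampler = WeightedRandomSampler(weights=norms / norms.sum(),
                                        num_samples=len(dataset), replacement=True)
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    ```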
    Individual health-disease phase diagrams for disease prevention based on machine learning. (arXiv:2205.15598v1 [cs.LG])
    Early disease detection and prevention methods based on effective interventions are gaining attention. Machine learning technology has enabled precise disease prediction by capturing individual differences in multivariate data. Progress in precision medicine has revealed that substantial heterogeneity exists in health data at the individual level and that complex health factors are involved in the development of chronic diseases. However, it remains a challenge to identify individual physiological state changes in cross-disease onset processes because of the complex relationships among multiple biomarkers. Here, we present the health-disease phase diagram (HDPD), which represents a personal health state by visualizing the boundary values of multiple biomarkers that fluctuate early in the disease progression process. In HDPDs, future onset predictions are represented by perturbing multiple biomarker values while accounting for dependencies among variables. We constructed HDPDs for 11 non-communicable diseases (NCDs) from a longitudinal health checkup cohort of 3,238 individuals, comprising 3,215 measurement items and genetic data. Improvement of biomarker values to the non-onset region in HDPD significantly prevented future disease onset in 7 out of 11 NCDs. Our results demonstrate that HDPDs can represent individual physiological states in the onset process and be used as intervention goals for disease prevention.
    A Closer Look at Invalid Action Masking in Policy Gradient Algorithms. (arXiv:2006.14171v3 [cs.LG] UPDATED)
    In recent years, Deep Reinforcement Learning (DRL) algorithms have achieved state-of-the-art performance in many challenging strategy games. Because these games have complicated rules, an action sampled from the full discrete action distribution predicted by the learned policy is likely to be invalid according to the game rules (e.g., walking into a wall). The usual approach to deal with this problem in policy gradient algorithms is to "mask out" invalid actions and just sample from the set of valid actions. The implications of this process, however, remain under-investigated. In this paper, we 1) show theoretical justification for such a practice, 2) empirically demonstrate its importance as the space of invalid actions grows, and 3) provide further insights by evaluating different action masking regimes, such as removing masking after an agent has been trained using masking. The source code can be found at https://github.com/vwxyzjn/invalid-action-masking
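    The masking operation itself is small; a sketch of the standard implementation (ours), which replaces invalid logits with a large negative constant before the softmax so that invalid actions receive numerically zero probability:

    ```python
    import torch

    def masked_policy(logits: torch.Tensor, valid_mask: torch.Tensor):
        """Mask out invalid actions: invalid logits become a large negative value,
        so the renormalized distribution samples only valid actions."""
        neg_inf = torch.full_like(logits, -1e8)
        return torch.distributions.Categorical(
            logits=torch.where(valid_mask, logits, neg_inf))

    logits = torch.tensor([1.0, 2.0, 0.5, -0.3])
    valid = torch.tensor([True, False, True, True])   # action 1 is invalid here
    dist = masked_policy(logits, valid)
    a = dist.sample()                                  # never samples action 1
    logp = dist.log_prob(a)                            # gradient is (numerically) zero for masked logits
    ```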
    ViNNPruner: Visual Interactive Pruning for Deep Learning. (arXiv:2205.15731v1 [cs.LG])
    Neural networks grow vastly in size to tackle more sophisticated tasks. In many cases, such large networks are not deployable on particular hardware and need to be reduced in size. Pruning techniques help shrink deep neural networks to smaller sizes while decreasing their performance as little as possible. However, such pruning algorithms are often hard for users to understand simply by applying them, and they do not incorporate domain knowledge, which can work against user goals. We propose ViNNPruner, a visual interactive pruning application that implements state-of-the-art pruning algorithms and gives users the option to prune manually based on their knowledge. We show how the application facilitates gaining insights into automatic pruning algorithms and semi-automatically pruning oversized networks to make them more efficient using interactive visualizations.
    Goal-Aware Neural SAT Solver. (arXiv:2106.07162v2 [cs.LG] UPDATED)
    Modern neural networks obtain information about the problem and calculate the output solely from the input values. We argue that it is not always optimal, and the network's performance can be significantly improved by augmenting it with a query mechanism that allows the network at run time to make several solution trials and get feedback on the loss value on each trial. To demonstrate the capabilities of the query mechanism, we formulate an unsupervised (not depending on labels) loss function for Boolean Satisfiability Problem (SAT) and theoretically show that it allows the network to extract rich information about the problem. We then propose a neural SAT solver with a query mechanism called QuerySAT and show that it outperforms the neural baseline on a wide range of SAT tasks.
    Unsupervised Image Representation Learning with Deep Latent Particles. (arXiv:2205.15821v1 [cs.CV])
    We propose a new representation of visual data that disentangles object position from appearance. Our method, termed Deep Latent Particles (DLP), decomposes the visual input into low-dimensional latent ``particles'', where each particle is described by its spatial location and features of its surrounding region. To drive learning of such representations, we follow a VAE-based approach and introduce a prior for particle positions based on a spatial-softmax architecture, and a modification of the evidence lower bound loss inspired by the Chamfer distance between particles. We demonstrate that our DLP representations are useful for downstream tasks such as unsupervised keypoint (KP) detection, image manipulation, and video prediction for scenes composed of multiple dynamic objects. In addition, we show that our probabilistic interpretation of the problem naturally provides uncertainty estimates for particle locations, which can be used for model selection, among other tasks. Videos and code are available: https://taldatech.github.io/deep-latent-particles-web/
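    The Chamfer-style term in the loss is easy to state; a minimal sketch (ours) of a symmetric Chamfer distance between two particle sets:

    ```python
    import torch

    def chamfer_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        """Symmetric Chamfer distance between particle sets p (n, d) and q (m, d):
        each particle is matched to its nearest neighbour in the other set."""
        d = torch.cdist(p, q)                     # (n, m) pairwise Euclidean distances
        return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
    ```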
    The Computational Drug Repositioning without Negative Sampling. (arXiv:2111.14696v3 [cs.LG] UPDATED)
    Computational drug repositioning technology is an effective tool to accelerate drug development. Although this technique has been widely used and successful in recent decades, many existing models still suffer from multiple drawbacks, such as their treatment of the massive number of unvalidated drug-disease associations and their reliance on the inner product. The limitations of these works are mainly due to the following two reasons: first, previous works use negative sampling techniques to treat unvalidated drug-disease associations as negative samples, which is invalid in real-world settings; second, the inner product cannot fully take into account the feature information contained in the latent factors of drugs and diseases. In this paper, we propose a novel PUON framework for addressing the above deficiencies, which models the risk estimator of computational drug repositioning using only validated (Positive) and unvalidated (Unlabelled) drug-disease associations, without employing negative sampling techniques. PUON also includes an Outer Neighborhood-based classifier for modeling the cross-feature information of the latent factors. For a comprehensive comparison, we consider 8 popular baselines. Extensive experiments on four real-world datasets show that the PUON model achieves the best performance on 6 evaluation metrics.
    Automatic Diagnosis of Schizophrenia and Attention Deficit Hyperactivity Disorder in rs-fMRI Modality using Convolutional Autoencoder Model and Interval Type-2 Fuzzy Regression. (arXiv:2205.15858v1 [cs.LG])
    Many people worldwide suffer from brain disorders that endanger their health. So far, numerous methods have been proposed for the diagnosis of Schizophrenia (SZ) and attention deficit hyperactivity disorder (ADHD), among which functional magnetic resonance imaging (fMRI) modalities are popular among physicians. This paper presents an intelligent detection method for SZ and ADHD from the resting-state fMRI (rs-fMRI) modality using a new deep learning (DL) approach. The University of California Los Angeles (UCLA) dataset, which contains the rs-fMRI modalities of SZ and ADHD patients, is used for the experiments. The FMRIB Software Library (FSL) toolbox first performs preprocessing on the rs-fMRI data. Then, a convolutional autoencoder (CNN-AE) model with the proposed number of layers is used to extract features from the rs-fMRI data. In the classification step, a new fuzzy method called interval type-2 fuzzy regression (IT2FR) is introduced and then optimized by genetic algorithm (GA), particle swarm optimization (PSO), and gray wolf optimization (GWO) techniques. The results of the IT2FR method are also compared with multilayer perceptron (MLP), k-nearest neighbors (KNN), support vector machine (SVM), random forest (RF), decision tree (DT), and adaptive neuro-fuzzy inference system (ANFIS) methods. The experimental results show that the IT2FR method with the GWO optimization algorithm achieves satisfactory results compared to the other classifier methods. Finally, the proposed classification technique achieves 72.71% accuracy.
    Mean Field inference of CRFs based on GAT. (arXiv:2205.15312v1 [cs.LG])
    In this paper, we propose an improved mean-field inference algorithm for the fully connected pairwise CRF model. In the improved method, the message-passing operation is changed from the original linear convolution to a graph attention operation, and the inference procedure becomes the forward pass of a GAT model. Combined with the mean-field-inferred label distribution, this is equivalent to the output of a classifier with only a unary potential. To this end, we propose a graph attention network model with a residual structure; the approach is applicable to all sequence annotation tasks, such as pixel-level image semantic segmentation as well as text annotation.
    Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game. (arXiv:2205.15512v1 [cs.LG])
    Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, the minimax optimal performance has only been (nearly) achieved for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose two new algorithms, SPEVI+ and SPMVI+, for single-agent MDPs and two-player zero-sum Markov games (MGs), respectively. The proposed algorithms feature carefully crafted data splitting mechanisms and novel variance-reduction pessimistic estimators. Theoretical analysis demonstrates that they are capable of matching the performance lower bounds up to logarithmic factors. As a byproduct, a new performance lower bound is established for MGs, which tightens the existing results. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.
    Variational Transfer Learning using Cross-Domain Latent Modulation. (arXiv:2205.15523v1 [cs.LG])
    To successfully apply trained neural network models to new domains, powerful transfer learning solutions are essential. We propose to introduce a novel cross-domain latent modulation mechanism into a variational autoencoder framework so as to achieve effective transfer learning. Our key idea is to procure deep representations from one data domain and use them to influence the reparameterization of the latent variable of another domain. Specifically, deep representations of the source and target domains are first extracted by a unified inference model and aligned by employing gradient reversal. The learned deep representations are then cross-modulated into the latent encoding of the alternative domain, where consistency constraints are also applied. In an empirical validation that includes a number of transfer learning benchmark tasks for unsupervised domain adaptation and image-to-image translation, our model demonstrates competitive performance, which is also supported by evidence obtained from visualization.
    Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning. (arXiv:2205.15367v1 [cs.LG])
    We generalise the problem of reward modelling (RM) for reinforcement learning (RL) to handle non-Markovian rewards. Existing work assumes that human evaluators observe each step in a trajectory independently when providing feedback on agent behaviour. In this work, we remove this assumption, extending RM to include hidden state information that captures temporal dependencies in human assessment of trajectories. We then show how RM can be approached as a multiple instance learning (MIL) problem, and develop new MIL models that are able to capture the time dependencies in labelled trajectories. We demonstrate on a range of RL tasks that our novel MIL models can reconstruct reward functions to a high level of accuracy, and that they provide interpretable learnt hidden information that can be used to train high-performing agent policies.
    A hybrid approach to seismic deblending: when physics meets self-supervision. (arXiv:2205.15395v1 [physics.geo-ph])
    To limit the time, cost, and environmental impact associated with the acquisition of seismic data, considerable effort has been put in recent decades into so-called simultaneous shooting acquisitions, where seismic sources are fired at short time intervals from each other. As a consequence, waves originating from consecutive shots are entangled within the seismic recordings, yielding so-called blended data. For processing and imaging purposes, the data generated by each individual shot must be retrieved. This process, called deblending, is achieved by solving an inverse problem which is heavily underdetermined. Conventional approaches rely on transformations that render the blending noise into burst-like noise, whilst preserving the signal of interest. Compressed-sensing-type regularization is then applied, where sparsity in some domain is assumed for the signal of interest. The domain of choice depends on the geometry of the acquisition and the properties of seismic data within the chosen domain. In this work, we introduce a new concept that consists of embedding a self-supervised denoising network into the Plug-and-Play (PnP) framework. A novel network is introduced whose design extends the blind-spot network architecture of [28] to partially coherent noise (i.e., noise correlated in time). The network is then trained directly on the noisy input data at each step of the PnP algorithm. By leveraging both the underlying physics of the problem and the great denoising capabilities of our blind-spot network, the proposed algorithm is shown to outperform an industry-standard method whilst being comparable in terms of computational cost. Moreover, being independent of the acquisition geometry, our method can be easily applied to both marine and land data without any significant modification.
    Bayesian Active Learning for Scanning Probe Microscopy: from Gaussian Processes to Hypothesis Learning. (arXiv:2205.15458v1 [cond-mat.mtrl-sci])
    Recent progress in machine learning methods, and the emerging availability of programmable interfaces for scanning probe microscopes (SPMs), have propelled automated and autonomous microscopies to the forefront of attention of the scientific community. However, enabling automated microscopy requires the development of task-specific machine learning methods, understanding the interplay between physics discovery and machine learning, and fully defined discovery workflows. This, in turn, requires balancing the physical intuition and prior knowledge of the domain scientist with rewards that define experimental goals and machine learning algorithms that can translate these to specific experimental protocols. Here, we discuss the basic principles of Bayesian active learning and illustrate its applications for SPM. We progress from the Gaussian Process as a simple data-driven method and Bayesian inference for physical models as an extension of physics-based functional fits to more complex deep kernel learning methods, structured Gaussian Processes, and hypothesis learning. These frameworks allow for the use of prior data, the discovery of specific functionalities as encoded in spectral data, and exploration of physical laws manifesting during the experiment. The discussed framework can be universally applied to all techniques combining imaging and spectroscopy, SPM methods, nanoindentation, electron microscopy and spectroscopy, and chemical imaging methods, and can be particularly impactful for destructive or irreversible measurements.
    Simplex Neural Population Learning: Any-Mixture Bayes-Optimality in Symmetric Zero-sum Games. (arXiv:2205.15879v1 [cs.AI])
    Learning to play optimally against any mixture over a diverse set of strategies is of significant practical interest in competitive games. In this paper, we propose simplex-NeuPL, which satisfies two desiderata simultaneously: i) learning a population of strategically diverse basis policies, represented by a single conditional network; ii) using the same network to learn best-responses to any mixture over the simplex of basis policies. We show that the resulting conditional policies incorporate prior information about their opponents effectively, enabling near-optimal returns against arbitrary mixture policies in a game with tractable best-responses. We verify that such policies behave Bayes-optimally under uncertainty and offer insights into using this flexibility at test time. Finally, we offer evidence that learning best-responses to any mixture policies is an effective auxiliary task for strategic exploration, which, by itself, can lead to more performant populations.
    Graph Backup: Data Efficient Backup Exploiting Markovian Transitions. (arXiv:2205.15824v1 [cs.LG])
    The successes of deep Reinforcement Learning (RL) are limited to settings where we have a large stream of online experiences, but applying RL in the data-efficient setting with limited access to online interactions is still challenging. A key to data-efficient RL is good value estimation, but current methods in this space fail to fully utilise the structure of the trajectory data gathered from the environment. In this paper, we treat the transition data of the MDP as a graph, and define a novel backup operator, Graph Backup, which exploits this graph structure for better value estimation. Compared to multi-step backup methods such as $n$-step $Q$-Learning and TD($\lambda$), Graph Backup can perform counterfactual credit assignment and gives stable value estimates for a state regardless of which trajectory the state is sampled from. Our method, when combined with popular value-based methods, provides improved performance over one-step and multi-step methods on a suite of data-efficient RL benchmarks including MiniGrid, MinAtar and Atari100K. We further analyse the reasons for this performance boost through a novel visualisation of the transition graphs of Atari games.
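    A tabular sketch of the idea (ours, simplified; the paper combines the operator with deep value-based agents): pool all transitions into a graph keyed by state, then back values up over the graph, so the estimate for a state uses every observed successor rather than only those on one sampled trajectory:

    ```python
    from collections import defaultdict

    def graph_backup(transitions, gamma=0.99, sweeps=50):
        """Treat pooled (s, a, r, s') transitions as a graph and back up values
        over it: every visit to a state shares credit, whichever trajectory it
        came from. Assumes hashable states and actions."""
        graph = defaultdict(list)                 # s -> list of (a, r, s')
        for s, a, r, s_next in transitions:
            graph[s].append((a, r, s_next))
        V = defaultdict(float)                    # unseen states default to 0
        for _ in range(sweeps):
            for s, edges in graph.items():
                q = defaultdict(list)
                for a, r, s_next in edges:
                    q[a].append(r + gamma * V[s_next])
                # Average per-action returns, then take the greedy value.
                V[s] = max(sum(v) / len(v) for v in q.values())
        return V
    ```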
    Rethinking Graph Neural Networks for Anomaly Detection. (arXiv:2205.15508v1 [cs.LG])
    Graph Neural Networks (GNNs) are widely applied for graph anomaly detection. Since one of the key components of GNN design is selecting a tailored spectral filter, we take the first step towards analyzing anomalies via the lens of the graph spectrum. Our crucial observation is that the existence of anomalies leads to a `right-shift' phenomenon, that is, the spectral energy distribution concentrates less on low frequencies and more on high frequencies. This fact motivates us to propose the Beta Wavelet Graph Neural Network (BWGNN). BWGNN has spectrally and spatially localized band-pass filters to better handle the `right-shift' phenomenon in anomalies. We demonstrate the effectiveness of BWGNN on four large-scale anomaly detection datasets. Our code and data are released at https://github.com/squareRoot3/Rethinking-Anomaly-Detection
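    The `right-shift' observation can be checked directly; a small NumPy sketch (ours) that computes the spectral energy distribution of a graph signal over the eigenvalues of the symmetric normalized Laplacian:

    ```python
    import numpy as np

    def spectral_energy(adj: np.ndarray, signal: np.ndarray) -> np.ndarray:
        """Energy of a graph signal at each Laplacian frequency; with anomalies,
        mass is expected to shift from low to high frequencies ('right-shift')."""
        adj = adj.astype(float)
        deg = adj.sum(axis=1)
        d_inv_sqrt = np.zeros_like(deg)
        d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
        lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
        eigvals, eigvecs = np.linalg.eigh(lap)    # frequencies lie in [0, 2]
        coeffs = eigvecs.T @ signal               # graph Fourier transform
        energy = coeffs ** 2
        return energy / energy.sum()              # distribution over eigvals
    ```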
    SOM-CPC: Unsupervised Contrastive Learning with Self-Organizing Maps for Structured Representations of High-Rate Time Series. (arXiv:2205.15875v1 [cs.LG])
    Continuous monitoring with an ever-increasing number of sensors has become ubiquitous across many application domains. Acquired data are typically high-dimensional and difficult to interpret, but they are also hypothesized to lie on a lower-dimensional manifold. Many deep learning (DL) models aim to identify this manifold, but do not promote structure or interpretability. We propose the SOM-CPC model, which jointly optimizes Contrastive Predictive Coding (CPC) and a Self-Organizing Map (SOM) to find such an organized manifold. We address a largely unexplored and challenging set of scenarios comprising high-rate time series, and show on synthetic and real-life medical and audio data that SOM-CPC outperforms strong baseline models that combine DL with SOMs. SOM-CPC has great potential to expose latent patterns in high-rate data streams, and may therefore contribute to a better understanding of many different processes and systems.
    Few-Shot Diffusion Models. (arXiv:2205.15463v1 [cs.CV])
    Denoising diffusion probabilistic models (DDPM) are powerful hierarchical latent variable models with remarkable sample generation quality and training stability. These properties can be attributed to parameter sharing in the generative hierarchy, as well as a parameter-free diffusion-based inference procedure. In this paper, we present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs. FSDMs are trained to adapt the generative process conditioned on a small set of images from a given class by aggregating image patch information using a set-based Vision Transformer (ViT). At test time, the model is able to generate samples from previously unseen classes conditioned on as few as 5 samples from that class. We empirically show that FSDM can perform few-shot generation and transfer to new datasets. We benchmark variants of our method on complex vision datasets for few-shot learning and compare to unconditional and conditional DDPM baselines. Additionally, we show how conditioning the model on patch-based input set information improves training convergence.
    Searching for the Essence of Adversarial Perturbations. (arXiv:2205.15357v1 [cs.LG])
    Neural networks have achieved state-of-the-art performance in various machine learning fields, yet the incorporation of malicious perturbations into input data (adversarial examples) can fool neural networks' predictions. This leads to potential risks in real-world applications, for example, autonomous driving and facial recognition. However, the reason for the existence of adversarial examples remains controversial. Here we demonstrate that adversarial perturbations contain human-recognizable information, which is the key conspirator responsible for a neural network's erroneous predictions. This concept of human-recognizable information allows us to explain key features of adversarial perturbations, including the existence of adversarial examples, the transferability among different neural networks, and the increased interpretability of adversarially trained networks. Two unique properties of adversarial perturbations that fool neural networks are uncovered: masking and generation. A special class, the complementary class, is identified when neural networks classify input images. The human-recognizable information contained in adversarial perturbations allows researchers to gain insight into the working principles of neural networks and may lead to techniques that detect and defend against adversarial attacks.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v1 [cs.LG])
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts (``the bottleneck'') and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require concept labels in the training data to learn the bottleneck and do not leverage strong pretrained models. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address the limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining interpretability benefits. When concept annotation is not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts. PCBM also enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new (potentially different) data. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using any data from the target domain or model retraining.
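    A minimal sketch of the post-hoc recipe under our own simplifying assumptions (the concept directions are given, e.g. from CAV-style probes or text encoders; the paper's residual fitting and editing machinery are not shown):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_pcbm(embeddings, labels, concept_vectors, sparsity_C=0.1):
        """Sketch of a post-hoc concept bottleneck: project frozen-backbone
        embeddings (n, d) onto concept directions (k, d), then fit a sparse
        linear classifier on the resulting concept scores."""
        concepts = embeddings @ concept_vectors.T     # (n, k) concept scores
        clf = LogisticRegression(penalty="l1", C=sparsity_C,
                                 solver="saga", max_iter=5000)
        clf.fit(concepts, labels)
        return clf    # clf.coef_ gives per-concept importances, open to editing
    ```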
    Fairness in the First Stage of Two-Stage Recommender Systems. (arXiv:2205.15436v1 [cs.IR])
    Many large-scale recommender systems consist of two stages, where the first stage focuses on efficiently generating a small subset of promising candidates from a huge pool of items for the second-stage model to curate final recommendations from. In this paper, we investigate how to ensure group fairness to the items in this two-stage paradigm. In particular, we find that existing first-stage recommenders might select an irrecoverably unfair set of candidates, such that there is no hope for the second-stage recommender to deliver fair recommendations. To this end, we propose two threshold-policy selection rules that, given any relevance model of queries and items and a point-wise lower confidence bound on the expected number of relevant items for each policy, find near-optimal sets of candidates that contain enough relevant items in expectation from each group of items. To instantiate the rules, we demonstrate how to derive such confidence bounds from potentially partial and biased user feedback data, which are abundant in many large-scale recommender systems. In addition, we provide both finite-sample and asymptotic analyses of how close the two threshold selection rules are to the optimal thresholds. Beyond this theoretical analysis, we show empirically that these two rules can consistently select enough relevant items from each group while minimizing the size of the candidate sets for a wide range of settings.
    Reinforcement Learning with a Terminator. (arXiv:2205.15376v1 [cs.LG])
    We present the problem of reinforcement learning with exogenous termination. We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer. This formulation accounts for numerous real-world situations, such as a human interrupting an autonomous driving agent for reasons of discomfort. We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds. We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret. Motivated by our theoretical analysis, we design and implement a scalable approach, which combines optimism (w.r.t. termination) and a dynamic discount factor, incorporating the termination probability. We deploy our method on high-dimensional driving and MinAtar benchmarks. Additionally, we test our approach on human data in a driving setting. Our results demonstrate fast convergence and significant improvement over various baseline approaches.
    Segmentation Consistency Training: Out-of-Distribution Generalization for Medical Image Segmentation. (arXiv:2205.15428v1 [cs.CV])
    Generalizability is seen as one of the major challenges in deep learning, in particular in the domain of medical imaging, where a change of hospital or in imaging routines can lead to a complete failure of a model. To tackle this, we introduce Consistency Training, a training procedure and alternative to data augmentation based on maximizing models' prediction consistency across augmented and unaugmented data in order to facilitate better out-of-distribution generalization. To this end, we develop a novel region-based segmentation loss function called Segmentation Inconsistency Loss (SIL), which considers the differences between pairs of augmented and unaugmented predictions and labels. We demonstrate that Consistency Training outperforms conventional data augmentation on several out-of-distribution datasets on polyp segmentation, a popular medical task.
    Neural Optimal Transport with General Cost Functionals. (arXiv:2205.15403v1 [cs.LG])
    We present a novel neural-networks-based algorithm to compute optimal transport (OT) plans and maps for general cost functionals. The algorithm is based on a saddle-point reformulation of the OT problem and generalizes prior OT methods for weak and strong cost functionals. As an application, we construct a functional that maps data distributions while preserving the class-wise structure of the data.
    Multiscale modeling of inelastic materials with Thermodynamics-based Artificial Neural Networks (TANN). (arXiv:2108.13137v3 [cond-mat.mtrl-sci] UPDATED)
    The mechanical behavior of inelastic materials with microstructure is very complex and hard to grasp with heuristic, empirical constitutive models. For this purpose, multiscale, homogenization approaches are often used for performing reliable, accurate predictions of the macroscopic mechanical behavior of solids and structures. Nevertheless, the calculation cost of such approaches is extremely high and prohibitive for real-scale applications involving inelastic materials. Here, we propose the so-called Thermodynamics-based Artificial Neural Networks (TANN) for the constitutive modeling of materials with inelastic and complex microstructure. Our approach integrates thermodynamics-aware dimensionality reduction techniques and thermodynamics-based deep neural networks to identify, in an autonomous way, the constitutive laws and discover the internal state variables of complex inelastic materials. The efficiency and accuracy of TANN in predicting the average and local stress-strain response, the free energy and the dissipation rate are demonstrated for both regular and perturbed two- and three-dimensional lattice microstructures in inelasticity. TANN manage to identify the internal state variables that characterize the inelastic deformation of the complex microstructural fields. These internal state variables are then used to reconstruct the microdeformation fields of the microstructure at a given state. Finally, a double-scale homogenization scheme (FEMxTANN) is used to solve a large-scale boundary value problem. The high performance of the homogenized model using TANN is illustrated through detailed comparisons with microstructural calculations at large scale. An excellent agreement is shown for a variety of monotonic and cyclic stress-strain paths.
    Learning (Very) Simple Generative Models Is Hard. (arXiv:2205.16003v1 [cs.LG])
    Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem. For an unknown neural network $F:\mathbb{R}^d\to\mathbb{R}^{d'}$, let $D$ be the distribution over $\mathbb{R}^{d'}$ given by pushing the standard Gaussian $\mathcal{N}(0,\textrm{Id}_d)$ through $F$. Given i.i.d. samples from $D$, the goal is to output any distribution close to $D$ in statistical distance. We show under the statistical query (SQ) model that no polynomial-time algorithm can solve this problem even when the output coordinates of $F$ are one-hidden-layer ReLU networks with $\log(d)$ neurons. Previously, the best lower bounds for this problem simply followed from lower bounds for supervised learning and required at least two hidden layers and $\mathrm{poly}(d)$ neurons [Daniely-Vardi '21, Chen-Gollakota-Klivans-Meka '22]. The key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $f$ with polynomially-bounded slopes such that the pushforward of $\mathcal{N}(0,1)$ under $f$ matches all low-degree moments of $\mathcal{N}(0,1)$.
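    For concreteness, the learning setup can be stated in a few lines (a sketch of the data-generating process only; the hardness result concerns any SQ learner consuming such samples):

    ```python
    import numpy as np

    def relu_net(z: np.ndarray, layers) -> np.ndarray:
        """Apply a feed-forward ReLU network F to a latent vector z."""
        for W, b in layers[:-1]:
            z = np.maximum(W @ z + b, 0.0)
        W, b = layers[-1]
        return W @ z + b

    def sample_pushforward(layers, d: int, n: int) -> np.ndarray:
        """Draw n i.i.d. samples from D = F#N(0, Id_d); the learner sees only these."""
        return np.stack([relu_net(np.random.randn(d), layers) for _ in range(n)])
    ```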
    Molecular Dipole Moment Learning via Rotationally Equivariant Gaussian Process Regression with Derivatives in Molecular-orbital-based Machine Learning. (arXiv:2205.15510v1 [physics.chem-ph])
    This study extends the accurate and transferable molecular-orbital-based machine learning (MOB-ML) approach to modeling the contribution of electron correlation to dipole moments at the cost of Hartree-Fock computations. A molecular-orbital-based (MOB) pairwise decomposition of the correlation part of the dipole moment is applied, and these pair dipole moments can be further regressed as a universal function of molecular orbitals (MOs). The dipole MOB features consist of the energy MOB features and their responses to electric fields. An interpretable and rotationally equivariant Gaussian process regression (GPR) with derivatives algorithm is introduced to learn the dipole moment more efficiently. The proposed problem setup, feature design, and ML algorithm are shown to provide highly accurate models for both dipole moments and energies on water and fourteen small molecules. To demonstrate the ability of MOB-ML to function as generalized density-matrix functionals for molecular dipole moments and energies of organic molecules, we further apply the proposed MOB-ML approach to train and test on molecules from the QM9 dataset. The application of local scalable GPR with Gaussian mixture model unsupervised clustering (GMM/GPR) scales up MOB-ML to a large-data regime while retaining prediction accuracy. In addition, compared with literature results, MOB-ML provides the best test MAEs of 4.21 mDebye and 0.045 kcal/mol for dipole moment and energy models, respectively, when training on 110,000 QM9 molecules. The excellent transferability of the resulting QM9 models is also illustrated by the accurate predictions for four different series of peptides.
    Timing is Everything: Learning to Act Selectively with Costly Actions and Budgetary Constraints. (arXiv:2205.15953v1 [cs.LG])
    Many real-world settings involve costs for performing actions; transaction costs in financial systems and fuel costs are common examples. In these settings, performing actions at each time step quickly accumulates costs, leading to vastly suboptimal outcomes. Additionally, repeatedly acting produces wear and tear and, ultimately, damage. Determining when to act is crucial for achieving successful outcomes, and yet the challenge of efficiently learning to behave optimally when actions incur minimally bounded costs remains unresolved. In this paper, we introduce a reinforcement learning (RL) framework named Learnable Impulse Control Reinforcement Algorithm (LICRA) for learning to optimally select both when to act and which actions to take when actions incur costs. At the core of LICRA is a nested structure that combines RL with a form of policy known as impulse control, which learns to maximise objectives when actions incur costs. We prove that LICRA, which seamlessly adopts any RL method, converges to policies that optimally select when to perform actions and their optimal magnitudes. We then augment LICRA to handle problems in which the agent can perform at most $k<\infty$ actions and, more generally, faces a budget constraint. We show LICRA learns the optimal value function and ensures budget constraints are satisfied almost surely. We empirically demonstrate LICRA's superior performance against benchmark RL methods in OpenAI Gym's Lunar Lander, in Highway environments, and in a variant of the Merton portfolio problem from finance.
    Posterior and Computational Uncertainty in Gaussian Processes. (arXiv:2205.15449v1 [cs.LG])
    Gaussian processes scale prohibitively with the size of the dataset. In response, many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. Therefore in practice, GP models are often as much about the approximation method as they are about the data. Here, we develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended. The most common GP approximations map to an instance in this class, such as methods based on the Cholesky factorization, conjugate gradients, and inducing points. For any method in this class, we prove (i) convergence of its posterior mean in the associated RKHS, (ii) decomposability of its combined posterior covariance into mathematical and computational covariances, and (iii) that the combined variance is a tight worst-case bound for the squared error between the method's posterior mean and the latent function. Finally, we empirically demonstrate the consequences of ignoring computational uncertainty and show how implicitly modeling it improves generalization performance on benchmark datasets.
    FedWalk: Communication Efficient Federated Unsupervised Node Embedding with Differential Privacy. (arXiv:2205.15896v1 [cs.DC])
    Node embedding aims to map nodes of a complex graph into low-dimensional representations. Real-world large-scale graphs and the difficulty of labeling motivate wide studies of unsupervised node embedding problems. Nevertheless, previous efforts mostly operate in a centralized setting where a complete graph is given. With the growing awareness of data privacy, data holders who are only aware of one vertex and its neighbours demand greater privacy protection. In this paper, we introduce FedWalk, a random-walk-based unsupervised node embedding algorithm that operates in such a node-level visibility graph with raw graph information remaining local. FedWalk is designed to offer centralized-competitive graph representation capability with data privacy protection and great communication efficiency. FedWalk instantiates the prevalent federated paradigm and contains three modules. We first design a hierarchical clustering tree (HCT) constructor to extract the structural feature of each node. A dynamic time warping algorithm seamlessly handles the structural heterogeneity across different nodes. Based on the constructed HCT, we then design a random walk generator, wherein a sequence encoder is designed to preserve privacy and a two-hop neighbor predictor is designed to save communication cost. The generated random walks are then used to update node embeddings based on a SkipGram model. Extensive experiments on two large graphs demonstrate that FedWalk achieves competitive representativeness compared with a centralized node embedding algorithm, with only up to a 1.8% Micro-F1 and 4.4% Macro-F1 loss, while reducing inter-device communication per walk by about a factor of 6.7.
    Fooling SHAP with Stealthily Biased Sampling. (arXiv:2205.15419v1 [cs.LG])
    SHAP explanations aim at identifying which features contribute the most to the difference in model prediction at a specific input versus a background distribution. Recent studies have shown that they can be manipulated by malicious adversaries to produce arbitrary desired explanations. However, existing attacks focus solely on altering the black-box model itself. In this paper, we propose a complementary family of attacks that leave the model intact and manipulate SHAP explanations using stealthily biased sampling of the data points used to approximate expectations w.r.t. the background distribution. In the context of a fairness audit, we show that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups, while remaining undetected. These results highlight the manipulability of SHAP explanations and encourage auditors to treat post-hoc explanations with skepticism.
    A Unified Weight Initialization Paradigm for Tensorial Convolutional Neural Networks. (arXiv:2205.15307v1 [cs.LG])
    Tensorial Convolutional Neural Networks (TCNNs) have attracted much research attention for their power in reducing model parameters or enhancing generalization ability. However, exploration of TCNNs is hindered even at the level of weight initialization. Specifically, general initialization methods, such as Xavier or Kaiming initialization, usually fail to generate appropriate weights for TCNNs. Meanwhile, although there are ad-hoc approaches for specific architectures (e.g., Tensor Ring Nets), they are not applicable to TCNNs with other tensor decomposition methods (e.g., CP or Tucker decomposition). To address this problem, we propose a universal weight initialization paradigm, which generalizes the Xavier and Kaiming methods and is widely applicable to arbitrary TCNNs. Specifically, we first present the Reproducing Transformation to convert the backward process in TCNNs into an equivalent convolution process. Then, based on the convolution operators in the forward and backward processes, we build a unified paradigm to control the variance of features and gradients in TCNNs. Thus, we can derive fan-in and fan-out initializations for various TCNNs. We demonstrate that our paradigm can stabilize the training of TCNNs, leading to faster convergence and better results.
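    The underlying rule is the familiar fan-based variance control; a sketch (ours) of its generic form, where computing the *effective* fan of the composed convolution realized by the tensor factors is exactly what the paper's Reproducing Transformation provides, and is taken here as an input:

    ```python
    import math
    import torch

    def unified_fan_init(weight: torch.Tensor, fan: int, gain: float = 2.0):
        """Xavier/Kaiming-style rule in generic form: keep the variance of
        features (or gradients) fixed by drawing W ~ N(0, gain / fan). For a
        tensorized layer, `fan` must be the effective fan of the composed
        convolution that the tensor factors realize, not the raw factor shape."""
        std = math.sqrt(gain / fan)
        with torch.no_grad():
            weight.normal_(0.0, std)

    # For a plain conv layer, the effective fan-in is in_channels * prod(kernel_size):
    w = torch.empty(64, 32, 3, 3)
    unified_fan_init(w, fan=32 * 3 * 3)
    ```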
    Painful intelligence: What AI can tell us about human suffering. (arXiv:2205.15409v1 [cs.LG])
    This book uses the modern theory of artificial intelligence (AI) to understand human suffering or mental pain. Both humans and sophisticated AI agents process information about the world in order to achieve goals and obtain rewards, which is why AI can be used as a model of the human brain and mind. This book intends to make the theory accessible to a relatively general audience, requiring only some relevant scientific background. The book starts with the assumption that suffering is mainly caused by frustration. Frustration means the failure of an agent (whether AI or human) to achieve a goal or a reward it wanted or expected. Frustration is inevitable because of the overwhelming complexity of the world, limited computational resources, and scarcity of good data. In particular, such limitations imply that an agent acting in the real world must cope with uncontrollability, unpredictability, and uncertainty, which all lead to frustration. Fundamental in such modelling is the idea of learning, or adaptation to the environment. While AI uses machine learning, humans and animals adapt by a combination of evolutionary mechanisms and ordinary learning. Even frustration is fundamentally an error signal that the system uses for learning. This book explores various aspects and limitations of learning algorithms and their implications regarding suffering. At the end of the book, the computational theory is used to derive various interventions or training methods that will reduce suffering in humans. The amount of frustration is expressed by a simple equation which indicates how it can be reduced. The ensuing interventions are very similar to those proposed by Buddhist and Stoic philosophy, and include mindfulness meditation. Therefore, this book can be interpreted as an exposition of a computational theory justifying why such philosophies and meditation reduce human suffering.
    Continual Object Detection: A review of definitions, strategies, and challenges. (arXiv:2205.15445v1 [cs.CV])
    The field of Continual Learning investigates the ability to learn consecutive tasks without losing performance on those previously learned. Its focus has been mainly on incremental classification tasks. We believe that research in continual object detection deserves even more attention due to its vast range of applications in robotics and autonomous vehicles. This scenario is more complex than conventional classification, given the occurrence of instances of classes that are unknown at the time but can appear in subsequent tasks as new classes to be learned, resulting in missing annotations and conflicts with the background label. In this review, we analyze the current strategies proposed to tackle the problem of class-incremental object detection. Our main contributions are: (1) a short and systematic review of the methods that propose solutions to traditional incremental object detection scenarios; (2) a comprehensive evaluation of the existing approaches using a new metric to quantify the stability and plasticity of each technique in a standard way; (3) an overview of the current trends within continual object detection and a discussion of possible future research directions.
    Hierarchies of Reward Machines. (arXiv:2205.15752v1 [cs.LG])
    Reward machines (RMs) are a recent formalism for representing the reward function of a reinforcement learning task through a finite-state machine whose edges encode landmarks of the task using high-level events. The structure of RMs enables the decomposition of a task into simpler and independently solvable subtasks that help tackle long-horizon and/or sparse reward tasks. We propose a formalism for further abstracting the subtask structure by endowing an RM with the ability to call other RMs, thus composing a hierarchy of RMs (HRM). We exploit HRMs by treating each call to an RM as an independently solvable subtask using the options framework, and describe a curriculum-based method to induce HRMs from example traces observed by the agent. Our experiments reveal that exploiting a handcrafted HRM leads to faster convergence than with a flat HRM, and that learning an HRM is more scalable than learning an equivalent flat HRM.
    Learning brain MRI quality control: a multi-factorial generalization problem. (arXiv:2205.15898v1 [stat.ML])
    Due to the growing number of MRI data, automated quality control (QC) has become essential, especially for larger-scale analysis. Several attempts have been made to develop reliable and scalable QC pipelines. However, the generalization of these methods to new data, independent of those used for learning, is a difficult problem because of the biases inherent in MRI data. This work aimed to evaluate the performance of the MRIQC pipeline on various large-scale datasets (ABIDE, N = 1102, and CATI-derived datasets, N = 9037) used for both training and evaluation purposes. We focused our analysis on the MRIQC preprocessing steps and tested the pipeline with and without them. We further analyzed the site-wise and study-wise predicted classification probability distributions of the models without preprocessing trained on ABIDE and CATI data. Our main result was that a model using features extracted from MRIQC without preprocessing yielded the best results when trained and evaluated on large multi-center datasets with a heterogeneous population (an improvement of the ROC-AUC score on unseen data of 0.10 for the model trained on a subset of the CATI dataset). We concluded that a model trained with data from a heterogeneous population, such as the CATI dataset, provides the best scores on unseen data. In spite of the performance improvement, the generalization abilities of the models remain questionable when looking at the site-wise/study-wise probability predictions and the optimal classification threshold derived from them.
    Designing Rewards for Fast Learning. (arXiv:2205.15400v1 [cs.LG])
    To convey desired behavior to a Reinforcement Learning (RL) agent, a designer must choose a reward function for the environment, arguably the most important knob designers have in interacting with RL agents. Although many reward functions induce the same optimal behavior (Ng et al., 1999), in practice, some of them result in faster learning than others. In this paper, we look at how reward-design choices impact learning speed and seek to identify principles of good reward design that quickly induce target behavior. This reward-identification problem is framed as an optimization problem: first, we advocate choosing state-based rewards that maximize the action gap, making optimal actions easy to distinguish from suboptimal ones; second, we propose minimizing a measure of the horizon, which we call the "subjective discount", over which rewards need to be optimized, to encourage agents to make optimal decisions with less lookahead. To solve this optimization problem, we propose a linear-programming-based algorithm that efficiently finds a reward function maximizing the action gap and minimizing the subjective discount. We test the rewards generated with the algorithm in tabular environments with Q-Learning, and empirically show that they lead to faster learning. Although we focus only on Q-Learning because it is perhaps the simplest and most well-understood RL algorithm, preliminary results with R-max (Brafman and Tennenholtz, 2000) suggest our results are much more general. Our experiments support three principles of reward design: 1) Consistent with existing results, penalizing each step taken induces faster learning than rewarding the goal. 2) When rewarding subgoals along the target trajectory, rewards should gradually increase as the goal gets closer. 3) A dense reward that is nonzero in every state is only good if designed carefully.
    Holistic Generalized Linear Models. (arXiv:2205.15447v1 [stat.ML])
    Holistic linear regression extends the classical best subset selection problem by adding additional constraints designed to improve the model quality. These constraints include sparsity-inducing constraints, sign-coherence constraints and linear constraints. The $\textsf{R}$ package $\texttt{holiglm}$ provides functionality to model and fit holistic generalized linear models. By making use of state-of-the-art conic mixed-integer solvers, the package can reliably solve GLMs for Gaussian, binomial and Poisson responses with a multitude of holistic constraints. The high-level interface simplifies the constraint specification and can be used as a drop-in replacement for the $\texttt{stats::glm()}$ function.
    Predicting Day-Ahead Stock Returns using Search Engine Query Volumes: An Application of Gradient Boosted Decision Trees to the S&P 100. (arXiv:2205.15853v1 [econ.EM])
    The internet has changed the way we live, work, and make decisions. As it is the major modern resource for research, detailed data on internet usage exhibits vast amounts of behavioral information. This paper aims to answer the question of whether this information can be leveraged to predict future returns of stocks on financial capital markets. In an empirical analysis, it implements gradient boosted decision trees to learn relationships between abnormal returns of stocks within the S&P 100 index and lagged predictors derived from historical financial data, as well as search term query volumes on the internet search engine Google. Models predict the occurrence of day-ahead stock returns in excess of the index median. On a time frame from 2005 to 2017, all disparate datasets exhibit valuable information. Evaluated models have average areas under the receiver operating characteristic curve between 54.2% and 56.7%, clearly indicating classification better than random guessing. Implementing a simple statistical arbitrage strategy, the models are used to create daily trading portfolios of ten stocks and result in annual performances of more than 57% before transaction costs. With ensembles of different datasets topping the performance ranking, the results further question the weak and semi-strong forms of efficiency of modern financial capital markets. Even though transaction costs are not included, the approach adds to the existing literature. It gives guidance on how to use and transform data on internet usage behavior for financial and economic modeling and forecasting.
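    A sketch of this kind of model with scikit-learn on hypothetical data (column names and shapes are ours; the paper's feature engineering and universe construction are more involved):

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score

    # Hypothetical frame: one row per (stock, day), lagged returns and lagged
    # search-volume changes as predictors; target is whether the next-day
    # abnormal return exceeds the index median.
    df = pd.DataFrame({
        "ret_lag1": np.random.randn(1000),
        "ret_lag5": np.random.randn(1000),
        "svi_change_lag1": np.random.randn(1000),   # search volume index change
        "above_median_next": np.random.randint(0, 2, 1000),
    })
    X, y = df.drop(columns="above_median_next"), df["above_median_next"]
    split = 800                                      # chronological split: no look-ahead
    model = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.05)
    model.fit(X[:split], y[:split])
    print("test AUC:", roc_auc_score(y[split:], model.predict_proba(X[split:])[:, 1]))
    ```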
    Critic Sequential Monte Carlo. (arXiv:2205.15460v1 [stat.ML])
    We introduce CriticSMC, a new algorithm for planning as inference built from a novel composition of sequential Monte Carlo with learned soft-Q function heuristic factors. This algorithm is structured so as to allow using large numbers of putative particles, leading to efficient utilization of computational resources and effective discovery of high-reward trajectories even in environments with difficult reward surfaces, such as those arising from hard constraints. Relative to prior art, our approach is notably still compatible with model-free reinforcement learning, in the sense that the implicit policy we produce can be used at test time in the absence of a world model. Our experiments on self-driving car collision avoidance in simulation demonstrate improvements against baselines in terms of infraction minimization relative to computational effort, while maintaining the diversity and realism of found trajectories.
    PolypConnect: Image inpainting for generating realistic gastrointestinal tract images with polyps. (arXiv:2205.15413v1 [eess.IV])
    Early identification of a polyp in the lower gastrointestinal (GI) tract can lead to prevention of life-threatening colorectal cancer. Developing computer-aided diagnosis (CAD) systems to detect polyps can improve detection accuracy and efficiency and save the time of the domain experts called endoscopists. Lack of annotated data is a common challenge when building CAD systems. Generating synthetic medical data is an active research area aimed at overcoming the problem of having relatively few true positive cases in the medical domain. To efficiently train the machine learning (ML) models at the core of CAD systems, a considerable amount of data is needed. In this respect, we propose the PolypConnect pipeline, which can convert non-polyp images into polyp images to increase the size of training datasets. We present the whole pipeline with quantitative and qualitative evaluations involving endoscopists. The polyp segmentation model trained using synthetic and real data shows a 5.1% improvement of mean intersection over union (mIOU) compared to the model trained using only real data. The code for all experiments is available on GitHub to reproduce the results.
    Chefs' Random Tables: Non-Trigonometric Random Features. (arXiv:2205.15317v1 [cs.LG])
    We introduce chefs' random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive, a key requirement for applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the optimal positive random features (OPRFs), is to our knowledge the first RF method for unbiased softmax kernel estimation with positive and bounded RFs, resulting in exponentially small tails and much lower variance than its counterparts. As we show, orthogonal random features applied in OPRFs provide additional variance reduction for any dimensionality $d$ (not only asymptotically for sufficiently large $d$, as for RKS). We test CRTs on many tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new state-of-the-art results for low-rank text Transformers, while providing linear space and time complexity.
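    For intuition, the positivity requirement can already be met by the Performer-style (FAVOR+) exponential map, which CRTs/OPRFs refine with non-trigonometric, lower-variance constructions; a sketch (ours) of the baseline identity exp(x.y) = E_{w~N(0,I)}[exp(w.x - |x|^2/2) exp(w.y - |y|^2/2)]:

    ```python
    import numpy as np

    def positive_rf(x: np.ndarray, W: np.ndarray) -> np.ndarray:
        """Positive random features for the softmax kernel: one nonnegative
        feature per random direction w, scaled so the dot product of two
        feature maps is an unbiased estimate of exp(x.y)."""
        m = W.shape[0]
        return np.exp(W @ x - x @ x / 2.0) / np.sqrt(m)

    rng = np.random.default_rng(0)
    d, m = 16, 4096
    x, y = rng.standard_normal(d) / 4, rng.standard_normal(d) / 4
    W = rng.standard_normal((m, d))
    approx = positive_rf(x, W) @ positive_rf(y, W)
    print(approx, np.exp(x @ y))   # Monte Carlo estimate vs. exact softmax kernel
    ```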
    Superposing Many Tickets into One: A Performance Booster for Sparse Neural Network Training. (arXiv:2205.15322v1 [cs.LG])
    Recent works on sparse neural network training (sparse training) have shown that a compelling trade-off between performance and efficiency can be achieved by training intrinsically sparse neural networks from scratch. Existing sparse training methods usually strive to find the best sparse subnetwork possible in one single run, without involving any expensive dense or pre-training steps. For instance, dynamic sparse training (DST), as one of the most prominent directions, is capable of reaching a competitive performance of dense training by iteratively evolving the sparse topology during the course of training. In this paper, we argue that it is better to allocate the limited resources to create multiple low-loss sparse subnetworks and superpose them into a stronger one, instead of allocating all resources entirely to find an individual subnetwork. To achieve this, two desiderata are required: (1) efficiently producing many low-loss subnetworks, the so-called cheap tickets, within one training process limited to the standard training time used in dense training; (2) effectively superposing these cheap tickets into one stronger subnetwork without going over the constrained parameter budget. To corroborate our conjecture, we present a novel sparse training approach, termed \textbf{Sup-tickets}, which can satisfy the above two desiderata concurrently in a single sparse-to-sparse training process. Across various modern architectures on CIFAR-10/100 and ImageNet, we show that Sup-tickets integrates seamlessly with the existing sparse training methods and demonstrates consistent performance improvement.
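    The superposition step can be pictured with a toy sketch: average the weights of several sparse subnetworks and re-prune by magnitude so the merged network stays within the parameter budget. This is an illustrative simplification under our own assumptions, not the exact Sup-tickets procedure.

```python
import numpy as np

def superpose_tickets(weight_list, sparsity):
    """Hypothetical sketch: superpose several sparse 'cheap tickets' by
    averaging their weights, then re-prune by magnitude so the merged
    subnetwork respects the original parameter budget."""
    avg = np.mean(weight_list, axis=0)           # superposition by averaging
    k = int(avg.size * (1.0 - sparsity))         # number of weights to keep
    thresh = np.sort(np.abs(avg).ravel())[-k]    # magnitude threshold for top-k
    return avg * (np.abs(avg) >= thresh)         # zero out everything else

rng = np.random.default_rng(0)
tickets = [rng.normal(size=(64, 64)) * (rng.random((64, 64)) < 0.1)
           for _ in range(3)]                    # three sparse 'cheap tickets'
merged = superpose_tickets(tickets, sparsity=0.9)
print((merged != 0).mean())                      # ~0.1: budget is respected
```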
    Infinite-dimensional optimization and Bayesian nonparametric learning of stochastic differential equations. (arXiv:2205.15368v1 [stat.ML])
    The paper has two major themes. The first part of the paper establishes certain general results for infinite-dimensional optimization problems on Hilbert spaces. These results cover the classical representer theorem and many of its variants as special cases and offer a wider scope of applications. The second part of the paper then develops a systematic approach for learning the drift function of a stochastic differential equation by integrating the results of the first part with a Bayesian hierarchical framework. Importantly, our Bayesian approach incorporates low-cost sparse learning through proper use of shrinkage priors while allowing proper quantification of uncertainty through posterior distributions. Several examples at the end illustrate the accuracy of our learning scheme.
    A Design Space for Explainable Ranking and Ranking Models. (arXiv:2205.15305v1 [cs.LG])
    Item ranking systems support users in multi-criteria decision-making tasks. Users need to trust rankings and ranking algorithms to reflect user preferences faithfully while avoiding systematic errors and biases. However, only a few approaches today help end users, model developers, and analysts explain rankings. We report on a study of explanation approaches from the perspectives of recommender systems, explainable AI, and visualization research, and propose the first cross-domain design space for explainers of item rankings. In addition, we leverage the descriptive power of the design space to characterize a) existing explainers and b) three main user groups involved in ranking explanation tasks. The generative power of the design space is a means for future designers and developers to create more target-oriented solutions in this only weakly exploited space.
    Learning Adaptive Propagation for Knowledge Graph Reasoning. (arXiv:2205.15319v1 [cs.LG])
    Due to the success of Graph Neural Networks (GNNs) in learning from graph-structured data, various GNN-based methods have been introduced to learn from knowledge graphs (KGs). In this paper, to reveal the key factors underlying existing GNN-based methods, we revisit exemplar works from the lens of the propagation path. We find that the answer entity can be close to the queried one, but the information dependency can be long. Thus, better reasoning performance can be obtained by exploring longer propagation paths. However, identifying such a long-range dependency in a KG is hard since the number of involved entities grows exponentially. This motivates us to learn an adaptive propagation path that filters out irrelevant entities while preserving promising targets during propagation. First, we design an incremental sampling mechanism in which close and promising targets are preserved. Second, we design a learning-based sampling distribution to identify the targets with fewer involved entities. In this way, the GNN can go deeper to capture long-range information. Extensive experiments show that our method is efficient and achieves state-of-the-art performance in both transductive and inductive reasoning settings, benefiting from the deeper propagation.
    Sepsis Prediction with Temporal Convolutional Networks. (arXiv:2205.15492v1 [cs.LG])
    We design and implement a temporal convolutional network model to predict sepsis onset. Our model is trained on data extracted from the MIMIC-III database, based on a retrospective analysis of patients admitted to the intensive care unit who did not fall under the definition of sepsis at the time of admission. Benchmarked against several machine learning models, our model is superior on this binary classification task, demonstrating the predictive power of convolutional networks for temporal patterns and showing the significant impact of a longer look-back time on sepsis prediction.
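    For readers unfamiliar with temporal convolutional networks, the sketch below shows the standard dilated causal convolution block they are built from; the paper's exact depth, channel counts, and feature set are not specified here, so all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    """A minimal dilated causal convolution block, the building unit of a
    temporal convolutional network (TCN). Illustrative only."""
    def __init__(self, c_in, c_out, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # enough padding so that no
        self.conv = nn.Conv1d(c_in, c_out, kernel_size,
                              padding=self.pad, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                         # x: (batch, channels, time)
        out = self.conv(x)[..., :x.size(-1)]      # trim right side: causal output
        return self.relu(out)

# Stacking blocks with growing dilation enlarges the look-back window; the
# sepsis risk is then scored from the final time step (hypothetical sizes).
net = nn.Sequential(CausalConvBlock(40, 64, dilation=1),
                    CausalConvBlock(64, 64, dilation=2),
                    CausalConvBlock(64, 64, dilation=4))
x = torch.randn(8, 40, 48)                        # e.g. 48 hourly measurements
risk_logit = nn.Linear(64, 1)(net(x)[..., -1])
print(risk_logit.shape)                           # torch.Size([8, 1])
```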
    A novel approach to rating transition modelling via Machine Learning and SDEs on Lie groups. (arXiv:2205.15699v1 [q-fin.RM])
    In this paper, we introduce a novel methodology to model rating transitions with a stochastic process. To introduce stochastic processes whose values are valid rating matrices, we exploit the geometric properties of stochastic matrices and their link to matrix Lie groups. We give a gentle introduction to this topic and demonstrate how Itô SDEs in R will generate the desired model for rating transitions. To calibrate the rating model to historical data, we use a Deep Neural Network (DNN) called TimeGAN to learn the features of a time series of historical rating matrices. Then, we use this DNN to generate synthetic rating transition matrices. Afterwards, we fit the moments of the generated rating matrices and the rating process at specific time points, which results in a good fit. After calibration, we discuss the quality of the calibrated rating transition process by examining some properties that a time series of rating matrices should satisfy, and we will see that this geometric approach works very well.
    A deep learning approach to halo merger tree construction. (arXiv:2205.15988v1 [astro-ph.GA])
    A key ingredient for semi-analytic models (SAMs) of galaxy formation is the mass assembly history of haloes, encoded in a tree structure. The most commonly used method to construct halo merger histories is based on the outcomes of high-resolution, computationally intensive N-body simulations. We show that machine learning (ML) techniques, in particular Generative Adversarial Networks (GANs), are a promising new tool to tackle this problem with a modest computational cost while retaining the best features of merger trees from simulations. We train our GAN model with a limited sample of merger trees from the EAGLE simulation suite, constructed using two halo finder and tree builder algorithm combinations: SUBFIND-D-TREES and ROCKSTAR-ConsistentTrees. Our GAN model successfully learns to generate well-constructed merger tree structures with high temporal resolution, and to reproduce the statistical features of the sample of merger trees used for training, when considering up to three variables in the training process. These inputs, whose representations are also learned by our GAN model, are the mass of the halo progenitors and the final descendant, the progenitor type (main halo or satellite), and the distance of a progenitor to that in the main branch. The inclusion of the latter two inputs greatly improves the final learned representation of the halo mass growth history, especially for SUBFIND-like ML trees. When comparing equally sized samples of ML merger trees with those of the EAGLE simulation, we find better agreement for SUBFIND-like ML trees. Finally, our GAN-based framework can be utilised to construct merger histories of low- and intermediate-mass haloes, the most abundant in cosmological simulations.
    Optimistic Whittle Index Policy: Online Learning for Restless Bandits. (arXiv:2205.15372v1 [cs.LG])
    Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. However, solving RMABs requires information on transition dynamics, which is often not available upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we formulate a bilinear program to compute the optimistic Whittle index from the confidence bounds in transition dynamics. Our algorithm, UCWhittle, achieves sublinear $O(\sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including on real-world maternal and childcare data aimed at reducing maternal mortality.
    Gluing Neural Networks Symbolically Through Hyperdimensional Computing. (arXiv:2205.15534v1 [cs.SC])
    Hyperdimensional Computing affords simple, yet powerful operations to create long Hyperdimensional Vectors (hypervectors) that can efficiently encode information, be used for learning, and are dynamic enough to be modified on the fly. In this paper, we explore the notion of using binary hypervectors to directly encode the final, classifying output signals of neural networks in order to fuse differing networks together at the symbolic level. This allows multiple neural networks to work together to solve a problem, with little additional overhead. Output signals just before classification are encoded as hypervectors and bundled together through consensus summation to train a classification hypervector. This process can be performed iteratively and even on single neural networks by instead making a consensus of multiple classification hypervectors. We find that this outperforms the state of the art, or is on a par with it, while using very little overhead, as hypervector operations are extremely fast and efficient in comparison to the neural networks. This consensus process can learn online and even grow or lose models in real time. Hypervectors act as memories that can be stored, and even further bundled together over time, affording lifelong learning capabilities. Additionally, this consensus structure inherits the benefits of Hyperdimensional Computing, without sacrificing the performance of modern Machine Learning. This technique can be extrapolated to virtually any neural model and requires little modification to employ -- one simply records the output signals of networks when presented with a testing example.
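    A minimal sketch of the core operations, assuming a sign-of-random-projection encoder (one common hypervector encoding; the paper's exact encoder may differ): output signals become binary hypervectors, and consensus summation is an elementwise majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                       # hypervector dimensionality

def encode(signal, projection):
    """Binarize a network's pre-classification output signal into a {-1,+1}
    hypervector via a fixed random projection (illustrative encoding)."""
    return np.where(projection @ signal >= 0, 1, -1).astype(np.int8)

def bundle(hvs):
    """Consensus summation: elementwise majority vote over hypervectors."""
    return np.where(np.sum(hvs, axis=0) >= 0, 1, -1).astype(np.int8)

# Two hypothetical networks' 10-d output signals for examples of one class.
proj_a, proj_b = rng.normal(size=(D, 10)), rng.normal(size=(D, 10))
signals = rng.normal(size=(5, 10))
class_hv = bundle([encode(s, p) for p in (proj_a, proj_b) for s in signals])

# Classify a noisy query by bit similarity to the stored class hypervector.
query = encode(signals[0] + 0.1 * rng.normal(size=10), proj_a)
print((query == class_hv).mean())                # fraction of matching bits
```

    Bundling more hypervectors into `class_hv` over time, or dropping a model's contributions, only involves integer addition, which is where the online and lifelong learning properties come from.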
    Comparing interpretation methods in mental state decoding analyses with deep learning models. (arXiv:2205.15581v1 [q-bio.NC])
    Deep learning (DL) methods find increasing application in mental state decoding, where researchers seek to understand the mapping between mental states (such as accepting or rejecting a gamble) and brain activity, by identifying those brain regions (and networks) whose activity allows these states to be accurately identified (i.e., decoded). Once DL models have been trained to accurately decode a set of mental states, neuroimaging researchers often make use of interpretation methods from explainable artificial intelligence research to understand their learned mappings between mental states and brain activity. Here, we compare the explanations of prominent interpretation methods for the mental state decoding decisions of DL models trained on three functional Magnetic Resonance Imaging (fMRI) datasets. We find that interpretation methods that capture the model's decision process well, by producing faithful explanations, generally produce explanations that are less in line with the results of standard analyses of the fMRI data than the explanations of interpretation methods with less explanation faithfulness. Specifically, we find that interpretation methods that focus on how sensitively a model's decoding decision changes with the values of the input produce explanations that better match the results of a standard general linear model analysis of the fMRI data. In contrast, interpretation methods that focus on identifying the specific contribution of an input feature's value to the decoding decision produce overall more faithful explanations that align less well with the results of standard analyses of the fMRI data.
    Knowledge Enhanced Neural Networks for relational domains. (arXiv:2205.15762v1 [cs.LG])
    In the recent past, there has been a growing interest in Neural-Symbolic Integration frameworks, i.e., hybrid systems that integrate connectionist and symbolic approaches to obtain the best of both worlds. In this work we focus on a specific method, KENN (Knowledge Enhanced Neural Networks), a Neural-Symbolic architecture that injects prior logical knowledge into a neural network by adding on its top a residual layer that modifies the initial predictions according to the knowledge. Among the advantages of this strategy is the inclusion of clause weights, learnable parameters that represent the strength of the clauses, meaning that the model can learn the impact of each rule on the final predictions. As a special case, if the training data contradicts a constraint, KENN learns to ignore it, making the system robust to the presence of wrong knowledge. In this paper, we propose an extension of KENN for relational data. One of the main advantages of KENN resides in its scalability, thanks to a flexible treatment of dependencies between the rules obtained by stacking multiple logical layers. We show experimentally the efficacy of this strategy. The results show that KENN is capable of increasing the performance of the underlying neural network, obtaining better or comparable accuracy with respect to two other related methods that combine learning with logic, while requiring significantly less time for learning.
    VQ-AR: Vector Quantized Autoregressive Probabilistic Time Series Forecasting. (arXiv:2205.15894v1 [cs.LG])
    Time series models aim for accurate predictions of the future given the past, where the forecasts are used for important downstream tasks like business decision making. In practice, deep learning based time series models come in many forms, but at a high level learn some continuous representation of the past and use it to output point or probabilistic forecasts. In this paper, we introduce a novel autoregressive architecture, VQ-AR, which instead learns a \emph{discrete} set of representations that are used to predict the future. Extensive empirical comparison with other competitive deep learning models shows that surprisingly such a discrete set of representations gives state-of-the-art or equivalent results on a wide variety of time series datasets. We also highlight the shortcomings of this approach, explore its zero-shot generalization capabilities, and present an ablation study on the number of representations. The full source code of the method will be available at the time of publication with the hope that researchers can further investigate this important but overlooked inductive bias for the time series domain.
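    The discrete representation at the heart of such models is typically a vector-quantization layer; the following generic VQ-VAE-style sketch (nearest codebook entry plus a straight-through gradient) illustrates the inductive bias, though it is not the exact VQ-AR implementation.

```python
import torch

def vector_quantize(z, codebook):
    """Minimal VQ step: snap each continuous representation to its nearest
    codebook entry, keeping gradients via the straight-through estimator."""
    dists = torch.cdist(z, codebook)              # (batch, K) pairwise distances
    idx = dists.argmin(dim=1)                     # discrete code assignment
    q = codebook[idx]                             # quantized vectors
    q_st = z + (q - z).detach()                   # straight-through gradient
    commit = ((q.detach() - z) ** 2).mean()       # commitment loss term
    return q_st, idx, commit

codebook = torch.randn(512, 64, requires_grad=True)   # 512 discrete codes
z = torch.randn(32, 64, requires_grad=True)           # encoder outputs
q, idx, commit = vector_quantize(z, codebook)
print(q.shape, idx.shape, commit.item())
```

    An autoregressive model over the integer indices `idx`, rather than over the continuous `z`, is the "discrete set of representations" inductive bias the abstract refers to.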
    Static Scheduling with Predictions Learned through Efficient Exploration. (arXiv:2205.15695v1 [cs.LG])
    A popular approach to go beyond the worst-case analysis of online algorithms is to assume the existence of predictions that can be leveraged to improve performances. Those predictions are usually given by some external sources that cannot be fully trusted. Instead, we argue that trustful predictions can be built by algorithms, while they run. We investigate this idea in the illustrative context of static scheduling with exponential job sizes. Indeed, we prove that algorithms agnostic to this structure do not perform better than in the worst case. In contrast, when the expected job sizes are known, we show that the best algorithm using this information, called Follow-The-Perfect-Prediction (FTPP), exhibits much better performances. Then, we introduce two adaptive explore-then-commit types of algorithms: they both first (partially) learn expected job sizes and then follow FTPP once their self-predictions are confident enough. On the one hand, ETCU explores in "series", by completing jobs sequentially to acquire information. On the other hand, ETCRR, inspired by the optimal worst-case algorithm Round-Robin (RR), explores efficiently in "parallel". We prove that both of them asymptotically reach the performances of FTPP, with a faster rate for ETCRR. Those findings are empirically evaluated on synthetic data.
    Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks. (arXiv:2205.15619v1 [cs.LG])
    Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs with only a few data points. The main challenge is how to avoid overfitting, since over-parameterized NNs can easily overfit to such small datasets. Previous work (e.g., MAML by Finn et al., 2017) tackles this challenge by meta-learning, which learns how to learn from a few data points by using various tasks. On the other hand, one conventional approach to avoid overfitting is to restrict hypothesis spaces by endowing sparse NN structures, such as convolution layers in computer vision. However, although such manually designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. The following questions then naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits will this bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, to find optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validate that Meta-ticket successfully discovers sparse subnetworks that can learn specialized features for each given task. Due to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared to MAML-based methods, especially with large NNs.
    Optimizing Intermediate Representations of Generative Models for Phase Retrieval. (arXiv:2205.15617v1 [cs.LG])
    Phase retrieval is the problem of reconstructing images from magnitude-only measurements. In many real-world applications the problem is underdetermined. When training data is available, generative models are a new idea to constrain the solution set. However, not all possible solutions are within the range of the generator. Instead, they are represented with some error. To reduce this representation error in the context of phase retrieval, we first leverage a novel variation of intermediate layer optimization (ILO) to extend the range of the generator while still producing images consistent with the training data. Second, we introduce new initialization schemes that further improve the quality of the reconstruction. With extensive experiments on Fourier and Gaussian phase retrieval problems and thorough ablation studies, we can show the benefits of our modified ILO and the new initialization schemes.
    Exact Feature Collisions in Neural Networks. (arXiv:2205.15763v1 [cs.LG])
    Predictions made by deep neural networks have been shown to be highly sensitive to small changes made in the input space, where such maliciously crafted data points containing small perturbations are referred to as adversarial examples. On the other hand, recent research suggests that the same networks can also be extremely insensitive to changes of large magnitude, where predictions of two largely different data points can be mapped to approximately the same output. In such cases, features of the two data points are said to approximately collide, thus leading to the largely similar predictions. Our results improve and extend the work of Li et al. (2019), laying out theoretical grounds for data points that have colliding features from the perspective of the weights of neural networks, revealing that neural networks not only suffer from features that approximately collide but also from features that exactly collide. We identify the necessary conditions for the existence of such scenarios and investigate a large number of DNNs that have been used to solve various computer vision problems. Furthermore, we propose Null-space search, a numerical approach that does not rely on heuristics, to create data points with colliding features for any input and for any task, including, but not limited to, classification, localization, and segmentation.
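    The role of the null space can be illustrated on a single linear layer: any feature perturbation lying in the null space of the final weight matrix changes the features arbitrarily while leaving the logits exactly unchanged. The sketch below demonstrates this principle with random weights; the paper's Null-space search extends the idea to full networks and input-space collisions.

```python
import numpy as np
from scipy.linalg import null_space

# Toy illustration of an exact feature collision: W maps 512-d penultimate
# features to 10 logits, so its null space is 502-dimensional.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 512))            # hypothetical final linear layer
N = null_space(W)                         # orthonormal basis of the null space

f = rng.normal(size=512)                  # features of some input
f_collide = f + 5.0 * N[:, 0]             # large move along a null direction

print(np.allclose(W @ f, W @ f_collide))  # True: logits exactly identical
print(np.linalg.norm(f - f_collide))      # yet the features differ substantially
```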
    Machine learning a manifold. (arXiv:2112.07673v2 [hep-ph] UPDATED)
    We propose a simple method to identify a continuous Lie algebra symmetry in a dataset through regression by an artificial neural network. Our proposal takes advantage of the $ \mathcal{O}(\epsilon^2)$ scaling of the output variable under infinitesimal symmetry transformations on the input variables. As symmetry transformations are generated post-training, the methodology does not rely on sampling of the full representation space or binning of the dataset, and the possibility of false identification is minimised. We demonstrate our method in the SU(3)-symmetric (non-) linear $\Sigma$ model.
    Provable General Function Class Representation Learning in Multitask Bandits and MDPs. (arXiv:2205.15701v1 [cs.LG])
    While multitask representation learning has become a popular approach in reinforcement learning (RL) to boost sample efficiency, the theoretical understanding of why and how it works is still limited. Most previous analytical works could only assume that the representation function is already known to the agent or comes from a linear function class, since analyzing general function class representations encounters non-trivial technical obstacles such as generalization guarantees, the formulation of confidence bounds in abstract function spaces, etc. However, linear-case analysis relies heavily on the particularity of the linear function class, while real-world practice usually adopts general non-linear representation functions such as neural networks, which significantly reduces its applicability. In this work, we extend the analysis to general function class representations. Specifically, we consider an agent playing $M$ contextual bandits (or MDPs) concurrently and extracting a shared representation function $\phi$ from a specific function class $\Phi$ using our proposed Generalized Functional Upper Confidence Bound algorithm (GFUCB). We theoretically validate the benefit of multitask representation learning within general function classes for bandits and linear MDPs for the first time. Lastly, we conduct experiments to demonstrate the effectiveness of our algorithm with neural net representations.
    MACE: An Efficient Model-Agnostic Framework for Counterfactual Explanation. (arXiv:2205.15540v1 [cs.AI])
    Counterfactual explanation is an important Explainable AI technique to explain machine learning predictions. Despite being studied actively, existing optimization-based methods often assume that the underlying machine-learning model is differentiable and treat categorical attributes as continuous ones, which restricts their real-world applications when categorical attributes have many different values or the model is non-differentiable. To make counterfactual explanation suitable for real-world applications, we propose a novel framework of Model-Agnostic Counterfactual Explanation (MACE), which adopts a newly designed pipeline that can efficiently handle non-differentiable machine-learning models with a large number of feature values. In our MACE approach, we propose a novel RL-based method for finding good counterfactual examples and a gradient-less descent method for improving proximity. Experiments on public datasets validate its effectiveness, showing better validity, sparsity and proximity.
    Generalised Implicit Neural Representations. (arXiv:2205.15674v1 [cs.LG])
    We consider the problem of learning implicit neural representations (INRs) for signals on non-Euclidean domains. In the Euclidean case, INRs are trained on a discrete sampling of a signal over a regular lattice. Here, we assume that the continuous signal exists on some unknown topological space from which we sample a discrete graph. In the absence of a coordinate system to identify the sampled nodes, we propose approximating their location with a spectral embedding of the graph. This allows us to train INRs without knowing the underlying continuous domain, which is the case for most graph signals in nature, while also making the INRs equivariant under the symmetry group of the domain. We show experiments with our method on various real-world signals on non-Euclidean domains.
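    As a concrete illustration of the coordinate-free setup, the sketch below computes a spectral embedding from the eigenvectors of the normalized graph Laplacian, which can then stand in for node coordinates when fitting an INR; the graph, sizes, and embedding dimension are all illustrative choices.

```python
import numpy as np
import networkx as nx

def spectral_embedding(G, k=8):
    """Approximate node 'coordinates' with the first k non-trivial eigenvectors
    of the normalized graph Laplacian (a generic sketch of the idea)."""
    L = nx.normalized_laplacian_matrix(G).toarray()
    eigvals, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                    # skip the trivial eigenvector

G = nx.random_geometric_graph(200, 0.2, seed=0)   # discrete sample of a domain
coords = spectral_embedding(G, k=8)
print(coords.shape)                               # (200, 8): per-node embedding
# An INR would now be trained to map coords[i] -> signal value at node i.
```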
    itKD: Interchange Transfer-based Knowledge Distillation for 3D Object Detection. (arXiv:2205.15531v1 [cs.CV])
    Recently, point-cloud based 3D object detectors have achieved remarkable progress. However, most studies are limited to the development of deep learning architectures for improving only their accuracy. In this paper, we propose an autoencoder-style framework comprising channel-wise compression and decompression via interchange transfer for knowledge distillation. To learn the map-view feature of a teacher network, the features from a teacher and student network are independently passed through the shared autoencoder; here, we use a compressed representation loss that binds the channel-wise compression knowledge from both networks as a kind of regularization. The decompressed features are transferred in opposite directions to reduce the gap in the interchange reconstructions. Lastly, we present an attentive head loss for matching the pivotal detection information drawn by the multi-head self-attention mechanism. Through extensive experiments, we verify that our method can learn a lightweight model that is well aligned with the 3D point cloud detection task, and we demonstrate its superiority using the well-known public datasets Waymo and nuScenes.
    StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis. (arXiv:2205.15439v1 [eess.AS])
    Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have problems finding the best monotonic alignments that are crucial for naturalistic speech synthesis. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. With novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation schemes, our method significantly outperforms state-of-the-art models on both single and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity. Through self-supervised learning of the speaking styles, our model can synthesize speech with the same prosodic and emotional tone as any given reference speech without the need for explicitly labeling these categories.
    Graph-level Neural Networks: Current Progress and Future Directions. (arXiv:2205.15555v1 [cs.LG])
    Graph-structured data consisting of objects (i.e., nodes) and relationships among objects (i.e., edges) are ubiquitous. Graph-level learning is a matter of studying a collection of graphs instead of a single graph. Traditional graph-level learning methods used to be the mainstream. However, with the increasing scale and complexity of graphs, Graph-level Neural Networks (GLNNs, deep learning-based graph-level learning methods) have been attractive due to their superiority in modeling high-dimensional data. Thus, a survey on GLNNs is necessary. To frame this survey, we propose a systematic taxonomy covering GLNNs upon deep neural networks, graph neural networks, and graph pooling. The representative and state-of-the-art models in each category are focused on this survey. We also investigate the reproducibility, benchmarks, and new graph datasets of GLNNs. Finally, we conclude future directions to further push forward GLNNs. The repository of this survey is available at https://github.com/GeZhangMQ/Awesome-Graph-level-Neural-Networks.
    Augmentation-Aware Self-Supervision for Data-Efficient GAN Training. (arXiv:2205.15677v1 [cs.LG])
    Training generative adversarial networks (GANs) with limited data is valuable but challenging because discriminators are prone to over-fitting in such situations. Recently proposed differentiable data augmentation techniques for discriminators demonstrate improved data efficiency of training GANs. However, naive data augmentation introduces undesired invariance to augmentation into the discriminator. The invariance may degrade the representation learning ability of the discriminator, thereby affecting the generative modeling performance of the generator. To mitigate the invariance while inheriting the benefits of data augmentation, we propose a novel augmentation-aware self-supervised discriminator that predicts the parameter of augmentation given the augmented and original data. Moreover, the prediction task is required to distinguish between real and generated data, since they differ during training. We further encourage the generator to learn from the proposed discriminator by generating augmentation-predictable real data. We compare the proposed method with state-of-the-art methods across the class-conditional BigGAN and unconditional StyleGAN2 architectures on CIFAR-10/100 and several low-shot datasets, respectively. Experimental results show a significantly improved generation performance of our method over competing methods for training data-efficient GANs.
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v1 [cs.LG])
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O}\left(\min\left(H^{3/2}/N, H/\sqrt{N}\right)\right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
    Certifying Some Distributional Fairness with Subpopulation Decomposition. (arXiv:2205.15494v1 [cs.LG])
    Extensive efforts have been made to understand and improve the fairness of machine learning models based on observational metrics, especially in high-stakes domains such as medical insurance, education, and hiring decisions. However, there is a lack of certified fairness considering the end-to-end performance of an ML model. In this paper, we first formulate the certified fairness of an ML model trained on a given data distribution as an optimization problem based on the model performance loss bound on a fairness constrained distribution, which is within bounded distributional distance with the training distribution. We then propose a general fairness certification framework and instantiate it for both sensitive shifting and general shifting scenarios. In particular, we propose to solve the optimization problem by decomposing the original data distribution into analytical subpopulations and proving the convexity of the subproblems to solve them. We evaluate our certified fairness on six real-world datasets and show that our certification is tight in the sensitive shifting scenario and provides non-trivial certification under general shifting. Our framework is flexible to integrate additional non-skewness constraints and we show that it provides even tighter certification under different real-world scenarios. We also compare our certified fairness bound with adapted existing distributional robustness bounds on Gaussian data and demonstrate that our method is significantly tighter.
    FBM: Fast-Bit Allocation for Mixed-Precision Quantization. (arXiv:2205.15437v1 [cs.LG])
    Quantized neural networks are well known for reducing latency, power consumption, and model size without significant degradation in accuracy, making them highly applicable for systems with limited resources and low power requirements. Mixed precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths. Existing mixed-precision schemes rely on a large exploration space, resulting in a large carbon footprint. In addition, these bit allocation strategies mostly induce constraints on the model size rather than utilizing the performance of neural network deployment on specific hardware. Our work proposes Fast-Bit Allocation for Mixed-Precision Quantization (FBM), which finds an optimal bitwidth allocation by measuring desired behaviors through a simulation of a specific device, or even on a physical one. While dynamic transitions of bit allocation in mixed precision quantization with ultra-low bitwidth are known to suffer from performance degradation, we present a fast recovery solution from such transitions. A comprehensive evaluation of the proposed method on CIFAR-10 and ImageNet demonstrates our method's superiority over current state-of-the-art schemes in terms of the trade-off between neural network accuracy and hardware efficiency. Our source code, experimental settings and quantized models are available at https://github.com/RamorayDrake/FBM/
    Connecting adversarial attacks and optimal transport for domain adaptation. (arXiv:2205.15424v1 [cs.LG])
    We present a novel algorithm for domain adaptation using optimal transport. In domain adaptation, the goal is to adapt a classifier trained on the source domain samples to the target domain. In our method, we use optimal transport to map target samples to the domain named source fiction. This domain differs from the source but is accurately classified by the source domain classifier. Our main idea is to generate a source fiction by c-cyclically monotone transformation over the target domain. If samples with the same labels in two domains are c-cyclically monotone, the optimal transport map between these domains preserves the class-wise structure, which is the main goal of domain adaptation. To generate a source fiction domain, we propose an algorithm that is based on our finding that adversarial attacks are a c-cyclically monotone transformation of the dataset. We conduct experiments on Digits and Modern Office-31 datasets and achieve improvement in performance for simple discrete optimal transport solvers for all adaptation tasks.
    Private Federated Submodel Learning with Sparsification. (arXiv:2205.15992v1 [cs.IT])
    We investigate the problem of private read update write (PRUW) in federated submodel learning (FSL) with sparsification. In FSL, a machine learning model is divided into multiple submodels, where each user updates only the submodel that is relevant to the user's local data. PRUW is the process of privately performing FSL by reading from and writing to the required submodel without revealing the submodel index, or the values of updates to the databases. Sparsification is a widely used concept in learning, where the users update only a small fraction of parameters to reduce the communication cost. Revealing the coordinates of these selected (sparse) updates leaks privacy of the user. We show how PRUW in FSL can be performed with sparsification. We propose a novel scheme which privately reads from and writes to arbitrary parameters of any given submodel, without revealing the submodel index, values of updates, or the coordinates of the sparse updates, to databases. The proposed scheme achieves significantly lower reading and writing costs compared to what is achieved without sparsification.
    A Topological Perspective on Causal Inference. (arXiv:2107.08558v3 [cs.AI] UPDATED)
    This paper presents a topological learning-theoretic perspective on causal inference by introducing a series of topologies defined on general spaces of structural causal models (SCMs). As an illustration of the framework we prove a topological causal hierarchy theorem, showing that substantive assumption-free causal inference is possible only in a meager set of SCMs. Thanks to a known correspondence between open sets in the weak topology and statistically verifiable hypotheses, our results show that inductive assumptions sufficient to license valid causal inferences are statistically unverifiable in principle. Similar to no-free-lunch theorems for statistical inference, the present results clarify the inevitability of substantial assumptions for causal inference. An additional benefit of our topological approach is that it easily accommodates SCMs with infinitely many variables. We finally suggest that the framework may be helpful for the positive project of exploring and assessing alternative causal-inductive assumptions.
    Grid HTM: Hierarchical Temporal Memory for Anomaly Detection in Videos. (arXiv:2205.15407v1 [cs.CV])
    Interest in video anomaly detection systems has grown over the past few years. Current approaches use deep learning to perform anomaly detection in videos, but this approach has multiple problems. For starters, deep learning in general has issues with noise, concept drift, explainability, and training data volumes. Additionally, anomaly detection in itself is a complex task and faces challenges such as unknownness, heterogeneity, and class imbalance. Anomaly detection using deep learning is therefore mainly constrained to generative models such as generative adversarial networks and autoencoders due to their unsupervised nature, but even they suffer from general deep learning issues and are hard to train properly. In this paper, we explore the capabilities of the Hierarchical Temporal Memory (HTM) algorithm to perform anomaly detection in videos, as it has favorable properties such as noise tolerance and online learning, which combats concept drift. We introduce a novel version of HTM, namely Grid HTM, an HTM-based architecture specifically for anomaly detection in complex videos such as surveillance footage.
    GLDQN: Explicitly Parameterized Quantile Reinforcement Learning for Waste Reduction. (arXiv:2205.15455v1 [cs.LG])
    We study the problem of restocking a grocery store's inventory with perishable items over time, from a distributional point of view. The objective is to maximize sales while minimizing waste, with uncertainty about the actual consumption by customers. This problem is of high relevance today, given the growing demand for food and the impact of food waste on the environment, the economy, and purchasing power. We frame inventory restocking as a new reinforcement learning task that exhibits stochastic behavior conditioned on the agent's actions, making the environment partially observable. We introduce a new reinforcement learning environment based on real grocery store data and expert knowledge. This environment is highly stochastic and presents a unique challenge for reinforcement learning practitioners. We show that uncertainty about the future behavior of the environment is not handled well by classical supply chain algorithms, and that distributional approaches are a good way to account for the uncertainty. We also present GLDQN, a new distributional reinforcement learning algorithm that learns a generalized lambda distribution over the reward space. We show that GLDQN outperforms other distributional reinforcement learning approaches in our partially observable environments, in both overall reward and generated waste.
    Associative Learning Mechanism for Drug-Target Interaction Prediction. (arXiv:2205.15364v1 [q-bio.BM])
    As a necessary process in drug development, finding a drug compound that can selectively bind to a specific protein is highly challenging and costly. Drug-target affinity (DTA), which represents the strength of drug-target interaction (DTI), has played an important role in the DTI prediction task over the past decade. Although deep learning has been applied to DTA-related research, existing solutions ignore fundamental correlations between molecular substructures in molecular representation learning of drug compound molecules/protein targets. Moreover, traditional methods lack interpretability of the DTA prediction process. This results in missing feature information on intermolecular interactions, thereby affecting prediction performance. Therefore, this paper proposes a DTA prediction method with interactive learning and an autoencoder mechanism. The proposed model enhances the ability to capture feature information from a single molecular sequence via the drug/protein molecular representation learning module, and supplements the information interaction between molecular sequence pairs via the interactive information learning module. The DTA value prediction module fuses the drug-target pair interaction information to output the predicted value of DTA. Additionally, this paper theoretically proves that the proposed method maximizes the evidence lower bound (ELBO) for the joint distribution of the DTA prediction model, which enhances the consistency of the probability distribution between the actual value and the predicted value. The experimental results confirm that mutual transformer drug-target affinity (MT-DTA) achieves better performance than other comparative methods.
    Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity. (arXiv:2205.15466v1 [cs.LG])
    This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originated from cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive the lower bound sample complexity for Banzhaf value approximation, and we show that our MSR algorithm's sample complexity nearly matches the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
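    The MSR principle is simple enough to sketch: sample subsets by including each datum independently with probability 1/2, then reuse every utility evaluation for all n value estimates. The toy utility below is our own illustrative choice, not one of the paper's benchmarks.

```python
import numpy as np

def banzhaf_msr(utility, n, num_samples=2000, seed=0):
    """Maximum Sample Reuse (MSR) estimator of Banzhaf values: each sampled
    subset's utility contributes to the estimate of every datum i, split by
    whether i is in the subset."""
    rng = np.random.default_rng(seed)
    masks = rng.random((num_samples, n)) < 0.5          # subsets S of [n]
    utils = np.array([utility(m) for m in masks])
    values = np.empty(n)
    for i in range(n):
        in_i, out_i = utils[masks[:, i]], utils[~masks[:, i]]
        values[i] = in_i.mean() - out_i.mean()          # avg marginal contribution
    return values

# Toy utility: model performance is a weighted count of useful points.
weights = np.array([0.5, 0.1, 0.0, 0.3, 0.1])
values = banzhaf_msr(lambda mask: float(weights @ mask), n=5)
print(np.round(values, 2))   # recovers the relative usefulness of each point
```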
    Truly Deterministic Policy Optimization. (arXiv:2205.15379v1 [cs.AI])
    In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which it is possible to establish a monotonic policy improvement guarantee, propose a surrogate function for policy gradient estimation, and show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments -- one with non-local rewards in the frequency domain and the other with a long horizon (8000 time-steps) -- for which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3). Our implementation with all the experimental settings is available at https://github.com/ehsansaleh/code_tdpo
    Learning Risk-Averse Equilibria in Multi-Agent Systems. (arXiv:2205.15434v1 [cs.LG])
    In multi-agent systems, intelligent agents are tasked with making decisions that have optimal outcomes when the actions of the other agents are as expected, whilst also being prepared for unexpected behaviour. In this work, we introduce a new risk-averse solution concept that allows the learner to accommodate unexpected actions by finding the minimum variance strategy given any level of expected return. We prove the existence of such a risk-averse equilibrium, and propose one fictitious-play type learning algorithm for smaller games that enjoys provable convergence guarantees in certain games classes (e.g., zero-sum or potential). Furthermore, we propose an approximation method for larger games based on iterative population-based training that generates a population of risk-averse agents. Empirically, our equilibrium is shown to be able to reduce the reward variance, specifically in the sense that off-equilibrium behaviour has a far smaller impact on our risk-averse agents in comparison to playing other equilibrium solutions. Importantly, we show that our population of agents that approximate a risk-averse equilibrium is particularly effective in the presence of unseen opposing populations, especially in the case of guaranteeing a minimal level of performance which is critical to safety-aware multi-agent systems.
    Improvements to Supervised EM Learning of Shared Kernel Models by Feature Space Partitioning. (arXiv:2205.15304v1 [cs.LG])
    Expectation maximisation (EM) is usually thought of as an unsupervised learning method for estimating the parameters of a mixture distribution, however it can also be used for supervised learning when class labels are available. As such, EM has been applied to train neural nets including the probabilistic radial basis function (PRBF) network or shared kernel (SK) model. This paper addresses two major shortcomings of previous work in this area: the lack of rigour in the derivation of the EM training algorithm; and the computational complexity of the technique, which has limited it to low dimensional data sets. We first present a detailed derivation of EM for the Gaussian shared kernel model PRBF classifier, making use of data association theory to obtain the complete data likelihood, Baum's auxiliary function (the E-step) and its subsequent maximisation (M-step). To reduce the complexity of the resulting SKEM algorithm, we partition the feature space into $R$ non-overlapping subsets of variables. The resulting product decomposition of the joint data likelihood, which is exact when the feature partitions are independent, allows the SKEM to be implemented in parallel and at $R^2$ times lower complexity. The operation of the partitioned SKEM algorithm is demonstrated on the MNIST data set and compared with its non-partitioned counterpart. We find that improved performance at reduced complexity is achievable. Comparisons with standard classification algorithms are provided on a number of other benchmark data sets.
    Parameter-Efficient and Student-Friendly Knowledge Distillation. (arXiv:2205.15308v1 [cs.LG])
    Knowledge distillation (KD) has been extensively employed to transfer the knowledge from a large teacher model to smaller students, where the parameters of the teacher are fixed (or partially fixed) during training. Recent studies show that this mode may cause difficulties in knowledge transfer due to the mismatched model capacities. To alleviate the mismatch problem, teacher-student joint training methods, e.g., online distillation, have been proposed, but they incur high computational cost. In this paper, we present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer by updating relatively few partial parameters. Technically, we first mathematically formulate the mismatch as the sharpness gap between their predictive distributions, where we show such a gap can be narrowed with the appropriate smoothness of the soft label. Then, we introduce an adapter module for the teacher and only update the adapter to obtain soft labels with appropriate smoothness. Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods. Code will be released upon acceptance.
    Attention Flows for General Transformers. (arXiv:2205.15389v1 [cs.LG])
    In this paper, we study the computation of how much an input token in a Transformer model influences its prediction. We formalize a method to construct a flow network out of the attention values of encoder-only Transformer models and extend it to general Transformer architectures including an auto-regressive decoder. We show that running a maxflow algorithm on the flow network construction yields Shapley values, which determine the impact of a player in cooperative game theory. By interpreting the input tokens in the flow network as players, we can compute their influence on the total attention flow leading to the decoder's decision. Additionally, we provide a library that computes and visualizes the attention flow of arbitrary Transformer models. We show the usefulness of our implementation on various models trained on natural language processing and reasoning tasks.
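    A simplified version of the flow-network construction can be sketched with networkx: stack per-layer attention matrices into a layered graph whose edge capacities are attention weights, then run max-flow from an input token to an output position. Residual connections and head averaging in real Transformers are omitted here, and the attention values are made up for illustration.

```python
import networkx as nx

def attention_flow(attentions, source_token, n_tokens):
    """Build a layered flow network from per-layer attention matrices and
    compute the max-flow from one input token to the final-layer position 0
    (a CLS-like output). A simplified sketch of the construction."""
    G = nx.DiGraph()
    for layer, A in enumerate(attentions):        # A[i][j]: attention of i on j,
        for i in range(n_tokens):                 # so information flows j -> i
            for j in range(n_tokens):
                G.add_edge((layer, j), (layer + 1, i), capacity=A[i][j])
    flow_value, _ = nx.maximum_flow(G, (0, source_token), (len(attentions), 0))
    return flow_value

# Two hypothetical attention layers over 3 tokens (rows sum to 1).
A1 = [[0.6, 0.2, 0.2], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
A2 = [[0.5, 0.4, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
for t in range(3):
    print(t, attention_flow([A1, A2], t, 3))      # per-token influence scores
```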
    Payday loans -- blessing or growth suppressor? Machine Learning Analysis. (arXiv:2205.15320v1 [econ.GN])
    The upsurge of real estate is influenced by a variety of factors across many domains. One under-recognized sector that affects the economy, and for which regulatory proposals are being drafted to keep it in check, is payday lending. This research paper revolves around the impact of payday loans on the real estate market. It draws on first-hand experience in constructing an index for the concentration of real estate in a reference area by virtue of payday loans, in Toronto, Ontario in particular, and sets out an approach to create, evaluate and demonstrate the scenario through research analysis. The rationale behind this indexing via payday loans is the basic debt-to-income ratio: when the income of a person bound to pay payday-loan interest increases, their debt goes down marginally, which suggests that the person invests in fixed assets such as real estate, driving up its growth.
    PAC Generalization via Invariant Representations. (arXiv:2205.15196v2 [cs.LG] UPDATED)
    One method for obtaining generalizable solutions to machine learning tasks when presented with diverse training environments is to find invariant representations of the data. These are representations of the covariates such that the best model on top of the representation is invariant across training environments. In the context of linear Structural Equation Models (SEMs), invariant representations might allow us to learn models with out-of-distribution guarantees, i.e., models that are robust to interventions in the SEM. To address the invariant representation problem in a finite sample setting, we consider the notion of $\epsilon$-approximate invariance. We study the following question: If a representation is approximately invariant with respect to a given number of training interventions, will it continue to be approximately invariant on a larger collection of unseen SEMs? This larger collection of SEMs is generated through a parameterized family of interventions. Inspired by PAC learning, we obtain finite-sample out-of-distribution generalization guarantees for approximate invariance that hold probabilistically over a family of linear SEMs without faithfulness assumptions. Our results show bounds that do not scale in ambient dimension when intervention sites are restricted to lie in a constant size subset of in-degree bounded nodes. We also show how to extend our results to a linear indirect observation model that incorporates latent variables.
    Neural Galerkin Scheme with Active Learning for High-Dimensional Evolution Equations. (arXiv:2203.01360v3 [math.NA] UPDATED)
    Deep neural networks have been shown to provide accurate function approximations in high dimensions. However, fitting network parameters requires training data that may not be available beforehand, which is particularly challenging in science and engineering applications where often it is even unclear how to collect new informative training data in the first place. This work proposes Neural Galerkin schemes based on deep learning that generate training data samples with active learning for numerically solving high-dimensional partial differential equations. Neural Galerkin schemes train networks by minimizing the residual sequentially over time, which enables adaptively collecting new training data in a self-informed manner guided by the dynamics described by the partial differential equations. This is in stark contrast to many other machine learning methods that aim to fit network parameters globally in time without taking training data acquisition into account. Our finding is that the active form of gathering training data of the proposed Neural Galerkin schemes is key for numerically realizing the expressive power of networks in high dimensions. Numerical experiments demonstrate that Neural Galerkin schemes have the potential to enable simulating phenomena and processes with many variables for which traditional and other deep-learning-based solvers fail, especially when features of the solutions evolve locally, such as in high-dimensional wave propagation problems and interacting particle systems described by Fokker-Planck and kinetic equations.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v1 [cs.LG])
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts ("the bottleneck") and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require concept labels in the training data to learn the bottleneck and do not leverage strong pretrained models. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address the limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining interpretability benefits. When concept annotation is not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts. PCBM also enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new (potentially different) data. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using any data from the target domain or model retraining.
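    A minimal post-hoc bottleneck can be sketched as follows, under the assumption that one direction per concept is available (e.g., CAV-style vectors learned from probe data): project frozen backbone embeddings onto the concept directions, then fit a sparse linear model on the concept scores alone. Everything below is synthetic and illustrative, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical setup: 512-d embeddings from a frozen pretrained backbone,
# and one direction per named concept.
embeddings = rng.normal(size=(1000, 512))
concept_vectors = rng.normal(size=(20, 512))           # 20 concept directions
labels = (embeddings @ concept_vectors[0] > 0).astype(int)

# Post-hoc bottleneck: concept scores replace raw features, and a sparse
# interpretable linear model is fit on top of them.
concept_scores = embeddings @ concept_vectors.T        # (1000, 20)
clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(
    concept_scores, labels)
print(np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6))     # which concepts matter
```

    Because the final predictor only sees the concept scores, inspecting or editing its coefficients directly corresponds to the concept-level feedback described in the abstract.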
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v2 [cs.LG] UPDATED)
    With the increasingly broad deployment of Federated Learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e., reasonably satisfactory performance for each of the numerous diverse clients. Motivated by its great success in wireless networks, in this work, we introduce and study Proportional Fairness (PF) in FL. By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as a Nash bargaining solution. Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for finding fair solutions in FL, and prove its convergence. Through extensive experiments on a wide array of vision and language datasets, we demonstrate that PropFair consistently achieves a noticeable improvement of the worst 10% accuracy over state-of-the-art fair FL algorithms, while maintaining competitive overall performance.
    Static Scheduling with Predictions Learned through Efficient Exploration. (arXiv:2205.15695v1 [cs.LG])
    A popular approach to go beyond the worst-case analysis of online algorithms is to assume the existence of predictions that can be leveraged to improve performance. Those predictions are usually given by some external sources that cannot be fully trusted. Instead, we argue that trustworthy predictions can be built by the algorithms themselves while they run. We investigate this idea in the illustrative context of static scheduling with exponential job sizes. Indeed, we prove that algorithms agnostic to this structure do not perform better than in the worst case. In contrast, when the expected job sizes are known, we show that the best algorithm using this information, called Follow-The-Perfect-Prediction (FTPP), exhibits much better performance. Then, we introduce two adaptive explore-then-commit types of algorithms: they both first (partially) learn expected job sizes and then follow FTPP once their self-predictions are confident enough. On the one hand, ETCU explores in "series", by completing jobs sequentially to acquire information. On the other hand, ETCRR, inspired by the optimal worst-case algorithm Round-Robin (RR), explores efficiently in "parallel". We prove that both of them asymptotically reach the performances of FTPP, with a faster rate for ETCRR. Those findings are empirically evaluated on synthetic data.
    Neural Topic Model via Optimal Transport. (arXiv:2008.13537v3 [cs.IR] UPDATED)
    Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have attracted increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, their performance often degrades severely on short documents. The requirement of reparameterisation could also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model via the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distributions. Importantly, the cost matrix of the OT distance models the weights between topics and words, which is constructed by the distances between topics and words in an embedding space. Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms the state-of-the-art NTMs on discovering more coherent and diverse topics and deriving better document representations for both regular and short texts.
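    To make the training signal concrete, the following is a minimal, self-contained sketch of an entropy-regularized OT (Sinkhorn) loss between a document's topic distribution and its word distribution, with the cost matrix built from embedding distances. All shapes and names are illustrative and not taken from the paper's code.
        # Entropy-regularized OT cost between a topic histogram a (K,) and a
        # word histogram b (V,), given a cost matrix M (K, V); toy dimensions.
        import torch

        def sinkhorn_distance(a, b, M, reg=0.1, n_iters=50):
            K = torch.exp(-M / reg)
            u = torch.ones_like(a)
            for _ in range(n_iters):        # Sinkhorn fixed-point iterations
                v = b / (K.t() @ u)
                u = a / (K @ v)
            P = u[:, None] * K * v[None, :]  # transport plan
            return (P * M).sum()

        topic_emb, word_emb = torch.randn(4, 8), torch.randn(10, 8)
        M = torch.cdist(topic_emb, word_emb)            # topic-word costs
        doc_topics = torch.softmax(torch.randn(4), 0)   # model's topic distribution
        doc_words = torch.softmax(torch.randn(10), 0)   # document's word distribution
        loss = sinkhorn_distance(doc_topics, doc_words, M)  # differentiable loss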
    Hedging option books using neural-SDE market models. (arXiv:2205.15991v1 [q-fin.CP])
    We study the capability of arbitrage-free neural-SDE market models to yield effective strategies for hedging options. In particular, we derive sensitivity-based and minimum-variance-based hedging strategies using these models and examine their performance when applied to various option portfolios using real-world data. Through backtesting analysis over typical and stressed market periods, we show that neural-SDE market models achieve lower hedging errors than Black--Scholes delta and delta-vega hedging consistently over time, and are less sensitive to the tenor choice of hedging instruments. In addition, hedging using market models leads to similar performance to hedging using Heston models, while the former tends to be more robust during stressed market periods.
    Intrinsic Dimension Estimation Using Wasserstein Distances. (arXiv:2106.04018v2 [stat.ML] UPDATED)
    It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.
    Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks. (arXiv:2205.15619v1 [cs.LG])
    Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs from only a few data points. The main challenge is how to avoid overfitting since over-parameterized NNs can easily overfit to such small datasets. Previous work (e.g. MAML by Finn et al. 2017) tackles this challenge by meta-learning, which learns how to learn from a few data by using various tasks. On the other hand, one conventional approach to avoid overfitting is restricting hypothesis spaces by imposing sparse NN structures, such as the convolution layers used in computer vision. However, although such manually-designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. Then the following questions naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits will it bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, to find optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validated that Meta-ticket successfully discovers sparse subnetworks that can learn specialized features for each given task. Due to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared to MAML-based methods especially with large NNs.
    Variable importance without impossible data. (arXiv:2205.15750v1 [cs.LG])
    The most popular methods for measuring importance of the variables in a black box prediction algorithm make use of synthetic inputs that combine predictor variables from multiple subjects. These inputs can be unlikely, physically impossible, or even logically impossible. As a result, the predictions for such cases can be based on data very unlike any the black box was trained on. We think that users cannot trust an explanation of the decision of a prediction algorithm when the explanation uses such values. Instead we advocate a method called Cohort Shapley that is grounded in economic game theory and, unlike most other game theoretic methods, uses only actually observed data to quantify variable importance. Cohort Shapley works by narrowing the cohort of subjects judged to be similar to a target subject on one or more features. A feature is important if using it to narrow the cohort makes a large difference to the cohort mean. We illustrate it on an algorithmic fairness problem where it is essential to attribute importance to protected variables that the model was not trained on. For every subject and every predictor variable, we can compute the importance of that predictor to the subject's predicted response or to their actual response. These values can be aggregated, for example over all Black subjects, and we propose a Bayesian bootstrap to quantify uncertainty in both individual and aggregate Shapley values.
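    The cohort-narrowing computation admits a compact sketch. The toy below enumerates feature subsets exactly (feasible only for a handful of features) and uses a similarity rule of our own choosing; the paper's precise similarity definitions and aggregation differ.
        # Toy cohort Shapley: score a feature by how much adding it to the
        # matching criteria moves the mean prediction of the matching cohort.
        from itertools import combinations
        from math import comb
        import numpy as np

        def cohort_mean(X, yhat, target, S, tol=0.5):
            """Mean prediction over subjects similar to `target` on features in S."""
            mask = np.ones(len(X), dtype=bool)
            for j in S:
                mask &= np.abs(X[:, j] - target[j]) <= tol  # "similar on feature j"
            return yhat[mask].mean()

        def cohort_shapley(X, yhat, target):
            d = X.shape[1]
            phi = np.zeros(d)
            for j in range(d):
                others = [k for k in range(d) if k != j]
                for r in range(d):
                    for S in combinations(others, r):
                        w = 1.0 / (d * comb(d - 1, r))      # Shapley weight
                        phi[j] += w * (cohort_mean(X, yhat, target, S + (j,))
                                       - cohort_mean(X, yhat, target, S))
            return phi

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 3))
        yhat = X[:, 0] + 0.1 * rng.normal(size=500)  # predictions of some black box
        print(cohort_shapley(X, yhat, X[0]))         # feature 0 dominates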
    Nonconvex regularization for sparse neural networks. (arXiv:2004.11515v2 [math.OC] UPDATED)
    Convex $\ell_1$ regularization using an infinite dictionary of neurons has been suggested for constructing neural networks with desired approximation guarantees, but can be affected by an arbitrary amount of over-parametrization. This can lead to a loss of sparsity and result in networks with too many active neurons for the given data, in particular if the number of data samples is large. As a remedy, in this paper, a nonconvex regularization method is investigated in the context of shallow ReLU networks: We prove that in contrast to the convex approach, any resulting (locally optimal) network is finite even in the presence of infinite data (i.e., if the data distribution is known and the limiting case of infinite samples is considered). Moreover, we show that approximation guarantees and existing bounds on the network size for finite data are maintained.
    VC Theoretical Explanation of Double Descent. (arXiv:2205.15549v1 [stat.ML])
    There has been growing interest in the generalization performance of large multilayer neural networks that can be trained to achieve zero training error, while generalizing well on test data. This regime is known as 'second descent' and it appears to contradict the conventional view that optimal model complexity should reflect an optimal balance between underfitting and overfitting, a.k.a. the bias-variance trade-off. This paper presents a VC-theoretical analysis of double descent and shows that it can be fully explained by classical VC generalization bounds. We illustrate an application of analytic VC-bounds for modeling double descent for classification problems, using empirical results for several learning methods, such as SVM, Least Squares, and Multilayer Perceptron classifiers. In addition, we discuss several possible reasons for misinterpretation of VC-theoretical results in the machine learning community.
    Robust Projection based Anomaly Extraction (RPE) in Univariate Time-Series. (arXiv:2205.15548v1 [stat.ML])
    This paper presents a novel, closed-form, and data/computation efficient online anomaly detection algorithm for time-series data. The proposed method, dubbed RPE, is a window-based method and, in sharp contrast to the existing window-based methods, it is robust to the presence of anomalies in its window and can distinguish anomalies at the time-stamp level. RPE leverages the linear structure of the trajectory matrix of the time-series and employs a robust projection step which enables the algorithm to handle the presence of multiple arbitrarily large anomalies in its window. A closed-form/non-iterative algorithm for the robust projection step is provided and it is proved that it can identify the corrupted time-stamps. RPE is a strong candidate for applications where large training datasets are not available, which is a common scenario for time-series data. An extensive set of numerical experiments shows that RPE can outperform the existing approaches by a notable margin.
    Learning brain MRI quality control: a multi-factorial generalization problem. (arXiv:2205.15898v1 [stat.ML])
    Due to the growing volume of MRI data, automated quality control (QC) has become essential, especially for larger scale analysis. Several attempts have been made in order to develop reliable and scalable QC pipelines. However, the generalization of these methods to new data independent of those used for learning is a difficult problem because of the biases inherent in MRI data. This work aimed at evaluating the performance of the MRIQC pipeline on various large-scale datasets (ABIDE, N = 1102 and CATI derived datasets, N = 9037) used for both training and evaluation purposes. We focused our analysis on the MRIQC preprocessing steps and tested the pipeline with and without them. We further analyzed the site-wise and study-wise predicted classification probability distributions of the models without preprocessing trained on ABIDE and CATI data. Our main results were that a model using features extracted from MRIQC without preprocessing yielded the best results when trained and evaluated on large multi-center datasets with a heterogeneous population (an improvement of the ROC-AUC score on unseen data of 0.10 for the model trained on a subset of the CATI dataset). We concluded that a model trained with data from a heterogeneous population, such as the CATI dataset, provides the best scores on unseen data. In spite of the performance improvement, the generalization abilities of the models remain questionable when looking at the site-wise/study-wise probability predictions and the optimal classification threshold derived from them.
    Kymatio: Scattering Transforms in Python. (arXiv:1812.11214v3 [cs.LG] UPDATED)
    The wavelet scattering transform is an invariant signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks. All transforms may be executed on a GPU (in addition to CPU), offering a considerable speed-up over CPU implementations. The package also has a small memory footprint, resulting in efficient memory usage. The source code, documentation, and examples are available under a BSD license at https://www.kymat.io/
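    A minimal usage sketch with the package's PyTorch frontend follows; exact module paths and output shapes may vary across Kymatio versions, so treat this as indicative rather than authoritative.
        import torch
        from kymatio.torch import Scattering2D

        scattering = Scattering2D(J=2, shape=(32, 32))  # 2 dyadic scales, 32x32 inputs
        x = torch.randn(16, 1, 32, 32)                  # a batch of images
        Sx = scattering(x)                              # scattering coefficients
        print(Sx.shape)                                 # (batch, channels, n_coeffs, 8, 8)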
    Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game. (arXiv:2205.15512v1 [cs.LG])
    Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, the minimax optimal performance has only been (nearly) achieved for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose two new algorithms, SPEVI+ and SPMVI+, for single-agent MDPs and two-player zero-sum Markov games (MGs), respectively. The proposed algorithms feature carefully crafted data splitting mechanisms and novel variance-reduction pessimistic estimators. Theoretical analysis demonstrates that they are capable of matching the performance lower bounds up to logarithmic factors. As a byproduct, a new performance lower bound is established for MGs, which tightens the existing results. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.
    Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity. (arXiv:2205.15809v1 [stat.ML])
    We study the loss surface of DNNs with $L_{2}$ regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_{\ell}$ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation $Z_{\ell}$ is optimal w.r.t. an attraction/repulsion problem and interpolates between the input and output representations, keeping only as much information from the input as is necessary to construct the activation of the next layer. For positively homogeneous non-linearities, the loss can be further reformulated in terms of the covariances of the hidden representations, which takes the form of a partially convex optimization over a convex cone. This second reformulation allows us to prove a sparsity result for homogeneous DNNs: any local minimum of the $L_{2}$-regularized loss can be achieved with at most $N(N+1)$ neurons in each hidden layer (where $N$ is the size of the training set). We show that this bound is tight by giving an example of a local minimum which requires $N^{2}/4$ hidden neurons. But we also observe numerically that in more traditional settings far fewer than $N^{2}$ neurons are required to reach the minima.
    A Closer Look at Invalid Action Masking in Policy Gradient Algorithms. (arXiv:2006.14171v3 [cs.LG] UPDATED)
    In recent years, Deep Reinforcement Learning (DRL) algorithms have achieved state-of-the-art performance in many challenging strategy games. Because these games have complicated rules, an action sampled from the full discrete action distribution predicted by the learned policy is likely to be invalid according to the game rules (e.g., walking into a wall). The usual approach to deal with this problem in policy gradient algorithms is to "mask out" invalid actions and just sample from the set of valid actions. The implications of this process, however, remain under-investigated. In this paper, we 1) show theoretical justification for such a practice, 2) empirically demonstrate its importance as the space of invalid actions grows, and 3) provide further insights by evaluating different action masking regimes, such as removing masking after an agent has been trained using masking. The source code can be found at https://github.com/vwxyzjn/invalid-action-masking
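    As a concrete illustration of the masking operation, the snippet below sets the logits of invalid actions to a large negative constant before constructing the categorical distribution, so invalid actions get (numerically) zero probability and their logits receive no gradient; the particular logits and mask are arbitrary.
        import torch
        from torch.distributions import Categorical

        logits = torch.tensor([1.2, 0.3, -0.5, 2.1], requires_grad=True)
        valid = torch.tensor([True, False, True, True])   # action 1 is invalid

        # Replace invalid logits with a constant; gradients only reach valid ones.
        masked_logits = torch.where(valid, logits, torch.tensor(-1e8))
        dist = Categorical(logits=masked_logits)
        action = dist.sample()              # never samples the invalid action
        log_prob = dist.log_prob(action)    # used as usual in the policy gradient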
    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. (arXiv:2107.07511v4 [cs.LG] UPDATED)
    Black-box machine learning methods are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Distribution-free uncertainty quantification (distribution-free UQ) is a user-friendly paradigm for creating statistically rigorous confidence intervals/sets for such predictions. Critically, the intervals/sets are valid without distributional assumptions or model assumptions, possessing explicit guarantees even with finitely many datapoints. Moreover, they adapt to the difficulty of the input; when the input example is difficult, the uncertainty intervals/sets are large, signaling that the model might be wrong. Without much work and without retraining, one can use distribution-free methods on any underlying algorithm, such as a neural network, to produce confidence sets guaranteed to contain the ground truth with a user-specified probability, such as 90%. Indeed, the methods are easy-to-understand and general, applying to many modern prediction problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction is aimed at a reader interested in the practical implementation of distribution-free UQ who is not necessarily a statistician. We lead the reader through the practical theory and applications of distribution-free UQ, beginning with conformal prediction and culminating with distribution-free control of any risk, such as the false-discovery rate, false positive rate of out-of-distribution detection, and so on. We will include many explanatory illustrations, examples, and code samples in Python, with PyTorch syntax. The goal is to provide the reader with a working understanding of distribution-free UQ, allowing them to put confidence intervals on their algorithms, with one self-contained document.
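    In that spirit, here is a self-contained sketch of split conformal prediction for classification (our simplified version, not code from the tutorial): calibrate a score threshold on held-out data, then return, for each test point, the set of classes whose scores clear it.
        import numpy as np

        def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
            """Prediction sets with >= 1 - alpha marginal coverage."""
            n = len(cal_labels)
            # Nonconformity score: one minus the softmax mass on the true class.
            scores = 1.0 - cal_probs[np.arange(n), cal_labels]
            # Finite-sample-corrected quantile of the calibration scores.
            q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
            # Include every class whose score would fall below the threshold.
            return test_probs >= 1.0 - q    # boolean (n_test, n_classes) sets

        rng = np.random.default_rng(0)
        cal_probs = rng.dirichlet(np.ones(5), size=1000)   # stand-in model outputs
        cal_labels = rng.integers(0, 5, size=1000)
        test_probs = rng.dirichlet(np.ones(5), size=10)
        print(conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1))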
    Simulation-Based Inference with WALDO: Perfectly Calibrated Confidence Regions Using Any Prediction or Posterior Estimation Algorithm. (arXiv:2205.15680v1 [stat.ML])
    The vast majority of modern machine learning targets prediction problems, with algorithms such as Deep Neural Networks revolutionizing the accuracy of point predictions for high-dimensional complex data. Predictive approaches are now used in many domain sciences to directly estimate internal parameters of interest in theoretical simulator-based models. In parallel, common alternatives focus on estimating the full posterior using modern neural density estimators such as normalizing flows. However, an open problem in simulation-based inference (SBI) is how to construct properly calibrated confidence regions for internal parameters with nominal conditional coverage and high power. Many SBI methods are indeed known to produce overly confident posterior approximations, yielding misleading uncertainty estimates. Similarly, existing approaches for uncertainty quantification in deep learning provide no guarantees on conditional coverage. In this work, we present WALDO, a novel method for constructing correctly calibrated confidence regions in SBI. WALDO reframes the well-known Wald test and uses Neyman inversion to convert point predictions and posteriors from any prediction or posterior estimation algorithm to confidence sets with correct conditional coverage, even for finite sample sizes. As a concrete example, we demonstrate how a recently proposed deep learning prediction approach for particle energies in high-energy physics can be recalibrated using WALDO to produce confidence intervals with correct coverage and high power.
    Infinite-dimensional optimization and Bayesian nonparametric learning of stochastic differential equations. (arXiv:2205.15368v1 [stat.ML])
    The paper has two major themes. The first part of the paper establishes certain general results for infinite-dimensional optimization problems on Hilbert spaces. These results cover the classical representer theorem and many of its variants as special cases and offer a wider scope of applications. The second part of the paper then develops a systematic approach for learning the drift function of a stochastic differential equation by integrating the results of the first part with a Bayesian hierarchical framework. Importantly, our Bayesian approach incorporates low-cost sparse learning through proper use of shrinkage priors while allowing proper quantification of uncertainty through posterior distributions. Several examples at the end illustrate the accuracy of our learning scheme.
    Minimax Classification under Concept Drift with Multidimensional Adaptation and Performance Guarantees. (arXiv:2205.15942v1 [stat.ML])
    The statistical characteristics of instance-label pairs often change with time in practical scenarios of supervised classification. Conventional learning techniques adapt to such concept drift accounting for a scalar rate of change by means of a carefully chosen learning rate, forgetting factor, or window size. However, the time changes in common scenarios are multidimensional, i.e., different statistical characteristics often change in a different manner. This paper presents adaptive minimax risk classifiers (AMRCs) that account for multidimensional time changes by means of a multivariate and high-order tracking of the time-varying underlying distribution. In addition, differently from conventional techniques, AMRCs can provide computable tight performance guarantees. Experiments on multiple benchmark datasets show the classification improvement of AMRCs compared to the state-of-the-art and the reliability of the presented performance guarantees.
    Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity. (arXiv:2205.15466v1 [cs.LG])
    This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originating from the cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive a lower bound on the sample complexity for Banzhaf value approximation, and we show that our MSR algorithm's sample complexity nearly matches the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
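    The Maximum Sample Reuse idea admits a short sketch, as we read it: draw subsets by including each point independently with probability 1/2, evaluate the utility of each subset once, and reuse every evaluation for all points. The toy utility and all names below are ours.
        import numpy as np

        def banzhaf_msr(n, utility, n_samples=2000, rng=None):
            if rng is None:
                rng = np.random.default_rng(0)
            # Each point joins a subset independently with probability 1/2.
            subsets = rng.random((n_samples, n)) < 0.5
            utils = np.array([utility(s) for s in subsets])
            values = np.empty(n)
            for i in range(n):
                inc = subsets[:, i]
                # Reuse every evaluation: subsets with point i vs. without it.
                # (With enough samples, both splits are nonempty w.h.p.)
                values[i] = utils[inc].mean() - utils[~inc].mean()
            return values

        # Toy utility: points 0-4 are "clean" and helpful, the rest are noise.
        utility = lambda s: s[:5].sum() - 0.1 * s[5:].sum()
        print(banzhaf_msr(10, utility))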
    One Policy is Enough: Parallel Exploration with a Single Policy is Minimax Optimal for Reward-Free Reinforcement Learning. (arXiv:2205.15891v1 [cs.LG])
    While parallelism has been extensively used in Reinforcement Learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL for linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature focused on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Further, we show that this simple procedure is minimax optimal up to logarithmic factors in the reward-free setting for both linear MDPs and two-player zero-sum MGs. From a practical perspective, our paper shows that a single policy is sufficient and provably optimal for incorporating parallelism during the exploration phase.
    Inducing bias is simpler than you think. (arXiv:2205.15935v1 [cs.LG])
    Machine learning may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group representation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. To counter this, some of the model accuracy can be traded off for a secondary objective that helps prevent a specific type of bias. Multiple notions of fairness have been proposed to this end but recent studies show that some fairness criteria often stand in mutual competition. In the present work, we introduce a solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical behaviour of learning models trained in our synthetic framework and find similar unfairness behaviours as those observed on more realistic data. However, we also identify a positive transfer effect between the different subpopulations within the data. This suggests that mixing data with different statistical properties could be helpful, provided the learning model is made aware of this structure. Finally, we analyse the issue of bias mitigation: by reweighting the various terms in the training loss, we indirectly minimise standard unfairness metrics and highlight their incompatibilities. Leveraging the insights on positive transfer, we also propose a theory-informed mitigation strategy, based on the introduction of coupled learning models. By allowing each model to specialise on a different community within the data, we find that multiple fairness criteria and high accuracy can be achieved simultaneously.
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v1 [cs.LG])
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Posterior and Computational Uncertainty in Gaussian Processes. (arXiv:2205.15449v1 [cs.LG])
    Gaussian processes scale prohibitively with the size of the dataset. In response, many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. Therefore in practice, GP models are often as much about the approximation method as they are about the data. Here, we develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended. The most common GP approximations map to an instance in this class, such as methods based on the Cholesky factorization, conjugate gradients, and inducing points. For any method in this class, we prove (i) convergence of its posterior mean in the associated RKHS, (ii) decomposability of its combined posterior covariance into mathematical and computational covariances, and (iii) that the combined variance is a tight worst-case bound for the squared error between the method's posterior mean and the latent function. Finally, we empirically demonstrate the consequences of ignoring computational uncertainty and show how implicitly modeling it improves generalization performance on benchmark datasets.
    Fast Predictive Uncertainty for Classification with Bayesian Deep Networks. (arXiv:2003.01227v4 [cs.LG] UPDATED)
    In Bayesian Deep Learning, distributions over the output of classification neural networks are often approximated by first constructing a Gaussian distribution over the weights, then sampling from it to receive a distribution over the softmax outputs. This is costly. We reconsider old work (Laplace Bridge) to construct a Dirichlet approximation of this softmax output distribution, which yields an analytic map between Gaussian distributions in logit space and Dirichlet distributions (the conjugate prior to the Categorical distribution) in the output space. Importantly, the vanilla Laplace Bridge comes with certain limitations. We analyze those and suggest a simple solution that compares favorably to other commonly used estimates of the softmax-Gaussian integral. We demonstrate that the resulting Dirichlet distribution has multiple advantages, in particular, more efficient computation of the uncertainty estimate and scaling to large datasets and networks like ImageNet and DenseNet. We further demonstrate the usefulness of this Dirichlet approximation by using it to construct a lightweight uncertainty-aware output ranking for ImageNet.
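    For reference, a sketch of the analytic Gaussian-to-Dirichlet map as we recall it from the Laplace Bridge line of work; verify the exact expression against the paper before relying on it.
        import numpy as np

        def laplace_bridge(mu, sigma_diag):
            """Map a Gaussian N(mu, diag(sigma_diag)) over logits to Dirichlet alphas."""
            K = len(mu)
            return (1.0 / sigma_diag) * (
                1.0 - 2.0 / K + np.exp(mu) * np.exp(-mu).sum() / K**2
            )

        mu = np.array([2.0, 0.5, -1.0])          # logit means from a Laplace approx.
        sigma_diag = np.array([0.3, 0.4, 0.5])   # marginal logit variances
        alpha = laplace_bridge(mu, sigma_diag)
        print(alpha, alpha / alpha.sum())        # Dirichlet params and mean probs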
    Attribution-based Explanations that Provide Recourse Cannot be Robust. (arXiv:2205.15834v1 [stat.ML])
    Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to get actionable options for recourse, which allow an affected user to change the decision $f(x)$ of a machine learning system by making limited changes to its input $x$. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decisions are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input $x$ that is being explained should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of $x$. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets with multiple attributions. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of $x$, by providing an exact characterization of the functions $f$ to which impossibility applies.
    coVariance Neural Networks. (arXiv:2205.15856v1 [cs.LG])
    Graph neural networks (GNN) are an effective framework that exploit inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we propose a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus, implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions; a feature that is infeasible for PCA-based approaches.
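    The covariance-filter analogy can be made concrete with a small illustrative layer (our construction, not the paper's code): a trainable polynomial in the sample covariance matrix applied to the input signals.
        import torch

        class CovarianceFilter(torch.nn.Module):
            """Applies sum_k h_k C^k to each input signal; h are trainable."""
            def __init__(self, order=3):
                super().__init__()
                self.h = torch.nn.Parameter(torch.randn(order + 1) * 0.1)

            def forward(self, C, x):
                """C: (d, d) sample covariance; x: (batch, d) signals."""
                out, z = self.h[0] * x, x
                for k in range(1, len(self.h)):
                    z = z @ C                  # one more power of C
                    out = out + self.h[k] * z
                return torch.relu(out)

        X = torch.randn(200, 16)
        C = torch.cov(X.t())                   # (16, 16) sample covariance
        layer = CovarianceFilter(order=3)
        print(layer(C, X).shape)               # (200, 16)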
    Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap. (arXiv:2201.04469v6 [stat.ML] UPDATED)
    We consider fixed-budget best arm identification in two-armed bandit problems. One of the longstanding open questions is a tight lower bound on the probability of misidentifying the best arm and a strategy whose upper bound matches the lower bound when the optimal target allocation ratio of arm draws is unknown. We address this problem when the gap between the expected rewards is small. First, we introduce a distribution-dependent lower bound. Then, we propose the ``RS-AIPW'' strategy, which consists of the random sampling (RS) rule using the estimated optimal target allocation ratio and the recommendation rule using the augmented inverse probability weighting (AIPW) estimator. Our proposed strategy is optimal in the sense that the upper bound achieves the lower bound when the budget goes to infinity and the gap goes to zero. In the course of the analysis, we present a novel large deviation bound for martingales.
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v1 [cs.LG])
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O}\left( \min\left( H^{3/2}/N, H/\sqrt{N} \right) \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
    The CLRS Algorithmic Reasoning Benchmark. (arXiv:2205.15659v1 [cs.LG])
    Learning representations of algorithms is an emerging area of machine learning, seeking to bridge concepts from neural networks with classical algorithms. Several important works have investigated whether neural networks can effectively reason like algorithms, typically by learning to execute them. The common trend in the area, however, is to generate targeted kinds of algorithmic data to evaluate specific hypotheses, making results hard to transfer across publications, and increasing the barrier of entry. To consolidate progress and work towards unified evaluation, we propose the CLRS Algorithmic Reasoning Benchmark, covering classical algorithms from the Introduction to Algorithms textbook. Our benchmark spans a variety of algorithmic reasoning procedures, including sorting, searching, dynamic programming, graph algorithms, string algorithms and geometric algorithms. We perform extensive experiments to demonstrate how several popular algorithmic reasoning baselines perform on these tasks, and consequently, highlight links to several open challenges. Our library is readily available at https://github.com/deepmind/clrs.  ( 2 min )
    Polynomial-time Sparse Deconvolution. (arXiv:2204.07879v2 [cs.LG] UPDATED)
    How can a probability measure be recovered with sparse support from its generalized moments? This problem, called sparse deconvolution, has been the focus of research in mathematics, theoretical computer science, and neural computing. However, no polynomial-time algorithm has been known for the recovery. The best previous algorithm requires $O\left(\text{dimension}^{\text{poly}(1/\epsilon)}\right)$ computations for $\epsilon$-accurate recovery. We propose the first poly-time recovery method from carefully designed moments that requires $O\left(\text{dimension}^4\log(1/\epsilon)/\epsilon^2\right)$ computations for an $\epsilon$-accurate recovery. This method relies on learning a planted two-layer neural network with two-dimensional inputs, a finite width, and zero-one activation. For learning such networks, we establish the first poly-time complexity, and demonstrate its application in sparse deconvolution.  ( 2 min )
    PDE-based Group Equivariant Convolutional Neural Networks. (arXiv:2001.09046v6 [cs.LG] UPDATED)
    We present a PDE-based framework that generalizes Group equivariant Convolutional Neural Networks (G-CNNs). In this framework, a network layer is seen as a set of PDE-solvers where geometrically meaningful PDE-coefficients become the layer's trainable weights. Formulating our PDEs on homogeneous spaces allows these networks to be designed with built-in symmetries such as rotation in addition to the standard translation equivariance of CNNs. Having all the desired symmetries included in the design obviates the need to include them by means of costly techniques such as data augmentation. We will discuss our PDE-based G-CNNs (PDE-G-CNNs) in a general homogeneous space setting while also going into the specifics of our primary case of interest: roto-translation equivariance. We solve the PDE of interest by a combination of linear group convolutions and non-linear morphological group convolutions with analytic kernel approximations that we underpin with formal theorems. Our kernel approximations allow for fast GPU-implementation of the PDE-solvers, we release our implementation with this article in the form of the LieTorch extension to PyTorch, available at https://gitlab.com/bsmetsjr/lietorch . Just like for linear convolution a morphological convolution is specified by a kernel that we train in our PDE-G-CNNs. In PDE-G-CNNs we do not use non-linearities such as max/min-pooling and ReLUs as they are already subsumed by morphological convolutions. We present a set of experiments to demonstrate the strength of the proposed PDE-G-CNNs in increasing the performance of deep learning based imaging applications with far fewer parameters than traditional CNNs.  ( 3 min )
    Online Meta-Learning in Adversarial Multi-Armed Bandits. (arXiv:2205.15921v1 [cs.LG])
    We study meta-learning for adversarial multi-armed bandits. We consider the online-within-online setup, in which a player (learner) encounters a sequence of multi-armed bandit episodes. The player's performance is measured as regret against the best arm in each episode, according to the losses generated by an adversary. The difficulty of the problem depends on the empirical distribution of the per-episode best arm chosen by the adversary. We present an algorithm that can leverage the non-uniformity in this empirical distribution, and derive problem-dependent regret bounds. This solution comprises an inner learner that plays each episode separately, and an outer learner that updates the hyper-parameters of the inner algorithm between the episodes. In the case where the best arm distribution is far from uniform, it improves upon the best bound that can be achieved by any online algorithm executed on each episode individually without meta-learning.  ( 2 min )
    Will Bilevel Optimizers Benefit from Loops. (arXiv:2205.14224v2 [cs.LG] UPDATED)
    Bilevel optimization has arisen as a powerful tool for solving a variety of machine learning problems. Two current popular bilevel optimizers AID-BiO and ITD-BiO naturally involve solving one or two sub-problems, and consequently, whether we solve these problems with loops (that take many iterations) or without loops (that take only a few iterations) can significantly affect the overall computational efficiency. Existing studies in the literature cover only some of those implementation choices, and the complexity bounds available are not refined enough to enable rigorous comparison among different implementations. In this paper, we first establish unified convergence analysis for both AID-BiO and ITD-BiO that are applicable to all implementation choices of loops. We then specialize our results to characterize the computational complexity for all implementations, which enable an explicit comparison among them. Our result indicates that for AID-BiO, the loop for estimating the optimal point of the inner function is beneficial for overall efficiency, although it causes higher complexity for each update step, and the loop for approximating the outer-level Hessian-inverse-vector product reduces the gradient complexity. For ITD-BiO, the two loops always coexist, and our convergence upper and lower bounds show that such loops are necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing convergence error. Our numerical experiments further corroborate our theoretical results.  ( 2 min )
    Variational inference via Wasserstein gradient flows. (arXiv:2205.15902v1 [stat.ML])
    Along with Markov chain Monte Carlo (MCMC) methods, variational inference (VI) has emerged as a central computational approach to large-scale Bayesian inference. Rather than sampling from the true posterior $\pi$, VI aims at producing a simple but effective approximation $\hat \pi$ to $\pi$ for which summary statistics are easy to compute. However, unlike the well-studied MCMC methodology, VI is still poorly understood and dominated by heuristics. In this work, we propose principled methods for VI, in which $\hat \pi$ is taken to be a Gaussian or a mixture of Gaussians, which rest upon the theory of gradient flows on the Bures-Wasserstein space of Gaussian measures. Akin to MCMC, it comes with strong theoretical guarantees when $\pi$ is log-concave.  ( 2 min )
    Optimally adaptive Bayesian spectral density estimation for stationary and nonstationary processes. (arXiv:2003.02367v3 [stat.ME] UPDATED)
    This article improves on existing methods to estimate the spectral density of stationary and nonstationary time series assuming a Gaussian process prior. By optimising an appropriate eigendecomposition using a smoothing spline covariance structure, our method more appropriately models data with both simple and complex periodic structure. We further justify the utility of this optimal eigendecomposition by investigating the performance of alternative covariance functions other than smoothing splines. We show that the optimal eigendecomposition provides a material improvement, while the other covariance functions under examination do not, performing only comparably to the smoothing spline. During our computational investigation, we introduce new validation metrics for the spectral density estimate, inspired by the physical sciences. We validate our models in an extensive simulation study and demonstrate superior performance with real data.  ( 2 min )
    Unbalanced CO-Optimal Transport. (arXiv:2205.14923v2 [stat.ML] UPDATED)
    Optimal transport (OT) compares probability distributions by computing a meaningful alignment between their samples. CO-optimal transport (COOT) takes this comparison further by inferring an alignment between features as well. While this approach leads to better alignments and generalizes both OT and Gromov-Wasserstein distances, we provide a theoretical result showing that it is sensitive to outliers that are omnipresent in real-world data. This prompts us to propose unbalanced COOT for which we provably show its robustness to noise in the compared datasets. To the best of our knowledge, this is the first such result for OT methods in incomparable spaces. With this result in hand, we provide empirical evidence of this robustness for the challenging tasks of heterogeneous domain adaptation with and without varying proportions of classes and simultaneous alignment of samples and features across single-cell measurements.  ( 2 min )
    Few-Shot Diffusion Models. (arXiv:2205.15463v1 [cs.CV])
    Denoising diffusion probabilistic models (DDPM) are powerful hierarchical latent variable models with remarkable sample generation quality and training stability. These properties can be attributed to parameter sharing in the generative hierarchy, as well as a parameter-free diffusion-based inference procedure. In this paper, we present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs. FSDMs are trained to adapt the generative process conditioned on a small set of images from a given class by aggregating image patch information using a set-based Vision Transformer (ViT). At test time, the model is able to generate samples from previously unseen classes conditioned on as few as 5 samples from that class. We empirically show that FSDM can perform few-shot generation and transfer to new datasets. We benchmark variants of our method on complex vision datasets for few-shot learning and compare to unconditional and conditional DDPM baselines. Additionally, we show how conditioning the model on patch-based input set information improves training convergence.  ( 2 min )
    QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning. (arXiv:2106.00797v3 [cs.LG] UPDATED)
    The objective of Federated Learning (FL) is to perform statistical inference for data which are decentralised and stored locally on networked clients. FL raises many constraints which include privacy and data ownership, communication overhead, statistical heterogeneity, and partial client participation. In this paper, we address these problems in the framework of the Bayesian paradigm. To this end, we propose a novel federated Markov Chain Monte Carlo algorithm, referred to as Quantised Langevin Stochastic Dynamics (QLSD), which may be seen as an extension of Stochastic Gradient Langevin Dynamics to the FL setting and which handles the communication bottleneck using gradient compression. To improve performance, we then introduce variance reduction techniques, which lead to two improved versions coined \texttt{QLSD}$^{\star}$ and \texttt{QLSD}$^{++}$. We give both non-asymptotic and asymptotic convergence guarantees for the proposed algorithms. We illustrate their performance using various Bayesian Federated Learning benchmarks.  ( 2 min )
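    Schematically, the recipe combines two standard ingredients, sketched below with toy Gaussian potentials: unbiased stochastic quantization of client gradients, and a server-side Langevin step on the aggregate. This is our illustration of the general idea, not the paper's QLSD algorithm verbatim.
        import numpy as np

        rng = np.random.default_rng(0)

        def stochastic_quantize(g, levels=16):
            """Unbiased stochastic rounding of g onto a coarse grid (compression)."""
            scale = np.abs(g).max() / levels or 1.0
            low = np.floor(g / scale)
            prob = g / scale - low
            return scale * (low + (rng.random(g.shape) < prob))

        def langevin_step(theta, grads, step=1e-2):
            g = np.mean(grads, axis=0)           # aggregate client gradients
            noise = rng.normal(size=theta.shape)
            return theta - 0.5 * step * g + np.sqrt(step) * noise

        # Toy setup: each client's potential gradient is theta minus its mean.
        client_means = [np.array([1.0, -1.0]), np.array([3.0, 1.0])]
        theta = np.zeros(2)
        for _ in range(1000):
            grads = [stochastic_quantize(theta - m) for m in client_means]
            theta = langevin_step(theta, grads)
        print(theta)   # samples wander around the posterior mean, roughly [2, 0]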
    Smoothed Online Learning is as Easy as Statistical Learning. (arXiv:2202.04690v3 [stat.ML] UPDATED)
    Much of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.  ( 2 min )
    Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms. (arXiv:2009.09538v2 [cs.LG] UPDATED)
    EXP-based algorithms are often used for exploration in non-stochastic bandit problems assuming rewards are bounded. We propose a new algorithm, namely EXP4.P, by modifying EXP4 and establish its upper bound of regret in both bounded and unbounded sub-Gaussian contextual bandit settings. The unbounded reward result also holds for a revised version of EXP3.P. Moreover, we provide a lower bound on regret that suggests no sublinear regret can be achieved given a short time horizon. Unlike classical analyses, none of our analyses require bounded rewards. We also extend EXP4.P from contextual bandits to reinforcement learning to incentivize exploration by multiple agents given black-box rewards. The resulting algorithm has been tested on hard-to-explore games and it shows an improvement in exploration compared to the state-of-the-art.  ( 2 min )
    Cross-view kernel transfer. (arXiv:1910.05964v2 [cs.LG] UPDATED)
    We consider the kernel completion problem with the presence of multiple views in the data. In this context the data samples can be fully missing in some views, creating missing columns and rows to the kernel matrices that are calculated individually for each view. We propose to solve the problem of completing the kernel matrices with Cross-View Kernel Transfer (CVKT) procedure, in which the features of the other views are transformed to represent the view under consideration. The transformations are learned with kernel alignment to the known part of the kernel matrix, allowing for finding generalizable structures in the kernel matrix under completion. Its missing values can then be predicted with the data available in other views. We illustrate the benefits of our approach with simulated data, multivariate digits dataset and multi-view dataset on gesture classification, as well as with real biological datasets from studies of pattern formation in early \textit{Drosophila melanogaster} embryogenesis.  ( 2 min )
    Learning (Very) Simple Generative Models Is Hard. (arXiv:2205.16003v1 [cs.LG])
    Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem. For an unknown neural network $F:\mathbb{R}^d\to\mathbb{R}^{d'}$, let $D$ be the distribution over $\mathbb{R}^{d'}$ given by pushing the standard Gaussian $\mathcal{N}(0,\textrm{Id}_d)$ through $F$. Given i.i.d. samples from $D$, the goal is to output any distribution close to $D$ in statistical distance. We show under the statistical query (SQ) model that no polynomial-time algorithm can solve this problem even when the output coordinates of $F$ are one-hidden-layer ReLU networks with $\log(d)$ neurons. Previously, the best lower bounds for this problem simply followed from lower bounds for supervised learning and required at least two hidden layers and $\mathrm{poly}(d)$ neurons [Daniely-Vardi '21, Chen-Gollakota-Klivans-Meka '22]. The key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $f$ with polynomially-bounded slopes such that the pushforward of $\mathcal{N}(0,1)$ under $f$ matches all low-degree moments of $\mathcal{N}(0,1)$.  ( 2 min )
    Improvements to Supervised EM Learning of Shared Kernel Models by Feature Space Partitioning. (arXiv:2205.15304v1 [cs.LG])
    Expectation maximisation (EM) is usually thought of as an unsupervised learning method for estimating the parameters of a mixture distribution; however, it can also be used for supervised learning when class labels are available. As such, EM has been applied to train neural nets including the probabilistic radial basis function (PRBF) network or shared kernel (SK) model. This paper addresses two major shortcomings of previous work in this area: the lack of rigour in the derivation of the EM training algorithm; and the computational complexity of the technique, which has limited it to low-dimensional data sets. We first present a detailed derivation of EM for the Gaussian shared kernel model PRBF classifier, making use of data association theory to obtain the complete data likelihood, Baum's auxiliary function (the E-step) and its subsequent maximisation (M-step). To reduce complexity of the resulting SKEM algorithm, we partition the feature space into $R$ non-overlapping subsets of variables. The resulting product decomposition of the joint data likelihood, which is exact when the feature partitions are independent, allows the SKEM to be implemented in parallel and at $R^2$ times lower complexity. The operation of the partitioned SKEM algorithm is demonstrated on the MNIST data set and compared with its non-partitioned counterpart. It eventuates that improved performance at reduced complexity is achievable. Comparisons with standard classification algorithms are provided on a number of other benchmark data sets.  ( 2 min )
    Optimal Transport of Classifiers to Fairness. (arXiv:2202.03814v2 [cs.LG] UPDATED)
    In past work on fairness in machine learning, the focus has been on forcing the prediction of classifiers to have similar statistical properties for people of different demographics. To reduce the violation of these properties, fairness methods usually simply rescale the classifier scores, ignoring similarities and dissimilarities between members of different groups. Yet, we hypothesize that such information is relevant in quantifying the unfairness of a given classifier. To validate this hypothesis, we introduce Optimal Transport to Fairness (OTF), a method that quantifies the violation of fairness constraints as the smallest Optimal Transport cost between a probabilistic classifier and any score function that satisfies these constraints. For a flexible class of linear fairness constraints, we construct a practical way to compute OTF as a differentiable fairness regularizer that can be added to any standard classification setting. Experiments show that OTF can be used to achieve an improved trade-off between predictive power and fairness.  ( 2 min )
    Ensemble methods for survival function estimation with time-varying covariates. (arXiv:2006.00567v6 [stat.AP] UPDATED)
    Survival data with time-varying covariates are common in practice. If relevant, they can improve the estimation of the survival function. However, the traditional survival forests - conditional inference forest, relative risk forest and random survival forest - have accommodated only time-invariant covariates. We generalize the conditional inference and relative risk forests to allow time-varying covariates. We also propose a general framework for estimation of a survival function in the presence of time-varying covariates. We compare their performance with that of the Cox model and transformation forest, adapted here to accommodate time-varying covariates, through a comprehensive simulation study in which the Kaplan-Meier estimate serves as a benchmark, and performance is compared using the integrated L2 difference between the true and estimated survival functions. In general, the performance of the two proposed forests substantially improves over the Kaplan-Meier estimate. Taking into account all other factors, under the proportional hazard (PH) setting, the best method is always one of the two proposed forests, while under the non-PH setting, it is the adapted transformation forest. K-fold cross-validation is used as an effective tool to choose between the methods in practice.  ( 2 min )
    Critic Sequential Monte Carlo. (arXiv:2205.15460v1 [stat.ML])
    We introduce CriticSMC, a new algorithm for planning as inference built from a novel composition of sequential Monte Carlo with learned soft-Q function heuristic factors. This algorithm is structured so as to allow using large numbers of putative particles, leading to efficient utilization of computational resources and effective discovery of high reward trajectories even in environments with difficult reward surfaces such as those arising from hard constraints. Relative to prior art, our approach is notably still compatible with model-free reinforcement learning in the sense that the implicit policy we produce can be used at test time in the absence of a world model. Our experiments on self-driving car collision avoidance in simulation demonstrate improvements against baselines in terms of infraction minimization relative to computational effort while maintaining diversity and realism of found trajectories.  ( 2 min )
    Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization. (arXiv:2205.15784v1 [stat.CO])
    Bayesian Likelihood-Free Inference methods yield posterior approximations for simulator models with intractable likelihood. Recently, many works trained neural networks to approximate either the intractable likelihood or the posterior directly. Most proposals use normalizing flows, namely neural networks parametrizing invertible maps used to transform samples from an underlying base measure; the probability density of the transformed samples is then accessible and the normalizing flow can be trained via maximum likelihood on simulated parameter-observation pairs. A recent work [Ramesh et al., 2022] approximated instead the posterior with generative networks, which drop the invertibility requirement and are thus a more flexible class of distributions scaling to high-dimensional and structured data. However, generative networks only allow sampling from the parametrized distribution; for this reason, Ramesh et al. [2022] follow the common solution of adversarial training, where the generative network plays a min-max game against a "critic" network. This procedure is unstable and can lead to a learned distribution underestimating the uncertainty - in extreme cases collapsing to a single point. Here, we propose to approximate the posterior with generative networks trained by Scoring Rule minimization, an overlooked adversarial-free method enabling smooth training and better uncertainty quantification. In simulation studies, the Scoring Rule approach yields better performance with shorter training time than the adversarial framework.  ( 2 min )
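    The abstract does not name the specific scoring rule; the energy score is one standard strictly proper scoring rule with an unbiased sample estimate, so a per-observation training loss might look like the following PyTorch sketch (illustrative only; the function name and interface are ours):

        import torch

        def energy_score_loss(samples, y, beta=1.0):
            """Unbiased sample estimate of the energy score.

            samples: (m, d) draws from the generative net for one observation
            y:       (d,)   the observed data point
            Lower is better; strictly proper for beta in (0, 2).
            """
            m = samples.shape[0]
            term1 = torch.cdist(samples, y.unsqueeze(0)).pow(beta).mean()
            pair = torch.cdist(samples, samples).pow(beta)
            term2 = pair.sum() / (m * (m - 1))   # mean over off-diagonal pairs
            return term1 - 0.5 * term2

    Averaging this loss over a batch of simulated parameter-observation pairs, with the samples produced by a reparameterized generator, gives a gradient signal without any critic network, which is what makes the training smooth.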
    Holistic Generalized Linear Models. (arXiv:2205.15447v1 [stat.ML])
    Holistic linear regression extends the classical best subset selection problem by adding additional constraints designed to improve the model quality. These constraints include sparsity-inducing constraints, sign-coherence constraints and linear constraints. The $\textsf{R}$ package $\texttt{holiglm}$ provides functionality to model and fit holistic generalized linear models. By making use of state-of-the-art conic mixed-integer solvers, the package can reliably solve GLMs for Gaussian, binomial and Poisson responses with a multitude of holistic constraints. The high-level interface simplifies the constraint specification and can be used as a drop-in replacement for the $\texttt{stats::glm()}$ function.  ( 2 min )
  • Open

    Qualitative humanities research is crucial to AI
    “All research is qualitative; some is also quantitative,” notes Harvard social scientist and statistician Gary King. Suppose you wanted to find out whether a machine learning system being adopted - to recruit candidates, lend money, or predict future criminality - exhibited racial bias. You might calculate model performance across groups with different races. But how was race categorised: through a census record, a police officer’s guess, or by an annotator? Each possible answer raises another set of questions. Following the thread of any seemingly quantitative issue around AI ethics quickly leads to a host of qualitative questions. Throughout AI, qualitative decisions are made about what metrics to optimise for, which categories to use, how to define their bounds, and who applies the labels. Simila…  ( 8 min )

  • Open

    How to Optimize your AI Models at Inference Time
    submitted by /u/aidev2040 [link] [comments]
    New AI Robot Hand Designer Creates Hyper Efficient Manipulators | Self Driving Autonomous Vehicle AI For Spacecraft | Coral Reef AI Tool
    submitted by /u/tohelpyou88 [link] [comments]  ( 1 min )
    Why are neural networks not multi-dimensional?
    Title says almost everything. I believe this might solve sequential deep learning once and for all (as opposed to current methods such as LSTM, whose architecture I find arbitrary). Although I can envision multi-dimensional neural networks in multiple shapes, I will elaborate on what I think is the most straightforward shape. Input: imagine the most widely used example of sequential data, the moving ball. Input data is 2d, structured as follows: axis 0 is [x-coordinate, y-coordinate] (not to be confused with X as in input and y as in output); axis 1 is [timestep 1, ..., timestep k]. As such, one would end up with the following 2d array as input: [[x-coordinate, y-coordinate]t-k, ..., [x-coordinate, y-coordinate]t0]. Hidden layers: now imagine the following 3d structure used as hidden layers: axis 0 is a single hidden layer, i.e., [neuron, ..., neuron]; axis 1 is an entire network, i.e., [layer, ..., layer]; axis 2 is a web of identical networks, each connected to the input layer and network corresponding to its own timestep and the next. As such, one would end up with the following 3d structure as hidden layers: [[[neuron, ..., neuron], ..., [neuron, ..., neuron]]t-k, ..., [[neuron, ..., neuron], ..., [neuron, ..., neuron]]t-0]. But I can also imagine that the hidden layers are only a 2d structure. Output: in the case of this data, the output is 1d: axis 0 is [x-coordinate t+1, y-coordinate t+1]. Only the network corresponding to the last visible timestep will produce usable output. Schematic: I have created a schema in which the arrows denote which element is connected to which: https://ibb.co/LzZM6w6 I can imagine one concern as to why this has not been realized yet: for every added dimension, computational complexity increases quadratically. Is there any other concern and/or is my thinking sound? submitted by /u/Thijs-vW [link] [comments]  ( 5 min )
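    For what it's worth, the "web of identical networks, each connected to the input of its own timestep and the next" is essentially a recurrent network unrolled through time, which is how sequence models already realize this idea. A minimal PyTorch sketch of the described setup (coordinates in, next coordinate out; layer sizes are arbitrary):

        import torch
        import torch.nn as nn

        class BallPredictor(nn.Module):
            """One shared network applied per timestep, passing a hidden
            state to the next timestep -- i.e., an unrolled RNN."""
            def __init__(self, hidden=32):
                super().__init__()
                self.rnn = nn.RNN(input_size=2, hidden_size=hidden, batch_first=True)
                self.head = nn.Linear(hidden, 2)    # (x, y) at t+1

            def forward(self, coords):              # coords: (batch, k, 2)
                out, _ = self.rnn(coords)
                return self.head(out[:, -1])        # predict from last timestep

        model = BallPredictor()
        traj = torch.randn(8, 10, 2)                # 8 trajectories, 10 timesteps
        print(model(traj).shape)                    # torch.Size([8, 2])

    Because the per-timestep network is shared, the cost grows linearly (not quadratically) in sequence length.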
    Data platform beta test: private beta of customizable schema to fit your dataset formats
    Hi everyone, my name is Taylor and I work at Graviti - We are a cloud data platform for ML practitioners to better and faster manage unstructured data at a large scale. The platform hands developers the ability to do data query, version control, visualization and workflow automation on all types of data based on our powerful compute engine. Now we are launching a private beta of Graviti data platform v3.0 with a new feature -custom schema, which allows you to manage heterogeneous data in a tabular data model and fit your own data formats. Our goal is to find more potential users and receive their honest feedback from the test as well as help us co-build a better data platform for AI and machine learning. We need a group of people from the community who work closely with data in direction of computer vision, NLP, etc, and will be eager to test our data platform, share feedback and help us make it the best fit for more machine learning teams. We appreciate your time and valuable contribution and offer rewards of 3 months of free usage of Graviti data platform(compute included) as well as an Amazon gift card. Interested? Here is our application form. (We will process the application in 48 hours and contact you with further details. ) Feel free to leave comments or any thoughts here. Thank you! submitted by /u/Strong_Bookkeeper_78 [link] [comments]  ( 1 min )
  • Open

    AI MSc or Physics PhD?
    So I've got two options to choose from, and I was hoping that you could give your opinion about it. The first option is to do a (three year) PhD in physics. The other option is to do another master's in artificial intelligence. I'll also post this on some physics subreddit. But first a little bit about my background. I've got a BSc and (almost) MSc degree in physics. For my master's thesis, I'm developing machine learning techniques for better posterior estimation for astrophysical data. Obviously, I'm very interested in the further development of ML tools for astro/particle physics applications. The PhD is all about that, developing ML tools for astrophysics. However, I have some concerns about it. First, I am not sure if I want to stay in academia after my PhD. So would it even be a good idea to try to do a PhD? If I want to go into ML industry (my dream would be something like Google) after my PhD, how does a physics PhD heavily focused on ML compare to an AI PhD or MSc? The other option would be to do the AI master's after my physics master's. There are a few things that appeal to me about it. First, I can stay in the city I currently live in. For the PhD, I would have to move abroad. Moving abroad for three years kinda scares me. Secondly, the diversity of what I will learn will be a lot more compared to the PhD. During the master's you will be taught a lot of very interesting cutting edge ML techniques at a high pace, while during the PhD I will probably focus on a few in more detail. If I were to skip the physics PhD and go for the AI MSc, my plan would be to start looking for AI and/or physics PhDs close to graduating. Maybe important to know: this all takes place in north-west Europe, where one always does a master's before a PhD. Both universities are good (top ~100) but not top-notch. submitted by /u/elipeli54 [link] [comments]  ( 2 min )
    AI Robot Hand Designer Creates Hyper Efficient Manipulators | Self Driving Autonomous Vehicle AI For Spacecraft | Coral Reef AI Measure Ecosystem Health
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    In this series, we break down the basics of starting an NLP project from scratch. Check out the blog to learn more.
    submitted by /u/UBIAI [link] [comments]
    Fundamental ethical objection to seeking AGI?
    I came across a philosophical / ethical argument against seeking AGI the other day that I can't see a way past. It's extremely hypothetical with respect to our current progress with AI, but I was curious what others make of it. Basically, it goes like this. As we make AI more and more sophisticated, we gradually scale up the level of consciousness; say it's comparable to an insect (maybe we are close now?) to a cat to a chimp to a child, to a grown human, etc. Most people would say that the further along this scale you are, the more capable of suffering you are and the more rights you should have. So given the 'ease' with which computer programs are run and deleted, we could foresee that in the quest for AGI we could create and 'kill' billions of entities of comparable consciousness to a chimp or human child. So if it is possible to make an AGI, it will by definition require experimentation on many billions of near-AGIs, which by definition is morally equivalent to mass experimentation on / death of child-like beings. I see huge potential for all forms of AI for making the world better, but the above seems unconscionable to me. Obviously this is all in the realm of sci-fi now, but given most of us here would like to reach some form of AGI, and given we think it is possible at some point, how do we hypothetically get round this apparently fundamental issue? submitted by /u/bbbbbadtothe [link] [comments]  ( 3 min )
    AI Dream 53 - Cosmic Birth | Q: Grainy or Smooth?
    submitted by /u/LordPewPew777 [link] [comments]
    Why your AI chatbot doesn't get what you're asking it
    One of the typical reasons: the AI chatbot is trained badly, or you (or the conversational designer) chose a bad strategy for your chatbot training. Also, when creating a dataset for NLP, the language aspect wasn't taken into account. Suppose you wanted to create a multilingual AI chatbot that speaks languages with different linguistic structures, like German and Chinese. These languages have very different structures from each other. But don't worry, I recently worked on a guide that explores the basic steps in chatbot training before actual development and the best practices with conversational AI after the chatbot launch. Please let me know if there is anything I didn't cover. Read Tips on How to Train a Chatbot for Businesses submitted by /u/Avandegraund [link] [comments]  ( 1 min )
    What Hugging Face and Microsoft’s collaboration means for applied AI
    submitted by /u/bendee983 [link] [comments]
    Pothole Detector based on YoloV4
    submitted by /u/Gloomy_Recognition_4 [link] [comments]  ( 1 min )
    Google Has Banned the Training of Deepfakes in Colab, See Why?
    submitted by /u/Dip14099 [link] [comments]
    Can an AI write a dance review based on other reviews?
    Hi, I had the idea to let an AI write a dance review based on other reviews and then make a piece out of that. What would I need for that, if it is even possible? If you could help me somehow, please leave a comment. submitted by /u/NolanDeC [link] [comments]  ( 1 min )
    Are there any AI powered tools that could act as my eyes in video games?
    So my friends usually have to help me get through games when I play with them since I am blind IRL, and I am excited for the AI industry to possibly make a tool in the future that would allow me to simply point it at a game window and have it direct me to places, explain what I'm looking at, read item tooltips, etc. Does something like this exist? For example, if I'm in a game where there is typically an exclamation point to indicate a quest object, I could ask the AI to be on the lookout, and when it finds it, it would let me know and then direct me to it. This would be super useful if this is a thing, and I'm eager to hear what you guys have to say! Thanks! submitted by /u/ChipsAhoiMcCoy [link] [comments]  ( 2 min )
    AI Dreams - Dall-E 2, Wumbo AI, and Gigapixel AI - Thoughts on completely AI-Generated original artwork?
    submitted by /u/Dear_Watson [link] [comments]  ( 1 min )
    The Shapes And Patterns You'll Find In Higher Dimensions - 4K Creative Neural-Art Exploration
    submitted by /u/MLInsights [link] [comments]
  • Open

    [R] Solving Bayesian Inverse Problems via Variational Autoencoders
    Paper: https://proceedings.mlr.press/v145/goh22a.html Code: https://github.com/hwangoh/uq-vae This mathematically justified framework offers a data-driven approach to uncertainty quantification for Bayesian inverse problems. When a neural network comes into play, the information contained within a training dataset is embedded into the network; the output of which quantifies the uncertainty in the underlying parameter estimation problem using this information. submitted by /u/hwangoh [link] [comments]  ( 1 min )
    [D] DALL-E 2 Has Its Own Secret Language
    I discovered this through the thread here: Twitter Thread The full paper is here: Paper It seems that the garbled text that Dall-E 2 generates can be run back through the model to produce consistent results. Weird associations like “Apoploe vesrreaitais” meaning a specific type of bird in a lot of different contexts. Really cool find IMO! I guess it makes sense that if Dall-E can’t distinguish between these tokens semantically when generating images, it won’t be able to distinguish between them as prompts. Anyone know of similar results / other explanations for this phenomenon? submitted by /u/NMister_ [link] [comments]  ( 2 min )
    [D] How to choose loss function for classification problem?
    A loss function is a penalty function over predictions and true labels. The choice of loss function can have an impact on model performance. I wonder if there is a way to choose which loss function works better, in the context of classification problems. submitted by /u/flaubart9 [link] [comments]  ( 1 min )
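    One practical answer is to treat the loss as a hyperparameter and compare candidates on a validation set. As an illustration, here is plain cross-entropy next to a focal-loss variant, which down-weights easy examples (a common choice for imbalanced classes; the gamma value here is the conventional default, not a prescription):

        import torch
        import torch.nn.functional as F

        def focal_loss(logits, targets, gamma=2.0):
            """Cross-entropy scaled by (1 - p_t)^gamma to focus on hard examples."""
            ce = F.cross_entropy(logits, targets, reduction="none")
            p_t = torch.exp(-ce)                 # probability of the true class
            return ((1 - p_t) ** gamma * ce).mean()

        logits = torch.randn(16, 3)              # 16 samples, 3 classes
        targets = torch.randint(0, 3, (16,))
        print(F.cross_entropy(logits, targets).item(), focal_loss(logits, targets).item())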
    [D] [R] Any benchmark for entity-specific sentiment analysis?
    Is there a benchmark for entity-specific sentiment analysis? E.g., like what online sentiment analysis tools like Google Cloud's natural language API or Watson's NLP API does? (Example) E.g., if the sentence is "Google unveiled the new Android phone for $799", the sentiments would be about the entities, Google, Android etc. I'm trying to find out if there's a know benchmark for this task but have been coming up empty despite this task being a staple of all sentiment analysis tools. I am familiar with the aspect based sentiment analysis task (e.g., the SemEval-16 task 4.2). But it seemed to me like it's different from entity-specific sentiment analysis. The targets here are different aspects of the same entity. So, that was making me wonder, on if there's something more specific dataset that deals with entity-specific sentiment rather than aspect-based ones. submitted by /u/RustBucket03 [link] [comments]  ( 1 min )
    [R] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
    Paper: https://arxiv.org/abs/2205.14135 Twitter: https://twitter.com/tri_dao/status/1531437619791290369?t=UXOZXyk1p9CCrMJLlkDcDg&s=19 Abstract: " Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM…  ( 1 min )
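    The tiling idea can be sketched in a few lines: stream over key/value blocks while maintaining a running max and normalizer ("online softmax"), so the full n-by-n attention matrix is never materialized. This toy NumPy version illustrates only the algorithm's structure; the paper's contribution is doing this at the level of GPU SRAM/HBM transfers, which NumPy cannot express:

        import numpy as np

        def tiled_attention(q, k, v, block=64):
            """Exact softmax attention computed block-by-block over keys."""
            n, d = q.shape
            m = np.full(n, -np.inf)          # running row-wise max of scores
            l = np.zeros(n)                  # running softmax normalizer
            acc = np.zeros((n, v.shape[1]))  # running unnormalized output
            for s in range(0, k.shape[0], block):
                kb, vb = k[s:s + block], v[s:s + block]
                scores = q @ kb.T / np.sqrt(d)
                m_new = np.maximum(m, scores.max(axis=1))
                corr = np.exp(m - m_new)            # rescale previous partials
                p = np.exp(scores - m_new[:, None])
                l = l * corr + p.sum(axis=1)
                acc = acc * corr[:, None] + p @ vb
                m = m_new
            return acc / l[:, None]

        q = np.random.randn(128, 32); k = np.random.randn(256, 32); v = np.random.randn(256, 32)
        ref = np.exp(q @ k.T / np.sqrt(32)); ref /= ref.sum(1, keepdims=True)
        assert np.allclose(tiled_attention(q, k, v), ref @ v)

    The assertion checks that the blocked computation is exact, not approximate, which is the point of the paper's title.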
    [D] Training audio for ASR in the face of extreme speaker/device imbalance
    Hi, hope all is well. So, basically, I've recently transitioned into an audio-related field and was wondering how people handle several issues regarding machine learning models for audio. The first issue is speaker imbalance. The datasets that I've been working with usually have several prominent speakers that make up the majority of the dataset, and many speakers that contribute, say, less than 1% to the dataset. Assuming an ASR (speech recognition) task, should I keep the same proportions, or should I apply various techniques to balance out the dataset? I've heard that using zero-shot TTS (text to speech) to augment minority speakers with synthetic samples is kind of valuable, but I'm not familiar with other modern approaches for balancing (specifically for audio, I suppose). I guess the same goes for different devices. The various datasets are recorded with various different devices, and, you guessed it, most recordings come from a small subset of them. Thanks for your responses in advance :) submitted by /u/Slowai [link] [comments]  ( 1 min )
    [D] Machine learning "for good"?
    I am finishing a Ph.D. in NLP/ML and starting to think about where to go next. I will likely start with an internship / some work in a big company (FAANG or similar, I have already passed some interviews) but if I think of spending my life working to ultimately increase click rates on products/ads I really get depressed. I am extremely attracted to companies with a high technical level but I also need to have the sensation of doing something meaningful (this does not necessarily mean working for an NGO, but there are many open problems worth solving, I believe). I just do not understand why FAANG can really make things work most of the time with the ultimate goal of making billionaires richer, while companies working on important problems are unorganized and ultimately often come up with solutions that are probably technically very far from the optimum. Not that I believe ML is gold and we will get AGI in 5 years (I see the whole marketing bubble) but if Uber can optimize their taxis, you can optimize streets to minimize traffic, or improve the 911 strategy. So (rant over) do you know of any company that deals with an "important"/real problem and has a high technical level? For example anything in health, education, public whatever, NGOs, poverty prevention, etc. submitted by /u/ombelicoInfinito [link] [comments]  ( 6 min )
    [D] mlflow vs kubeflow vs cortex vs Argo from a data scientist perspective?
    Hi, I'm looking for a data scientist workflow solution - all the way from experimentation to deployment. But very oriented toward a data scientist - which means a very dashboardy out-of-the-box experience. They are probably not interested in creating some custom flows using an SDK or something. Deployment and drift tracking would be nice to have... but not mandatory. I hear a lot about Kubeflow... but then people seem to hate it as well. MLflow is nice, but it seems you need to orchestrate it yourself (or deploy it on k8s anyway). Which one are you using at your company for your work? Not for personal projects... but a true org-level experience in production? submitted by /u/sandys1 [link] [comments]  ( 1 min )
    [D] Is the KDD 2022 conference worth the $1100 price tag?
    The early bird cost is around $1,100. I work in industry, so I'm not a researcher, and I was looking to attend some conferences to expand my network. Conferences like ICML or NIPS seem way too theoretical and the people involved are on a completely different level compared to what I do day-to-day. I build xgboost models mostly and occasionally use some standard deep learning methods from blog posts for my work. I was hoping to attend conferences where I can connect with people more on this level instead of actual scientists who use game theory to solve eigenvectors (more power to them but I have little in common with these NASA types). Is KDD the right conference for me? I'm open to any other suggestions as well. submitted by /u/sybar142857 [link] [comments]  ( 2 min )
    [D] Has anyone trained static word embeddings like fastText on a multilingual corpus, similar to XLM-R or mBERT?
    Training contextual (BERT-style) models on multilingual data seems pretty standard nowadays (XLM-R, mBERT, many more), but I could not find many resources on training static word embeddings like fastText on a multilingual corpus (simply monolingual data from many languages concatenated together). For static embeddings, I mostly see people aligning the embedding spaces of monolingual embeddings after they were trained. Has anyone tried this or knows some papers where it was tried? I'm curious if it would work or if these bigger types of models are necessary to pull it off. submitted by /u/optimized-adam [link] [comments]  ( 1 min )
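    Mechanically, nothing stops you from doing the naive thing: concatenate tokenized sentences from all languages and train one fastText model over the union (gensim sketch below, with toy corpora; whether the resulting space is usefully aligned across languages, rather than clustered by language, is exactly the open question):

        from gensim.models import FastText

        # hypothetical pre-tokenized corpora, one list of sentences per language
        corpus_en = [["the", "cat", "sat"], ["dogs", "bark"]]
        corpus_de = [["die", "katze", "sass"], ["hunde", "bellen"]]
        corpus = corpus_en + corpus_de          # simple concatenation

        model = FastText(
            sentences=corpus,
            vector_size=100,   # embedding dimension
            sg=1,              # skip-gram
            min_count=1,
            min_n=3, max_n=6,  # character n-gram range (fastText's key feature)
            epochs=10,
        )
        print(model.wv["katze"][:5])

    The character n-grams give some cross-lingual sharing for related languages (shared subwords), which is one reason this naive setup is worth testing before resorting to post-hoc alignment.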
    [D]:- NLG: Does it make sense to add a discriminator model on top of pre-trained language generation models like GPT, XLNet, etc?
    Background: I have a dataset of essays and some classes associated with them. The dataset is initially imbalanced, and collecting more data is not an option, so I was looking into text generation methods. I am trying to generate text samples that are almost identical to the original samples. Provided I add keywords and classes as inputs to the generative model, is it necessary to add a discriminative model that predicts whether a sample is real or fake? submitted by /u/Creative_Jellyfish53 [link] [comments]  ( 1 min )
    [R] Multi-Game Decision Transformers
    Blog: https://sites.google.com/view/multi-game-transformers Paper: https://arxiv.org/pdf/2205.15241.pdf Clarifies quite a lot of findings of GATO in a neat way. Scale helps (as always ;)), and transfer learning capabilities are evident: "... We hence devise our own evaluation setup by pretraining DT, CQL, CPC, BERT, and ACL on the full datasets of the 41 training games with 50M steps each, and fine-tuning one model per held-out game using 1% (500k steps) from each game..." It also appears that adding more data, whether expert or non-expert, still allows DT to gain the edge over behavioral cloning + expert data. It also achieves superhuman performance across 41 games, so catastrophic forgetting seems less relevant and is perhaps alleviated by scaling alone... I hope the next paper explores MoEs; they've been quite underappreciated lately. submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 1 min )
    [D] Do you need to grind leetcode for junior or mid level AI/ML Engineer interviews?
    Are AI/ML interviews centred on DSA problem solving found on the well-known websites without much emphasis on specific AI/ML theory or would they feature Pytorch/TensorFlow model creation/training problems? How much emphasis is placed on relevant experience and past projects? submitted by /u/redxammer [link] [comments]  ( 4 min )
    [D] How to measure metric consistency/stability
    Suppose I have some model, a metric and a test set. Now I get some metric value for the whole set. I'm interested in understanding the metric distribution across different "areas" of the test set. For example, suppose I have a sentiment analysis model (Negative, Neutral, Positive) and I use accuracy for the metric (suppose the classes are balanced). Also I have a test set of 2 years of twitter posts (2021-2022). Now I may get an overall accuracy of 80%, but if I look inside, it may be the case that I have 70% during 2021 and 90% during 2022. Or I may have 80% accuracy overall, but for "politics" posts I have an accuracy of 60%. So I want to measure the noise of my metric across some dimensions (not random). Is there any standard way to do so? Or should I just break my data into different dimensions and look at the mean and std? Thanks submitted by /u/sudo_su_ [link] [comments]  ( 1 min )
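    A simple, standard starting point: compute the metric per slice and attach a bootstrap standard error to each slice, so you can tell real degradation from small-sample noise. A pandas/NumPy sketch with hypothetical column names ("pred", "label", and a slicing column such as "year" or "topic"):

        import numpy as np
        import pandas as pd

        def slice_accuracy(df, by, n_boot=1000, seed=0):
            """Per-slice accuracy with bootstrap standard errors."""
            rng = np.random.default_rng(seed)
            rows = []
            for key, g in df.groupby(by):
                correct = (g["pred"] == g["label"]).to_numpy()
                boots = [rng.choice(correct, size=len(correct)).mean()
                         for _ in range(n_boot)]
                rows.append({by: key, "n": len(g),
                             "acc": correct.mean(), "boot_std": np.std(boots)})
            return pd.DataFrame(rows)

        # e.g. slice_accuracy(test_df, by="year") or slice_accuracy(test_df, by="topic")

    Slices whose accuracy differs from the overall number by much more than their bootstrap std are the ones worth investigating.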
    [R] BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis
    https://arxiv.org/abs/2205.14807 submitted by /u/scallion000 [link] [comments]  ( 1 min )
    [Project] Predicting customer purchase. One or multiple models?
    Hi all, I am looking for your inputs for a current project in my team within ML/DS. Would appreciate your answers! TLDR; Questions at the bottom. Trying to develop a model. Is it better to have one model for each segment, or one to rule them all? In my team, we are trying to predict whether a customer will buy a device in the next month. We call it the propensity-to-buy model. At first, we were passing all the available features that are from the 1st of the month as input to the model, with the label being a binary variable indicating whether someone made a purchase anytime between the 1st and 31st. Features include demographics (which I think are useless to be honest), tenure, current subscribed product, etc. Most of these are static, while a few are dynamic each month. This mo…  ( 4 min )
    [D] Why arxiv-sanity isn't working?
    arxiv-sanity-lite.com (ASL) doesn't even show the required paper on the first or second page. The 'Similar' option at ASL works well, but 'show similar' on arxiv-sanity is way better. Connected Papers, Zeta Alpha, and Research Rabbit are other options, but I still like arxiv-sanity. submitted by /u/SAbdusSamad [link] [comments]
    [D] How could one show that a modification to a model yields better convergence?
    Hi, I am thinking about showing that a modification I made to a GAN yields better training convergence. How can I show this? Is there some way I can show this mathematically or experimentally? What is your opinion on this? submitted by /u/SeucheAchat9115 [link] [comments]  ( 1 min )
    [D] How to limit the label value while avoiding introducing bias?
    I have a deep neural net model with an integer label to predict. The label is heavily skewed, so we cap the labels at some value (let's say the 90th percentile). Now when we build and run the model, it performs well in general. But an online experiment shows degradation in business metrics for the fraction of clients that have their value capped. If we don't cap the label, the business metrics get skewed for users with a low number of activities. What are my best options to deal with such an issue? Adding a new feature? Multi-tower learning? Any idea can be super helpful. Thanks. submitted by /u/Which-Distance1384 [link] [comments]  ( 1 min )
    [R] Detecting danger in gridworlds using Gromov's Link Condition
    submitted by /u/tfburns [link] [comments]  ( 2 min )
    [R] Question: Say, I have a sparse vector and I want to reduce the vector to a few descriptive values. What values would you choose?
    I have tried min, max, mean, standard deviation, and sum. What other descriptive statistics would you choose? I would be very, very grateful if your answer is detailed or cited with detailed references/papers. Many thanks. submitted by /u/lal-mohan [link] [comments]  ( 1 min )
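    Beyond the five already listed, sparsity-aware summaries are often the most informative for sparse vectors: the number and fraction of nonzeros, stats restricted to the nonzero entries, norms, and a few quantiles. A quick NumPy sketch (the stat selection is a suggestion, not an exhaustive list):

        import numpy as np

        def describe_sparse(v):
            nz = v[v != 0]
            return {
                "nnz": nz.size,                        # number of nonzeros
                "density": nz.size / v.size,           # fraction of nonzeros
                "mean_nz": nz.mean() if nz.size else 0.0,
                "std_nz": nz.std() if nz.size else 0.0,
                "l1": np.abs(v).sum(),
                "l2": np.linalg.norm(v),
                "quantiles": np.percentile(v, [25, 50, 75]).tolist(),
            }

        v = np.zeros(1000); v[::50] = np.random.randn(20)
        print(describe_sparse(v))

    Separating nonzero stats from whole-vector stats matters because a sea of zeros drags the plain mean and std toward zero regardless of what the active entries look like.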
    [D] Neural theorem prover and logical reasoning connection
    I have seen works based on neural theorem provers, like End-to-End Differentiable Proving. I am unable to grasp the idea behind theorem proving for logical reasoning tasks. How can proving a theorem help us in solving a problem like: given a situation and rules, what to do to achieve the goal? This paper tried to address this type of logical reasoning by combining neural networks with symbolic reasoning, but I feel lost while reading the approach, which views the whole problem setup as theorem proving or proof generation. Please share if you have an idea about theorem provers and their use case in logical reasoning. submitted by /u/projekt_treadstone [link] [comments]  ( 1 min )
    [P] A newsletter for bite-size content about ML/NLP
    We have created a newsletter, MLnotes, that provides bite-size content about machine learning and NLP tips, interviews, and applications across various industries. The goal is not to create very long tutorial-like content but to keep it fairly short. Therefore, it's good if you're trying to study or refresh your knowledge of ML/NLP. Even if you have no background, you'll get the clues to get started. I'd appreciate it if you have any comments or suggestions. submitted by /u/ma1ms [link] [comments]  ( 1 min )
  • Open

    "Towards Learning Universal Hyperparameter Optimizers with Transformers", Chen et al 2022 {G} (Decision Transformer?)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Multi-Agent Reinforcement Learning is a Sequence Modeling Problem", Wen et al 2022 (Decision Transformer for MARL: interleave agent choices)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    How do you stay up to date in Reinforcement Learning research?
    Besides following the right companies/people on Twitter and this subreddit, how do you stay up to date on what is going on in deep/reinforcement learning research? What journals to follow, what conferences to attend? I'll leave here a few options, but I would like to know more. - Twitter (for general news, not much for discussions): DeepMind, OpenAI, Hugging Face, Yann LeCun, Ian Goodfellow, François Chollet, Fei-Fei Li, Andrej Karpathy... - Conferences: ICLR, NeurIPS, ICML, IEEE SaTML, AAAI, AISTATS, AAMAS, COLT... - Eventually, search your favorite researchers/topics on arXiv.org Any podcasts or anything else? submitted by /u/TheKeyZero [link] [comments]  ( 1 min )
    Is it possible to design a multiplicative/exponential reward function? A reward func that depends on current accumulated reward?
    Hey everyone, in the context of my problem, the "true" reward is not additive. Realistically, the more reward the agent has already accumulated, the easier it becomes to accumulate even more. That is to say, the real reward function is partially dependent on previously accumulated reward. Is there any way to implement this kind of dynamic successfully? I have tried to, but for some reason the agent completely stops learning when I do this. I can implement a linear/additive reward function and the agent does learn good behaviors, but I feel that it's important for the agent to "understand" the true reward dynamic. Essentially, here is the reward function I have: reward = points_gained_this_step. And here's the kind of reward that I want (because it actually fully represents the problem): reward = points_gained_this_step * total_score_so_far; total_score_so_far = total_score_so_far + reward. Does anyone have experience implementing something like this successfully? Any advice would be greatly appreciated. EDIT: Do vanishing/exploding gradients have anything to do with it? (Given the exponential growth/decay nature of the rewards in this case.) EDIT 2: The "total_score_so_far" is already in my observation space submitted by /u/VladimirB-98 [link] [comments]  ( 2 min )
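    One way to express this is a reward wrapper that tracks the running score and multiplies each step's points by it; and since the resulting returns grow multiplicatively, rewarding the change in log total score is a common way to keep per-step magnitudes bounded for the learner (which speaks directly to the exploding-returns concern in the EDIT). A sketch against the classic gym 4-tuple step API, wrapping a hypothetical env whose raw reward is points gained:

        import gym
        import numpy as np

        class MultiplicativeReward(gym.Wrapper):
            def reset(self, **kwargs):
                # start at 1.0, not 0.0: with total = 0 the multiplicative
                # reward would stay zero forever under the poster's formula
                self.total = 1.0
                return self.env.reset(**kwargs)

            def step(self, action):
                obs, points, done, info = self.env.step(action)  # assumes points >= 0
                raw = points * self.total          # reward scales with total so far
                self.total += raw
                # reward the *change in log total*: these telescope to
                # log(final total), keeping each step's reward bounded
                shaped = np.log(self.total) - np.log(self.total - raw)
                return obs, shaped, done, info

    Optimizing the sum of log-increments is equivalent (up to the log's monotonicity) to optimizing the multiplicative total, without feeding the agent exponentially growing targets.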
    "Multi-Game Decision Transformers", Lee et al 2022 {G} (ALE Decision Transformer/Gato: near-human offline single-agent w/scaling & rapid transfer)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    How to generate coordinates in 3D space for a tracking problem with drones?
    Hi, I am working on combining deep RL with model predictive control for safe RL in quadrotors. I am using the DDPG method to train a quadrotor for a tracking problem. I have many types of reward functions, but with each one of them the drone behaves erratically during test time. I think it may be due to the random 3D points I generated while training the agent. Some research papers used proper trajectory data (like circular or spiral trajectories) for their tracking problem. I am really confused about how to approach this, because if I use only one trajectory, the agent will not be able to track other trajectories and also will not be robust to disturbances. So I wanted to know what you think the solution to this problem is. How do I generate 3D points for different types of trajectories? (Python code for generating points would also be very helpful.) submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
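    Since the post asks for code: a standard trick is to sample trajectory families (circle, helix/spiral, etc.) with randomized parameters each episode, so the agent sees varied but smooth references rather than arbitrary random points. A NumPy sketch (parameter ranges are arbitrary placeholders to tune for your arena size):

        import numpy as np

        def circle(n=200, r=1.0, z=1.0):
            t = np.linspace(0, 2 * np.pi, n)
            return np.stack([r * np.cos(t), r * np.sin(t), np.full(n, z)], axis=1)

        def helix(n=200, r=1.0, pitch=0.5, turns=3):
            t = np.linspace(0, 2 * np.pi * turns, n)
            return np.stack([r * np.cos(t), r * np.sin(t), pitch * t / (2 * np.pi)], axis=1)

        def random_trajectory(rng):
            """Sample a trajectory family with randomized parameters per episode."""
            if rng.random() < 0.5:
                return circle(r=rng.uniform(0.5, 2.0), z=rng.uniform(0.5, 2.0))
            return helix(r=rng.uniform(0.5, 2.0), pitch=rng.uniform(0.2, 1.0))

        rng = np.random.default_rng(0)
        waypoints = random_trajectory(rng)      # (n, 3) array of x, y, z targets

    Randomizing within smooth families gives the generalization you want across trajectories while keeping consecutive targets physically reachable, which random isolated points do not guarantee.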
    PhD student looking to identify a research topic in RL for controls applications.
    Hello, I have been reading through quite a few papers/topics discussing model-free vs model-based RL, etc. I have not been able to find something yet; maybe I don't understand it to that extent yet :). Just for background: my experience is with diesel and SI engines, vehicles, and controls. One of the topics/areas that seems interesting to me is learning with RL in uncertain scenarios; this might seem too broad to most people. Another possible area would be RL for connected vehicles, self-driving, etc. Any help/suggestion is welcome. submitted by /u/SeasonedLeo [link] [comments]  ( 2 min )
    Question about reward function evaluation
    Is the paper Quantifying Differences in Reward Functions enough to learn how to evaluate reward functions? Evaluating reward functions is kind of new to me. submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    Best maze RL-Solver
    Hey guys, I am looking for the best-performing RL algorithm for solving mazes. Off the top of my head, I'd go with Q-learning or SARSA(lambda). What do you guys think? I can't really find literature specifically on that. Maze: simple 2D deterministic grid world submitted by /u/Kartoffelman98 [link] [comments]  ( 1 min )
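    For a small deterministic grid maze, tabular Q-learning is hard to beat and fits in a page. A minimal sketch, assuming the maze is a 2-D array with 1 = wall and start/goal given as (row, col) tuples (hyperparameters are conventional defaults, not tuned):

        import numpy as np

        MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right

        def q_learn(maze, start, goal, episodes=2000, alpha=0.5, gamma=0.99, eps=0.1):
            """Tabular Q-learning on a deterministic grid maze (1 = wall)."""
            H, W = maze.shape
            Q = np.zeros((H, W, 4))
            rng = np.random.default_rng(0)
            for _ in range(episodes):
                s = start
                for _ in range(500):                      # step cap per episode
                    a = int(rng.integers(4)) if rng.random() < eps else int(Q[s].argmax())
                    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
                    if not (0 <= r < H and 0 <= c < W) or maze[r, c] == 1:
                        r, c = s                          # bumped a wall: stay put
                    reward = 1.0 if (r, c) == goal else -0.01  # small step penalty
                    Q[s][a] += alpha * (reward + gamma * Q[r, c].max() - Q[s][a])
                    s = (r, c)
                    if s == goal:
                        break
            return Q   # greedy policy: Q[state].argmax() at each cell

    The small negative step reward encourages shortest paths; SARSA(lambda) mainly helps when credit needs to propagate faster along long corridors.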
    How Good is Udacity Deep Learning Nanodegree?
    submitted by /u/MlTut [link] [comments]  ( 1 min )
    SOTA of RL in precise motion control of robot
    Hi, when training an agent and evaluating the trained agent, I have realized that the agent tends to show slightly different behavior/performance even if the goal remains the same. I believe this is due to the stochastic nature of RL. But how can such an agent then be transferred to reality when the goal is, for example, the precise control of a robot? Are you aware of any RL work that deals with real robots for precise motion control? (For instance, precisely placing the robot's tool at the goal position.) submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
    Inverse Reinforcement Learning: Continuous Action-State Space
    I am pretty new to reinforcement learning. I am currently working on a project where I need to find the reward function of gamblers' behavior. I have a dataset of players playing on a casino machine, which stores each transaction performed by the user and related information. I can take the action to be how much the user has bid; for the state, I was thinking of using the transaction time and the amount in the machine at each transaction. I couldn't find many implementations on the internet for a related problem. I was planning to use MaxEnt IRL or PL-IRL, but have no clue how to implement these algorithms for my problem. Could someone help me with how I should approach this problem or suggest a way in which I can implement these models? submitted by /u/gjariwala9 [link] [comments]  ( 1 min )
  • Open

    Seamlessly connect Amazon Athena with Amazon Lookout for Metrics to detect anomalies
    Amazon Lookout for Metrics is an AWS service that uses machine learning (ML) to automatically monitor the metrics that are most important to businesses with greater speed and accuracy. The service also makes it easier to diagnose the root cause of anomalies, such as unexpected dips in revenue, high rates of abandoned shopping carts, spikes […]  ( 7 min )
  • Open

    Introducing the Data Product Development Canvas (Version 1.0)
    I think one of the most important data consumption developments in the age of Big Data and AI is the concept of a Data Product. Data Products are a category of domain-infused, AI/ML-powered apps designed to help non-technical users manage data and analytics-intensive operations to achieve specific, meaningful, and relevant business outcomes. Some key aspects… Read More »Introducing the Data Product Development Canvas (Version 1.0) The post Introducing the Data Product Development Canvas (Version 1.0) appeared first on Data Science Central.  ( 6 min )
  • Open

    The Closer: Machine Learning Helps Banks, Buyers Finalize Real Estate Transactions
    The home-buying process can feel like an obstacle course — finding the perfect place, putting together an offer and, the biggest hurdle of all, securing a mortgage. San Francisco-based real-estate technology company Doma is helping prospective homeowners clear that hurdle more quickly with the support of AI. Its machine learning models accelerate properties through the Read article > The post The Closer: Machine Learning Helps Banks, Buyers Finalize Real Estate Transactions appeared first on NVIDIA Blog.  ( 3 min )
    Fantastical 3D Creatures Roar to Life ‘In the NVIDIA Studio’ With Artist Massimo Righi
    The year of the tiger comes into focus this week In the NVIDIA Studio, which welcomes 3D creature artist Massimo Righi. An award-winning 3D artist with two decades of experience in the film industry, Righi has received multiple artist-of-the-month accolades and features in top creative publications. The post Fantastical 3D Creatures Roar to Life ‘In the NVIDIA Studio’ With Artist Massimo Righi appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    DoWhy evolves to independent PyWhy model to help causal inference grow
    Identifying causal effects is an integral part of scientific inquiry. It helps us understand everything from educational outcomes to the effects of social policies to risk factors for diseases. Questions of cause-and-effect are also critical for the design and data-driven evaluation of many technological systems we build today.  To help data scientists better understand and […] The post DoWhy evolves to independent PyWhy model to help causal inference grow appeared first on Microsoft Research.  ( 7 min )
  • Open

    How does AI Help Doctors, Physicians, and Healthcare Setups?
    Despite the more vocal impacts, Artificial Intelligence hasn’t had a seamless entrance across verticals. Yet, AI-based resources in…  ( 3 min )
    The Mind of Machine and Men
    How “Can’t Help Myself” helps us understand consciousness and empathy Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 6 min )
    The Many Implications of Human Cloning
    “Clones are organisms that are exact genetic copies of any living organism. Every single bit of their DNA is identical.”  ( 5 min )

  • Open

    How do I save RAM usage using the Breakout dataset from gin
    I tried to import the episodes I needed (1,000,000 stacks of 4 84x84 grayscale images) using d4rl-atari, but that blows past the RAM limitations I have (16GB). Has anyone worked with this dataset who knows how I can save on RAM? submitted by /u/PM_ME_FREE_GAMES [link] [comments]  ( 1 min )
    "Multitasking Inhibits Semantic Drift", Jacob et al 2021
    submitted by /u/gwern [link] [comments]
    Why do I love Reinforcement Learning?
    Just a generic read; please leave your thoughts/feedback. https://medium.com/@vishalgarg652/why-do-i-love-reinforcement-learning-5c0de2abf7e4 submitted by /u/vishalgarg652 [link] [comments]  ( 1 min )
  • Open

    Last Week in AI: Royal Mail will deliver with drones, backdoors in deep learning, AI for weather forecasts, and more!
    submitted by /u/regalalgorithm [link] [comments]
    The Ultimate Disco Diffusion Tutorial ... EVERY Feature ... EXPLAINED
    submitted by /u/JoshGrambo [link] [comments]
    AI Dream 53 - Cosmic Birth | MASTERPIECE FINAL RAW
    submitted by /u/LordPewPew777 [link] [comments]
    The Limits of Automation
    submitted by /u/glaringconstraint [link] [comments]
    Can self-learning AI software be created as a hardware emulator that automatically replicates any detected computer hardware?
    I was just thinking: is it possible to have a server with self-learning AI software that functions like a hardware emulator? The AI would automatically detect any hardware, replicate it, save the information, and store it for later use. submitted by /u/PlankOfWoood [link] [comments]  ( 1 min )
    No, GPT-3 Cannot Click On Links
    submitted by /u/drcopus [link] [comments]  ( 1 min )
    Fractal - 4K AI Art Visualization
    submitted by /u/MLInsights [link] [comments]
    Learning path to become an AI engineer?
    From my understanding, the learning path should be: Data Engineer (good knowledge of SQL, understanding of ETL, big data analytics tools such as Cassandra, Hive, Apache Spark); Data Analyst (exploratory data analysis - PowerBI); Data Scientist / AI Engineer (ML algos, DL algos, business use cases for NLP, CV). Where I stand today: I have a bachelor's degree in computer science and decent knowledge of web development (HTML, CSS, JS), cloud computing (AWS) and DevOps (Git, Docker, Kubernetes). Pertaining to the skills required for data science & AI: Math: I am good at it. SQL: I have a decent enough understanding of SQL, MSSQL and MySQL, but I haven't tried big data tools and also lack interest in those. PowerBI: I am good at PowerBI - DAX, nice layouts, good design sense but not ve…  ( 2 min )
    Hugging Face Endpoints on Azure | Rubik's Code
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Will you let artificial intelligence take your job?
    Direct answers only. submitted by /u/trillswan [link] [comments]  ( 1 min )
    Great video...The AI revolution is on its way...new industrial revolution coming soon...
    submitted by /u/the_anonymizer [link] [comments]
    Tesseract - 4K Neural Art Visualization
    submitted by /u/MLInsights [link] [comments]
  • Open

    A step forward in CIKM 2022 review process [D]
    As the title says, the email to reviewers explicitly states that there are no preconceived limits on the "acceptance rate"; please just do a good job reviewing. How this statement holds up in the final decisions, who knows, but it is a good step forward. submitted by /u/not_novel_enough [link] [comments]  ( 1 min )
    [D] Experience with Vertex AI?
    Have you used Google's Vertex AI for model deployment and management in production? If so, what is your experience with it for maintaining ML pipelines? We are considering it at my work and was curious to know about the difficulties and challenges. Feel free to share your thoughts on other frameworks/tools as well. Thanks in advance. submitted by /u/zxqkv [link] [comments]  ( 1 min )
    "[Discussion]", "[D]" New Data Science Interview Reddit Community
    Hi All, I started a new Reddit community for people to share data science interview experiences and tips at https://www.reddit.com/r/DataScienceInterview/. submitted by /u/PythonDataScientist [link] [comments]
    [D] Datasets and Models for Structured Information Extraction on HTML
    I like the idea of basically summarizing an HTML page into the markup schema one is interested in. So I recently stumbled upon the WebFormer paper (https://arxiv.org/pdf/2202.00217v1.pdf) and wanted to try it out for different categories of structured information. As often, neither the weights nor the code is available. When it comes to datasets, it seems like there are not many options. The SWDE dataset used in the paper is not available anymore. I wasn't able to find any other appropriate dataset with HTML to markup. Web Data Commons provides a subset of Common Crawl with sites that have markup data, but I still need to investigate this further. When it comes to models, I would love to start with some pre-trained models as a baseline, as I want to investigate how little data I can start with to see any reasonable results with fine-tuning. Now the question is, what model would be best for HTML-based tokens? I would love to hear your suggestions for models and datasets to approach this problem. submitted by /u/theamaru [link] [comments]  ( 1 min )
    [D] Finding an optimal threshold in multi-class classification problem
    For binary classification problems, finding a probability threshold involves trying different threshold values and using the one that yields the highest accuracy/precision/recall, depending on the most relevant metric for the problem. Considering that multi-class classification algorithms output N probabilities, does it make sense to have a common threshold for all classes, or to treat each class individually and get N different thresholds? If so, how does this scale with a high number of classes? Are there better techniques than these? A few pointers about the use case: number of classes >10000; the cost of misclassification is equal for all classes; I also want to be able to output "unidentified" if no class probability crosses the respective threshold. submitted by /u/BuddhaSadhu666 [link] [comments]  ( 2 min )
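    One common pattern with very many classes is a single global reject threshold first, with per-class thresholds introduced only for classes that have enough validation data to estimate one reliably (e.g., tuning each class's cutoff one-vs-rest). Mechanically, applying a threshold vector is cheap either way; a NumPy sketch with hypothetical inputs:

        import numpy as np

        def predict_with_thresholds(probs, thresholds, classes):
            """probs: (n, K) class probabilities; thresholds: (K,) per-class cutoffs."""
            best = probs.argmax(axis=1)
            passed = probs[np.arange(len(probs)), best] >= thresholds[best]
            return np.where(passed, classes[best], -1)   # -1 = "unidentified"

        K = 5
        probs = np.random.dirichlet(np.ones(K), size=10)
        classes = np.arange(K)
        thresholds = np.full(K, 0.5)    # a global threshold; any (K,) vector works
        print(predict_with_thresholds(probs, thresholds, classes))

    Since only the predicted class's threshold is consulted per example, the cost is O(n) regardless of K, so scaling to >10000 classes is not the bottleneck; estimating 10000 reliable thresholds is.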
    [D] Comparison of TPU vs GPU for fast & cheap inference in 2022?
    I'm trying to understand how TPUs and GPUs compare for inference (not training!), in terms of (a) financial cost and (b) speed. Does anyone know the answer, or could anyone point me towards some blog post with the answer? Many of the resources I've found are sadly 2-4 years out of date, and I'd ideally like a more recent, authoritative answer. submitted by /u/RSchaeffer [link] [comments]  ( 1 min )
    [P] Dataset for online news discussions summarization
    I'm currently working on a project on abstractive summarization of online news discussions. I have found in the literature two relevant datasets for this task: a subset of the conversation benchmark dataset from ConvoSumm (2021), and SENSEI. Both include comments relating to a news article, as well as a human-curated summary of the form "Some commenters noted that... One commenter argued...". Since both contain a limited number of examples (500 and 18 respectively), I was curious if you were aware of a similar available dataset in the literature. Moreover, I'm tackling this task from a multi-document summarization perspective, so any datasets from that area that could serve as an auxiliary learning task would be welcome. Thank you in advance! submitted by /u/Kounelly94 [link] [comments]  ( 1 min )
    [P] Generate cryptographic encodings using GANs
    Hi guys, I was working on a cryptographic problem for which I need to make a GAN learn how to generate RSA encodings. Example input: b'GxHqH5c3fB/TEQispLvYByl5iPmiYLFq2ZyDQqfNbOpt4UOUWOOI4ZyAd0dWHSdTBhGmf8Psa9Ivo6tEbtAu1BZKIE1m1FwGL38FO6HgmTQnpeoJTqPaadq9wdax7TF3XZJFBeqNRChIkePcEt3yComXEKA8gOY2FnlSFo8jTRY=' Example output: b'CRMV1iuaCpuQgxctqob1CIoUbm3twa85Ahi8HKrm7gZjXS3NA5rW60/XNn5uWZ1OAUBXQAIn3e7m3s1Bs6kOV7MjQyqcf5M8KPx+TtifUbNgO8ENUboCqOXOv3/aLwaAbNvRaCucOox6sh70rWXKUzVD6jKZZ7sBWC2VDnklO7Y=' My main goal is to make the model fit the training data well while also being able to generate new encodings for unseen data. Is this even possible? If yes, what kind of GAN should I use? Also, what would be a good generator and discriminator network in this case? Any help would be appreciated. Thanks. submitted by /u/NoAct7818 [link] [comments]  ( 2 min )
    [R] LayoutLM Word-patch alignment pre-training
    Hi, so LayoutLM v3 was pretrained to predict, for each unmasked text token, whether the corresponding visual tokens were masked or not, but they exclude the masked text tokens from this task. Does anyone understand why they do that? Thanks in advance submitted by /u/Meddhouib10 [link] [comments]  ( 1 min )
    [D] What do you value in a paper replication?
    Context: Recently read a paper from a few years back that I thought was pretty cool. Ended up replicating the implementation on github, because (1) I believe the idea should be made more accessible, and (2) as good old fashioned practice. Throughout the time spent working on it, replicating training results was dead last in priority, and I nearly forgot about it before considering the exercise complete. Thus my curiosity: r/MachineLearning, what do you value in a paper replication? P.S.: Might as well link the repo while I'm here. Happy to hear any feedback! submitted by /u/coffee869 [link] [comments]  ( 3 min )
    [D] Build, train and track your ML models with a few lines of code
    Today, Layer goes open source to make machine learning more accessible and contribute to ML's growth and evolution. Machine learning is becoming the default way to build technology. It's how you make your apps smarter, your systems more reliable, and your business smarter. This is made possible largely by open-science efforts, from open-source ML frameworks to open datasets. We will open-source more, including our roadmap. Meanwhile, check out our repo, and don't forget to give us a star! https://github.com/layerai/sdk submitted by /u/mwitiderrick [link] [comments]  ( 1 min )
    [R] Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power
    submitted by /u/hardmaru [link] [comments]  ( 1 min )
    [D] Maintaining documentation with live results from experiments
    Machine Learning projects are heavily data-driven, so documentation needs to contain the outcome of the experiments as well. While I stumbled upon many tools to keep track of experiments in ML (e.g., MLFlow, W&B, etc.), I couldn't find anything to keep documentation of all the experiments with results without having to manually copy-paste. Ideally, I should be able to maintain a live page with results that the entire team can see. What do you all use for documentation purposes? submitted by /u/mighty-dude [link] [comments]  ( 1 min )
    [D] using git or other tools to manage models
    Trying to understand what's the best way to keep track of all the models I'm generating. Playing with git lfs but it seems complicated and not always a perfect match for the problem. I suspect at larger companies there's database support for this, probably associated with an S3 type data store. Looking for ideas about how to do this as an individual developer without too much trouble. submitted by /u/danbmil99 [link] [comments]  ( 1 min )
  • Open

    Conspicuously missing data
    I was working on a report for a client this afternoon when I remembered this comic from Spiked Math. I needed to illustrate the point that revealing information about one person or group can reveal information on other people or other groups. If you give your genetic information to a company, for example, you also […] Conspicuously missing data first appeared on John D. Cook.  ( 1 min )
  • Open

    How Big Data Can Transform Talent Management
    The world is facing a reskilling emergency. World Economic Forum highlights that over 54% of people will need to reskill and upskill themselves in the next 3 years. Technological advances are upending how work is done. As the world of work changes, the role of the human resource department will no longer be limited to… Read More »How Big Data Can Transform Talent Management The post How Big Data Can Transform Talent Management appeared first on Data Science Central.  ( 4 min )
    How IoT Technology Is Improving the Future of Transportation
    IoT stands for ‘Internet of things’ and can be explained as the network of interconnected devices that share data over the communication networks resembling the spider’s web. Everyday objects like kitchen appliances, cars, fitness devices, and smartwatches are connected to the internet by embedded devices that provide smooth communication between people, things, the environment, and… Read More »How IoT Technology Is Improving the Future of Transportation The post How IoT Technology Is Improving the Future of Transportation appeared first on Data Science Central.  ( 4 min )
    Countering Data Tech’s Cheap Speech
    Law professor Eugene Volokh was apparently the first person to popularize the term cheap speech, in an article for Yale Law Review in 1995. Recently, Law professor Rick Hasen has been promoting a new book of his own titled Cheap Speech. His definition of cheap speech expands on Volokh’s definition. Quoting directly here from a… Read More »Countering Data Tech’s Cheap Speech The post Countering Data Tech’s Cheap Speech appeared first on Data Science Central.  ( 4 min )
    Could ABBAtars be the business model for the metaverse and 5G?
    Last week, the 80s pop group ABBA performed a ‘hologram concert’ based on what they called ‘ABBAtars’. By all measures in the media, it was very successful. From a technological perspective, could it offer a ‘killer app’ for 5G and the Metaverse? Firstly, a hologram concert is not a hologram as we know it… Read More »Could ABBAtars be the business model for the metaverse and 5G? The post Could ABBAtars be the business model for the metaverse and 5G? appeared first on Data Science Central.  ( 2 min )
    Does Your Business Require AI For Dedicated Internet Access?
    Internet connectivity has few available options. So, the choice comes down to a dedicated internet connection or broadband. The choice depends on security, cost, and performance.  The speed and quality of broadband depend on the traffic that goes through the network at a given time. So, dedicated internet is the solution for businesses that need… Read More »Does Your Business Require AI For Dedicated Internet Access? The post Does Your Business Require AI For Dedicated Internet Access? appeared first on Data Science Central.  ( 5 min )
    How to Protect Your Cloud from Cyberattacks During and After Migration
    To reap all the benefits of cloud computing technology, it’s important to secure the cloud during and after migration.  ( 5 min )
    How 3 Key Ecommerce Metrics Can Inform Your Data Analysis
    Ecommerce is a cutthroat industry, and it’s only getting more competitive.  ( 5 min )
    The Technological Arms Race of Software Licensing
    Software licensing exists in the space between digital services and the real world. Often, a license will denote the number of active users permitted to use the software, as well as their ability to manipulate the provided source code. As such, it is a contract often enforced by the software itself, with checks to ensure…  ( 4 min )
  • Open

    NVIDIA Accelerates AI, Digital Twins, Quantum Computing and Edge HPC at ISC 2022
    Researchers grappling with today’s grand challenges are getting traction with accelerated computing, as showcased at ISC, Europe’s annual gathering of supercomputing experts. Some are building digital twins to simulate new energy sources. Some use AI+HPC to peer deep into the human brain. Others are taking HPC to the edge with highly sensitive instruments or accelerating…  ( 4 min )
    The Man With 100,000 Brains: AI’s Big Donation to Science
    Jorge Cardoso wears many hats, and that’s appropriate given he has so many brains. A hundred thousand of them, to be exact. Cardoso is a teacher, a CTO, an entrepreneur, a founding member of the MONAI open source consortium and a researcher in AI for medical imaging. In that last role, Cardoso and his team…  ( 3 min )
    The Road to the Hybrid Quantum-HPC Data Center Starts Here
    It’s time to start building tomorrow’s hybrid quantum computers. The motivation is compelling, the path is clear and key components for the job are available today. Quantum computing has the potential to bust through some of today’s toughest challenges, advancing everything from drug discovery to weather forecasting. In short, quantum computing will play a huge…  ( 4 min )
    Scientists Building Digital Twins in NVIDIA Omniverse to Accelerate Clean Energy Research
    As global climate change accelerates, finding and securing clean energy is a crucial challenge for many researchers, organizations and governments. The U.K.’s Atomic Energy Authority (UKAEA), through an evaluation project at the University of Manchester, has been testing the NVIDIA Omniverse simulation platform to accelerate the design and development of a full-scale fusion powerplant that…  ( 4 min )
    HPC Researchers Seed the Future of In-Network Computing With NVIDIA BlueField DPUs
    Across Europe and the U.S., HPC developers are supercharging supercomputers with the power of Arm cores and accelerators inside NVIDIA BlueField-2 DPUs. At Los Alamos National Laboratory (LANL) that work is one part of a broad, multiyear collaboration with NVIDIA that targets 30x speedups in computational multi-physics applications. LANL researchers foresee significant performance gains using…  ( 4 min )
    Hyperscale Digital Twins to Give Us “Amazing Superpowers,” NVIDIA Exec Says at ISC 2022
    Highly accurate digital representations of physical objects or systems, or “digital twins,” will enable the next era of industrial virtualization and AI, executives from NVIDIA and BMW said Tuesday. Kicking off the ISC 2022 conference in Hamburg, Germany, NVIDIA’s Rev Lebaredian, vice president for Omniverse and simulation technology, was joined by Michele Melchiorre, senior…  ( 5 min )
  • Open

    A.I. Plays Would You Rather
    submitted by /u/BasicallyJustASpider
    Hugging Face Endpoints on Azure
    submitted by /u/RubiksCodeNMZ
  • Open

    Approximating the Manifold Structure of Attributed Incentive Salience from Large Scale Behavioural Data. A Representation Learning Approach Based on Artificial Neural Networks. (arXiv:2108.01724v2 [cs.LG] UPDATED)
    Incentive salience attribution can be understood as a psychobiological mechanism ascribing relevance to potentially rewarding objects and actions. Despite being an important component of the motivational process guiding our everyday behaviour, its study in naturalistic contexts is not straightforward. Here we propose a methodology based on artificial neural networks (ANNs) for approximating latent states produced by this process in situations where large volumes of behavioural data are available but no experimental control is possible. Leveraging knowledge derived from theoretical and computational accounts of incentive salience attribution, we designed an ANN for estimating the duration and intensity of future interactions between individuals and a series of video games in a large-scale ($N> 3 \times 10^6$) longitudinal dataset. We found video games to be the ideal context for developing such a methodology due to their reliance on reward mechanics and their ability to provide ecologically robust behavioural measures at scale. When compared to competing approaches, our methodology produces representations that are better suited for predicting the intensity of future behaviour and approximating some functional properties of attributed incentive salience. We discuss our findings with reference to the adopted theoretical and computational frameworks and suggest how our methodology could be an initial step for estimating attributed incentive salience in large-scale behavioural studies.  ( 2 min )
    Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks. (arXiv:2109.08844v2 [stat.ML] UPDATED)
    We study the problem of estimating an unknown function from noisy data using shallow ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the second-order Radon-domain bounded variation space. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. This minimax rate is immune to the curse of dimensionality. We quantify an explicit gap between neural networks and linear methods (which include kernel methods) by deriving a linear minimax lower bound for the estimation problem, showing that linear methods necessarily suffer the curse of dimensionality in this function space. As a result, this paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.  ( 2 min )
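    In concrete terms, the estimator studied here is the familiar weight-decay objective on a one-hidden-layer ReLU network. A minimal sketch follows, with width, learning rate, and regularization strength as placeholder values rather than the paper's choices (note that torch's weight_decay also penalizes biases, a small deviation from the stated objective):

```python
import torch

def fit_weight_decay_relu(x, y, width=128, weight_decay=1e-3, steps=2000):
    """Minimize mean squared error + weight_decay * ||weights||^2 (mean is
    proportional to the paper's sum; it just keeps the step size scale-free)."""
    net = torch.nn.Sequential(torch.nn.Linear(x.shape[1], width),
                              torch.nn.ReLU(),
                              torch.nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=1e-2, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        ((net(x) - y) ** 2).mean().backward()
        opt.step()
    return net

# noisy observations of an unknown function
x = torch.linspace(-1, 1, 200).unsqueeze(1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)
net = fit_weight_decay_relu(x, y)
```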
    Characterizing the robustness of Bayesian adaptive experimental designs to active learning bias. (arXiv:2205.13698v1 [stat.ME])
    Bayesian adaptive experimental design is a form of active learning, which chooses samples to maximize the information they give about uncertain parameters. Prior work has shown that other forms of active learning can suffer from active learning bias, where unrepresentative sampling leads to inconsistent parameter estimates. We show that active learning bias can also afflict Bayesian adaptive experimental design, depending on model misspecification. We develop an information-theoretic measure of misspecification, and show that worse misspecification implies more severe active learning bias. At the same time, model classes incorporating more "noise" - i.e., specifying higher inherent variance in observations - suffer less from active learning bias, because their predictive distributions are likely to overlap more with the true distribution. Finally, we show how these insights apply to a (simulated) preference learning experiment.  ( 2 min )
    Towards a Unified Framework for Uncertainty-aware Nonlinear Variable Selection with Theoretical Guarantees. (arXiv:2204.07293v2 [stat.ML] UPDATED)
    We develop a simple and unified framework for nonlinear variable selection that incorporates uncertainty in the prediction function and is compatible with a wide range of machine learning models (e.g., tree ensembles, kernel methods, neural networks). In particular, for a learned nonlinear model $f(\mathbf{x})$, we consider quantifying the importance of an input variable $\mathbf{x}^j$ using the integrated partial derivative $\Psi_j = \Vert \frac{\partial}{\partial \mathbf{x}^j} f(\mathbf{x})\Vert^2_{P_\mathcal{X}}$. We then (1) provide a principled approach for quantifying variable selection uncertainty by deriving its posterior distribution, and (2) show that the approach is generalizable even to non-differentiable models such as tree ensembles. Rigorous Bayesian nonparametric theorems are derived to guarantee the posterior consistency and asymptotic uncertainty of the proposed approach. Extensive simulations and experiments on healthcare benchmark datasets confirm that the proposed algorithm outperforms existing classic and recent variable selection methods.  ( 2 min )
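    For differentiable models, the importance score $\Psi_j$ is just the squared partial derivative averaged over the input distribution, which a few lines of autograd can estimate by Monte Carlo (the paper's posterior uncertainty and tree-ensemble extensions are not reproduced here):

```python
import torch

def integrated_derivative_importance(f, X):
    """Monte Carlo estimate of Psi_j = E_x[(d f / d x_j)^2] for each input j."""
    X = X.clone().requires_grad_(True)
    grads, = torch.autograd.grad(f(X).sum(), X)  # row i holds grad f(x_i)
    return (grads ** 2).mean(dim=0)

# toy check: f(x) = 3*x_0 depends only on the first coordinate
X = torch.randn(10000, 2)
print(integrated_derivative_importance(lambda x: 3 * x[:, 0], X))  # ~[9., 0.]
```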
    Minimax Regret for Cascading Bandits. (arXiv:2203.12577v2 [cs.LG] UPDATED)
    Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where the variance-aware algorithms only improve constants. In light of this and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show its regret strictly improves the state-of-the-art.  ( 2 min )
    Benign Overparameterization in Membership Inference with Early Stopping. (arXiv:2205.14055v1 [cs.LG])
    Does a neural network's privacy have to be at odds with its accuracy? In this work, we study the effects the number of training epochs and parameters have on a neural network's vulnerability to membership inference (MI) attacks, which aim to extract potentially private information about the training data. We first demonstrate how the number of training epochs and parameters individually induce a privacy-utility trade-off: more of either improves generalization performance at the expense of lower privacy. However, remarkably, we also show that jointly tuning both can eliminate this privacy-utility trade-off. Specifically, with careful tuning of the number of training epochs, more overparameterization can increase model privacy for fixed generalization error. To better understand these phenomena theoretically, we develop a powerful new leave-one-out analysis tool to study the asymptotic behavior of linear classifiers and apply it to characterize the sample-specific loss threshold MI attack in high-dimensional logistic regression. For practitioners, we introduce a low-overhead procedure to estimate MI risk and tune the number of training epochs to guard against MI attacks.  ( 2 min )
    How Tempering Fixes Data Augmentation in Bayesian Neural Networks. (arXiv:2205.13900v1 [cs.LG])
    While Bayesian neural networks (BNNs) provide a sound and principled alternative to standard neural networks, an artificial sharpening of the posterior usually needs to be applied to reach comparable performance. This is in stark contrast to theory, dictating that given an adequate prior and a well-specified model, the untempered Bayesian posterior should achieve optimal performance. Despite the community's extensive efforts, the observed gains in performance still remain disputed with several plausible causes pointing at its origin. While data augmentation has been empirically recognized as one of the main drivers of this effect, a theoretical account of its role, on the other hand, is largely missing. In this work we identify two interlaced factors concurrently influencing the strength of the cold posterior effect, namely the correlated nature of augmentations and the degree of invariance of the employed model to such transformations. By theoretically analyzing simplified settings, we prove that tempering implicitly reduces the misspecification arising from modeling augmentations as i.i.d. data. The temperature mimics the role of the effective sample size, reflecting the gain in information provided by the augmentations. We corroborate our theoretical findings with extensive empirical evaluations, scaling to realistic BNNs. By relying on the framework of group convolutions, we experiment with models of varying inherent degree of invariance, confirming its hypothesized relationship with the optimal temperature.  ( 2 min )
    Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power. (arXiv:2205.13863v1 [cs.LG])
    It is well-known that modern neural networks are vulnerable to adversarial examples. To mitigate this problem, a series of robust learning algorithms have been proposed. However, although the robust training error can be near zero via some methods, all existing algorithms lead to a high robust generalization error. In this paper, we provide a theoretical understanding of this puzzling phenomenon from the perspective of expressive power for deep neural networks. Specifically, for binary classification problems with well-separated data, we show that, for ReLU networks, while mild over-parameterization is sufficient for high robust training accuracy, there exists a constant robust generalization gap unless the size of the neural network is exponential in the data dimension $d$. Even if the data is linearly separable, which means achieving low clean generalization error is easy, we can still prove an $\exp({\Omega}(d))$ lower bound for robust generalization. Moreover, we establish an improved upper bound of $\exp({\mathcal{O}}(k))$ for the network size to achieve low robust generalization error when the data lies on a manifold with intrinsic dimension $k$ ($k \ll d$). Nonetheless, we also have a lower bound that grows exponentially with respect to $k$ -- the curse of dimensionality is inevitable. By demonstrating an exponential separation between the network size for achieving low robust training and generalization error, our results reveal that the hardness of robust generalization may stem from the expressive power of practical models.
    Inference and Sampling for Archimax Copulas. (arXiv:2205.14025v1 [stat.ME])
    Understanding multivariate dependencies in both the bulk and the tails of a distribution is an important problem for many applications, such as ensuring algorithms are robust to observations that are infrequent but have devastating effects. Archimax copulas are a family of distributions endowed with a precise representation that allows simultaneous modeling of the bulk and the tails of a distribution. Rather than separating the two as is typically done in practice, incorporating additional information from the bulk may improve inference of the tails, where observations are limited. Building on the stochastic representation of Archimax copulas, we develop a non-parametric inference method and sampling algorithm. Our proposed methods, to the best of our knowledge, are the first that allow for highly flexible and scalable inference and sampling algorithms, enabling the increased use of Archimax copulas in practical settings. We experimentally compare to state-of-the-art density modeling techniques, and the results suggest that the proposed method effectively extrapolates to the tails while scaling to higher dimensional data. Our findings suggest that the proposed algorithms can be used in a variety of applications where understanding the interplay between the bulk and the tails of a distribution is necessary, such as healthcare and safety.
    Dual Convexified Convolutional Neural Networks. (arXiv:2205.14056v1 [cs.LG])
    We propose the framework of dual convexified convolutional neural networks (DCCNNs). In this framework, we first introduce a primal learning problem motivated by convexified convolutional neural networks (CCNNs), and then construct the dual convex training program through careful analysis of the Karush-Kuhn-Tucker (KKT) conditions and Fenchel conjugates. Our approach reduces the memory overhead of constructing a large kernel matrix and eliminates the ambiguity of factorizing the matrix. Due to the low-rank structure in CCNNs and the related subdifferential of nuclear norms, there is no closed-form expression to recover the primal solution from the dual solution. To overcome this, we propose a novel weight recovery algorithm, which takes the dual solution and the kernel information as input, and recovers the linear and convolutional weights of a CCNN. Furthermore, our recovery algorithm exploits the low-rank structure and imposes a small number of filters indirectly, which reduces the parameter size. As a result, DCCNNs inherit all the statistical benefits of CCNNs, while enjoying a more formal and efficient workflow.
    Global Convergence of Over-parameterized Deep Equilibrium Models. (arXiv:2205.13814v1 [cs.LG])
    A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection. Instead of infinite computations, it solves for an equilibrium point directly with root-finding and computes gradients with implicit differentiation. This study investigates the training dynamics of over-parameterized DEQs. Supposing a condition on the initial equilibrium point, we show that a unique equilibrium point always exists during training and that gradient descent converges to a globally optimal solution at a linear rate for the quadratic loss function. To show that the required initial condition is satisfied via mild over-parameterization, we perform a fine-grained analysis on random DEQs. We propose a novel probabilistic framework to overcome the technical difficulty in the non-asymptotic analysis of infinite-depth weight-tied models.  ( 2 min )
    Privacy of Noisy Stochastic Gradient Descent: More Iterations without More Privacy Loss. (arXiv:2205.13710v1 [cs.LG])
    A central issue in machine learning is how to train models on sensitive user data. Industry has widely adopted a simple algorithm: Stochastic Gradient Descent with noise (a.k.a. Stochastic Gradient Langevin Dynamics). However, foundational theoretical questions about this algorithm's privacy loss remain open -- even in the seemingly simple setting of smooth convex losses over a bounded domain. Our main result resolves these questions: for a large range of parameters, we characterize the differential privacy up to a constant factor. This result reveals that all previous analyses for this setting have the wrong qualitative behavior. Specifically, while previous privacy analyses increase ad infinitum in the number of iterations, we show that after a small burn-in period, running SGD longer leaks no further privacy. Our analysis departs completely from previous approaches based on fast mixing, instead using techniques based on optimal transport (namely, Privacy Amplification by Iteration) and the Sampled Gaussian Mechanism (namely, Privacy Amplification by Sampling). Our techniques readily extend to other settings, e.g., strongly convex losses, non-uniform stepsizes, arbitrary batch sizes, and random or cyclic choice of batches.
    A Sea of Words: An In-Depth Analysis of Anchors for Text Data. (arXiv:2205.13789v1 [stat.ML])
    Anchors [Ribeiro et al. (2018)] is a post-hoc, rule-based interpretability method. For text data, it proposes to explain a decision by highlighting a small set of words (an anchor) such that the model to explain has similar outputs when they are present in a document. In this paper, we present the first theoretical analysis of Anchors, considering that the search for the best anchor is exhaustive. We leverage this analysis to gain insights on the behavior of Anchors on simple models, including elementary if-then rules and linear classifiers.
    Auditing Differential Privacy in High Dimensions with the Kernel Quantum R\'enyi Divergence. (arXiv:2205.13941v1 [cs.LG])
    Differential privacy (DP) is the de facto standard for private data release and private machine learning. Auditing black-box DP algorithms and mechanisms to certify whether they satisfy a certain DP guarantee is challenging, especially in high dimension. We propose relaxations of differential privacy based on new divergences on probability distributions: the kernel R\'enyi divergence and its regularized version. We show that the regularized kernel R\'enyi divergence can be estimated from samples even in high dimensions, giving rise to auditing procedures for $\varepsilon$-DP, $(\varepsilon,\delta)$-DP and $(\alpha,\varepsilon)$-R\'enyi DP.
    TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets. (arXiv:2204.07615v2 [cs.LG] UPDATED)
    The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.  ( 2 min )
    Estimation of Optimal Dynamic Treatment Assignment Rules under Policy Constraints. (arXiv:2106.05031v3 [econ.EM] UPDATED)
    This paper studies statistical decisions for dynamic treatment assignment problems. Many policies involve dynamics in their treatment assignments where treatments are sequentially assigned to individuals across multiple stages and the effect of treatment at each stage is usually heterogeneous with respect to the prior treatments, past outcomes, and observed covariates. We consider estimating an optimal dynamic treatment rule that guides the optimal treatment assignment for each individual at each stage based on the individual's history. This paper proposes an empirical welfare maximization approach in a dynamic framework. The approach estimates the optimal dynamic treatment rule from panel data taken from an experimental or quasi-experimental study. The paper proposes two estimation methods: one solves the treatment assignment problem at each stage through backward induction, and the other solves the whole dynamic treatment assignment problem simultaneously across all stages. We derive finite-sample upper bounds on the worst-case average welfare-regrets for the proposed methods and show $n^{-1/2}$-minimax convergence rates. We also modify the simultaneous estimation method to incorporate intertemporal budget/capacity constraints.  ( 2 min )
    Learning with Stochastic Orders. (arXiv:2205.13684v1 [stat.ML])
    Learning high-dimensional distributions is often done with explicit likelihood modeling or implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the convex or Choquet order between probability measures. Towards this end, we introduce the Choquet-Toland distance between probability measures, that can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints, that encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities and show that they suffer from the curse of dimensionality and propose surrogates via input convex maxout networks (ICMNs), that enjoy parametric rates. Finally, we provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. The code is available at https://github.com/yair-schiff/stochastic-orders-ICMN  ( 2 min )
    Generative Archimedean Copulas. (arXiv:2102.11351v3 [cs.LG] CROSS LISTED)
    We propose a new generative modeling technique for learning multidimensional cumulative distribution functions (CDFs) in the form of copulas. Specifically, we consider certain classes of copulas known as Archimedean and hierarchical Archimedean copulas, popular for their parsimonious representation and ability to model different tail dependencies. We consider their representation as mixture models with Laplace transforms of latent random variables from generative neural networks. This alternative representation allows for computational efficiencies and easy sampling, especially in high dimensions. We describe multiple methods for optimizing the network parameters. Finally, we present empirical results that demonstrate the efficacy of our proposed method in learning multidimensional CDFs and its computational efficiency compared to existing methods.  ( 2 min )
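    As background for the Laplace-transform mixture representation mentioned above: when the generator $\psi$ is the Laplace transform of a positive variable $V$, the classic Marshall-Olkin algorithm samples the copula exactly. A sketch for the Clayton family, where $V$ is Gamma$(1/\theta, 1)$; the paper's neural generators effectively replace this parametric $V$:

```python
import numpy as np

def sample_clayton(n, d, theta, seed=0):
    """Marshall-Olkin sampler: U_i = psi(E_i / V) with psi(t) = (1 + t)^(-1/theta)."""
    rng = np.random.default_rng(seed)
    V = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))  # latent mixing variable
    E = rng.exponential(size=(n, d))                          # iid Exp(1) draws
    return (1.0 + E / V) ** (-1.0 / theta)                    # rows have uniform marginals

U = sample_clayton(n=5, d=3, theta=2.0)
print(U)  # dependent uniforms with lower-tail dependence
```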
    Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures. (arXiv:2205.13647v1 [cs.LG])
    This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.  ( 2 min )
    Hazard Gradient Penalty for Survival Analysis. (arXiv:2205.13717v1 [cs.LG])
    Survival analysis appears in various fields such as medicine, economics, engineering, and business. Recent studies showed that the Ordinary Differential Equation (ODE) modeling framework unifies many existing survival models and is flexible and widely applicable. However, naively applying the ODE framework to survival analysis problems may model a fiercely changing density function, which may worsen the model's performance. Though we can apply L1 or L2 regularizers to the ODE model, their effect on the ODE modeling framework is barely known. In this paper, we propose the hazard gradient penalty (HGP) to enhance the performance of a survival analysis model. Our method imposes constraints on local data points by regularizing the gradient of the hazard function with respect to the data point. Our method applies to any survival analysis model, including the ODE modeling framework, and is easy to implement. We theoretically show that our method is related to minimizing the KL divergence between the density function at a data point and that of the neighborhood points. Experimental results on three public benchmarks show that our approach outperforms other regularization methods.  ( 2 min )
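    The penalty itself is a one-liner with autograd: regularize the squared input-gradient of the hazard at observed covariates and add it to the usual survival loss. A hedged sketch for a network `hazard_net` mapping covariates to a scalar hazard, with `lam` as a hypothetical tuning knob:

```python
import torch

def hazard_gradient_penalty(hazard_net, x, lam=1.0):
    """lam * E[ || d hazard(x) / d x ||^2 ] over a batch of covariates x."""
    x = x.clone().requires_grad_(True)
    h = hazard_net(x).sum()
    grad, = torch.autograd.grad(h, x, create_graph=True)  # create_graph so the penalty trains
    return lam * (grad ** 2).sum(dim=1).mean()

# usage inside a training step (survival_nll is whatever base loss the model uses):
# loss = survival_nll + hazard_gradient_penalty(net, batch_x)
```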
    Fast variable selection makes scalable Gaussian process BSS-ANOVA a speedy and accurate choice for tabular and time series regression. (arXiv:2205.13676v1 [cs.LG])
    Gaussian processes (GPs) are non-parametric regression engines with a long history. They are often overlooked in modern machine learning contexts because of scalability issues: regression with traditional GP kernels is $\mathcal{O}(N^3)$, where $N$ is the size of the dataset. One of a number of scalable GP approaches is the Karhunen-Lo\'eve (KL) decomposed kernel BSS-ANOVA, developed in 2009. It is $\mathcal{O}(NP)$ in training and $\mathcal{O}(P)$ per point in prediction, where $P$ is the number of terms in the ANOVA / KL expansion. A new method of forward variable selection quickly and effectively limits the number of terms, yielding a method with competitive accuracies, training and inference times for large tabular datasets. The new algorithm balances model fidelity with model complexity using Bayesian and Akaike information criteria (BIC/AIC). The inference speed and accuracy make the method especially useful for modeling dynamic systems in a model-free manner: the derivative in a dynamic system is modeled as a static problem, and the learned dynamics are then integrated using a high-order scheme. The methods are demonstrated on a `Susceptible, Infected, Recovered' (SIR) toy problem, with the transmissibility used as the forcing function, along with the `Cascaded Tanks' benchmark dataset. Comparisons on the static prediction of derivatives are made with a Random Forest and a Residual Neural Network, while for the time-series prediction comparisons are made with LSTM and GRU recurrent neural networks. The GP outperforms the other methods in all modeling tasks on accuracy, while (in the case of the neural networks) performing many orders of magnitude fewer calculations. For the SIR test, which involved prediction for a set of forcing functions qualitatively different from those appearing in the training set, the GP captured the correct dynamics while the neural networks failed to do so.  ( 3 min )
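    The model-free dynamics recipe at the end of the abstract is independent of the specific regressor: fit any static model to (state, measured derivative) pairs, then hand the learned map to a high-order ODE integrator. A minimal sketch with a random forest standing in for the BSS-ANOVA GP:

```python
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.ensemble import RandomForestRegressor

# toy system dx/dt = -x: build (state, derivative) training pairs
X = np.linspace(-2, 2, 200).reshape(-1, 1)
dX = -X.ravel()
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, dX)

# integrate the learned derivative field with a high-order scheme (RK45)
sol = solve_ivp(lambda t, x: model.predict(x.reshape(1, -1)), (0.0, 5.0), [1.5])
print(sol.y[0, -1], 1.5 * np.exp(-5.0))  # learned vs. exact trajectory endpoint
```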
    Learning to Control Linear Systems can be Hard. (arXiv:2205.14035v1 [cs.LG])
    In this paper, we study the statistical difficulty of learning to control linear systems. We focus on two standard benchmarks, the sample complexity of stabilization, and the regret of the online learning of the Linear Quadratic Regulator (LQR). Prior results state that the statistical difficulty for both benchmarks scales polynomially with the system state dimension up to system-theoretic quantities. However, this does not reveal the whole picture. By utilizing minimax lower bounds for both benchmarks, we prove that there exist non-trivial classes of systems for which learning complexity scales dramatically, i.e. exponentially, with the system dimension. This situation arises in the case of underactuated systems, i.e. systems with fewer inputs than states. Such systems are structurally difficult to control and their system theoretic quantities can scale exponentially with the system dimension dominating learning complexity. Under some additional structural assumptions (bounding systems away from uncontrollability), we provide qualitatively matching upper bounds. We prove that learning complexity can be at most exponential with the controllability index of the system, that is the degree of underactuation.  ( 2 min )
    Unequal Covariance Awareness for Fisher Discriminant Analysis and Its Variants in Classification. (arXiv:2205.13565v1 [cs.LG])
    Fisher Discriminant Analysis (FDA) is one of the essential tools for feature extraction and classification. In addition, it motivates the development of many improved techniques based on the FDA to adapt to different problems or data types. However, none of these approaches make use of the fact that the assumption of equal covariance matrices in FDA is usually not satisfied in practical situations. Therefore, we propose a novel classification rule for the FDA that accounts for this fact, mitigating the effect of unequal covariance matrices in the FDA. Furthermore, since we only modify the classification rule, the same can be applied to many FDA variants, improving these algorithms further. Theoretical analysis reveals that the new classification rule allows the implicit use of the class covariance matrices while increasing the number of parameters to be estimated by a small amount compared to going from FDA to Quadratic Discriminant Analysis. We illustrate our idea via experiments, which show the superior performance of the modified algorithms based on our new classification rule compared to the original ones.  ( 2 min )
    A gradient estimator via L1-randomization for online zero-order optimization with two point feedback. (arXiv:2205.13910v1 [math.ST])
    This work studies online zero-order optimization of convex and Lipschitz functions. We present a novel gradient estimator based on two function evaluations and randomization on the $\ell_1$-sphere. Considering different geometries of feasible sets and Lipschitz assumptions, we analyse the online mirror descent algorithm with our estimator in place of the usual gradient. We consider two types of assumptions on the noise of the zero-order oracle: canceling noise and adversarial noise. We provide an anytime and completely data-driven algorithm, which is adaptive to all parameters of the problem. In the case of canceling noise, which was previously studied in the literature, our guarantees are either comparable or better than the state-of-the-art bounds obtained by~\citet{duchi2015} and \citet{Shamir17} for non-adaptive algorithms. Our analysis is based on deriving a new Poincar\'e type inequality for the uniform measure on the $\ell_1$-sphere with explicit constants, which may be of independent interest.  ( 2 min )
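    A sketch of the estimator's shape, under our assumption (for illustration only) that the scaling follows the usual $\frac{d}{2h}\,(f(x+h\zeta)-f(x-h\zeta))\,\mathrm{sign}(\zeta)$ form with $\zeta$ uniform on the $\ell_1$-sphere; see the paper for the exact constants:

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_sphere_direction(d):
    """Uniform point on the l1-sphere: Dirichlet(1,...,1) magnitudes, random signs."""
    return rng.dirichlet(np.ones(d)) * rng.choice([-1.0, 1.0], size=d)

def two_point_estimate(f, x, h=1e-3):
    zeta = l1_sphere_direction(x.size)
    return x.size / (2 * h) * (f(x + h * zeta) - f(x - h * zeta)) * np.sign(zeta)

# sanity check on f(x) = ||x||^2: averaged estimates approach the true gradient 2x
x0 = np.array([1.0, -2.0, 0.5])
est = np.mean([two_point_estimate(lambda z: z @ z, x0) for _ in range(20000)], axis=0)
print(est, 2 * x0)
```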
    HOUDINI: Escaping from Moderately Constrained Saddles. (arXiv:2205.13753v1 [cs.LG])
    We give the first polynomial time algorithms for escaping from high-dimensional saddle points under a moderate number of constraints. Given gradient access to a smooth function $f \colon \mathbb R^d \to \mathbb R$ we show that (noisy) gradient descent methods can escape from saddle points under a logarithmic number of inequality constraints. This constitutes the first tangible progress (without reliance on NP-oracles or altering the definitions to only account for certain constraints) on the main open question of the breakthrough work of Ge et al. who showed an analogous result for unconstrained and equality-constrained problems. Our results hold for both regular and stochastic gradient descent.  ( 2 min )
    Efficient Approximation of Gromov-Wasserstein Distance using Importance Sparsification. (arXiv:2205.13573v1 [cs.LG])
    As a valid metric of metric-measure spaces, Gromov-Wasserstein (GW) distance has shown the potential for the matching problems of structured data like point clouds and graphs. However, its application in practice is limited due to its high computational complexity. To overcome this challenge, we propose a novel importance sparsification method, called Spar-GW, to approximate GW distance efficiently. In particular, instead of considering a dense coupling matrix, our method leverages a simple but effective sampling strategy to construct a sparse coupling matrix and update it with few computations. We demonstrate that the proposed Spar-GW method is applicable to the GW distance with arbitrary ground cost, and it reduces the complexity from $\mathcal{O}(n^4)$ to $\mathcal{O}(n^{2+\delta})$ for an arbitrary small $\delta>0$. In addition, this method can be extended to approximate the variants of GW distance, including the entropic GW distance, the fused GW distance, and the unbalanced GW distance. Experiments show the superiority of our Spar-GW to state-of-the-art methods in both synthetic and real-world tasks.  ( 2 min )
    Combining observational datasets from multiple environments to detect hidden confounding. (arXiv:2205.13935v1 [stat.ME])
    A common assumption in causal inference from observational data is the assumption of no hidden confounding. Yet it is, in general, impossible to verify the presence of hidden confounding factors from a single dataset. However, under the assumption of independent causal mechanisms underlying the data generative process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only violated during hidden confounding and examine cases where we break its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies.  ( 2 min )
    Understanding new tasks through the lens of training data via exponential tilting. (arXiv:2205.13577v1 [cs.LG])
    Deploying machine learning models to new tasks is a major challenge despite the large size of modern training datasets. However, it is conceivable that the training data can be reweighted to be more representative of the new (target) task. We consider the problem of reweighting the training samples to gain insights into the distribution of the target task. Specifically, we formulate a distribution shift model based on the exponential tilt assumption and learn train data importance weights minimizing the KL divergence between labeled train and unlabeled target datasets. The learned train data weights can then be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection. We demonstrate the efficacy of our method on the Waterbirds and Breeds benchmarks.  ( 2 min )
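    Under the exponential tilt assumption the importance weights take the form $w(x) \propto \exp(\theta^\top \phi(x))$, and $\theta$ solves a convex problem that matches the tilted source to the target. A hedged sketch with raw features as $\phi$ (the paper's exact objective and feature map may differ):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def fit_exponential_tilt(X_train, X_target):
    """Minimize log E_train[exp(theta^T x)] - E_target[theta^T x]; convex in theta."""
    n, d = X_train.shape
    loss = lambda th: logsumexp(X_train @ th) - np.log(n) - (X_target @ th).mean()
    theta = minimize(loss, np.zeros(d)).x
    w = np.exp(X_train @ theta)
    return theta, w / w.sum()  # normalized train-sample importance weights

rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, size=(5000, 2))
X_tg = rng.normal(0.7, 1.0, size=(5000, 2))   # mean-shifted "target task"
theta, w = fit_exponential_tilt(X_tr, X_tg)
print((w[:, None] * X_tr).sum(axis=0))        # reweighted train mean ~ [0.7, 0.7]
```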
    Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization. (arXiv:2112.08217v2 [stat.ML] UPDATED)
    Generative networks are often trained to minimize a statistical divergence between the reference distribution and the generative one in an adversarial setting. Some works instead train generative networks to minimize scoring rules, functions assessing how well the generative distribution matches each training sample individually. We show how the scoring rule formulation easily extends to the so-called prequential (predictive-sequential) score, whose minimization allows performing probabilistic forecasting with generative networks. This objective leads to adversarial-free training, thereby easily avoiding uncertainty underestimation due to mode collapse, which is a common issue in the adversarial setting and undesirable for probabilistic forecasting. We provide consistency guarantees for the minimizer of the prequential score and employ it to perform probabilistic forecasting for two chaotic dynamical models and a benchmark dataset of global weather observations. For this last example, we define scoring rules for spatial data by drawing from the relevant literature, with which we obtain better uncertainty quantification with little hyperparameter tuning compared to adversarial training.  ( 2 min )
    Distributionally Robust Bayesian Optimization with $\phi$-divergences. (arXiv:2203.02128v2 [cs.LG] UPDATED)
    The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet there exists only a limited number of works dedicated to this direction. In particular, there is the work of Kirschner et al. (2020), which bridges the existing literature of Distributionally Robust Optimization (DRO) by casting the BO problem from the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings such as finite context assumptions, leaving behind the main question: can one devise a computationally tractable algorithm for solving this DRO-BO problem? In this work, we tackle this question to a large degree of generality by considering robustness against data-shift in $\phi$-divergences, which subsumes many popular choices, such as the $\chi^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results.  ( 2 min )
    Meta-Learning Adversarial Bandits. (arXiv:2205.14128v1 [cs.LG])
    We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure. As the first to target the adversarial setting, we design a unified meta-algorithm that yields setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-algorithm tunes the initialization, step-size, and entropy parameter of the Tsallis-entropy generalization of the well-known Exp3 method, with the task-averaged regret provably improving if the entropy of the distribution over estimated optima-in-hindsight is small. For BLO, we learn the initialization, step-size, and boundary-offset of online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with a measure induced by these functions on the interior of the action space. Our adaptive guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-convex sequence of affine functions of Bregman divergences that upper-bound the regret of OMD.  ( 2 min )
    Comparing two samples through stochastic dominance: a graphical approach. (arXiv:2203.07889v2 [stat.ML] UPDATED)
    Non-deterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples in which unpredictable outcomes are common. These measures can be modeled as random variables and compared among each other via their expected values or more sophisticated tools such as null hypothesis statistical tests. In this paper, we propose an alternative framework to visually compare two samples according to their estimated cumulative distribution functions. First, we introduce a dominance measure for two random variables that quantifies the proportion in which the cumulative distribution function of one of the random variables stochastically dominates the other one. Then, we present a graphical method that decomposes into quantiles i) the proposed dominance measure and ii) the probability that one of the random variables takes lower values than the other. For illustrative purposes, we re-evaluate the experimentation of an already published work with the proposed methodology and show that additional conclusions (missed by the rest of the methods) can be inferred. Additionally, the software package RVCompare was created as a convenient way of applying and experimenting with the proposed framework.  ( 2 min )
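    The dominance measure is straightforward to estimate from two samples: compare the empirical CDFs over the pooled support and record the fraction where one lies strictly below the other. A rough analogue of the paper's quantity (for the exact definition and the quantile decomposition, see RVCompare):

```python
import numpy as np

def dominance_proportion(x, y):
    """Fraction of the pooled support where the ECDF of x lies strictly below
    that of y, i.e. where x tends to take larger values than y."""
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / x.size
    Fy = np.searchsorted(np.sort(y), grid, side="right") / y.size
    return float(np.mean(Fx < Fy))

rng = np.random.default_rng(0)
print(dominance_proportion(rng.normal(1, 1, 5000), rng.normal(0, 1, 5000)))  # near 1
```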
    Surrogate modeling for Bayesian optimization beyond a single Gaussian process. (arXiv:2205.14090v1 [stat.ML])
    Bayesian optimization (BO) has well-documented merits for optimizing black-box functions with an expensive evaluation cost. Such functions emerge in applications as diverse as hyperparameter tuning, drug discovery, and robotics. BO hinges on a Bayesian surrogate model to sequentially select query points so as to balance exploration with exploitation of the search space. Most existing works rely on a single Gaussian process (GP) based surrogate model, where the kernel function form is typically preselected using domain knowledge. To bypass such a design process, this paper leverages an ensemble (E) of GPs to adaptively select the surrogate model fit on-the-fly, yielding a GP mixture posterior with enhanced expressiveness for the sought function. Acquisition of the next evaluation input using this EGP-based function posterior is then enabled by Thompson sampling (TS) that requires no additional design parameters. To endow function sampling with scalability, random feature-based kernel approximation is leveraged per GP model. The novel EGP-TS readily accommodates parallel operation. To further establish convergence of the proposed EGP-TS to the global optimum, analysis is conducted based on the notion of Bayesian regret for both sequential and parallel settings. Tests on synthetic functions and real-world applications showcase the merits of the proposed method.  ( 2 min )
    A Unified Analysis of Federated Learning with Arbitrary Client Participation. (arXiv:2205.13648v1 [cs.LG])
    Federated learning (FL) faces challenges of intermittent client availability and computation/communication efficiency. As a result, only a small subset of clients can participate in FL at a given time. It is important to understand how partial client participation affects convergence, but most existing works have either considered idealized participation patterns or obtained results with non-zero optimality error for generic patterns. In this paper, we provide a unified convergence analysis for FL with arbitrary client participation. We first introduce a generalized version of federated averaging (FedAvg) that amplifies parameter updates at an interval of multiple FL rounds. Then, we present a novel analysis that captures the effect of client participation in a single term. By analyzing this term, we obtain convergence upper bounds for a wide range of participation patterns, including both non-stochastic and stochastic cases, which match either the lower bound of stochastic gradient descent (SGD) or the state-of-the-art results in specific settings. We also discuss various insights, recommendations, and experimental results.  ( 2 min )
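    One schematic reading of the amplified-update idea (our paraphrase of the abstract, not the paper's algorithm): run federated averaging as usual, and every $P$ rounds scale the accumulated movement of the global model by a factor $\eta > 1$. The participation pattern is left entirely arbitrary:

```python
import numpy as np

def fedavg_amplified(x0, client_update, sample_clients, rounds=100, P=5, eta=2.0):
    """FedAvg where the global-model movement is amplified by eta every P rounds."""
    x, anchor = x0.copy(), x0.copy()
    for t in range(1, rounds + 1):
        clients = sample_clients(t)  # arbitrary (possibly empty) participating set
        if clients:
            x = np.mean([client_update(x, c) for c in clients], axis=0)
        if t % P == 0:               # amplification interval
            x = anchor + eta * (x - anchor)
            anchor = x.copy()
    return x

# hypothetical stubs: clients pull the model toward private optima; participation is random
optima = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
step = lambda x, c: x - 0.1 * (x - optima[c])
participate = lambda t: [c for c in range(2) if np.random.default_rng(t).random() < 0.5]
print(fedavg_amplified(np.zeros(2), step, participate))
```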
    DP-PCA: Statistically Optimal and Differentially Private PCA. (arXiv:2205.13709v1 [cs.LG])
    We study the canonical statistical task of computing the principal component from $n$ i.i.d.~data in $d$ dimensions under $(\varepsilon,\delta)$-differential privacy. Although extensively studied in literature, existing solutions fall short on two key aspects: ($i$) even for Gaussian data, existing private algorithms require the number of samples $n$ to scale super-linearly with $d$, i.e., $n=\Omega(d^{3/2})$, to obtain non-trivial results while non-private PCA requires only $n=O(d)$, and ($ii$) existing techniques suffer from a non-vanishing error even when the randomness in each data point is arbitrarily small. We propose DP-PCA, which is a single-pass algorithm that overcomes both limitations. It is based on a private minibatch gradient ascent method that relies on {\em private mean estimation}, which adds minimal noise required to ensure privacy by adapting to the variance of a given minibatch of gradients. For sub-Gaussian data, we provide nearly optimal statistical error rates even for $n=\tilde O(d)$. Furthermore, we provide a lower bound showing that sub-Gaussian style assumption is necessary in obtaining the optimal error rate.  ( 2 min )
    Evolution of beliefs in social networks. (arXiv:2205.13587v1 [cs.LG])
    The beliefs of a society evolve as a product of interactions between people in the society (horizontal transmission) and across generations (vertical transmission). Researchers have studied both horizontal and vertical transmission separately. Extending prior work, we propose a new theoretical framework which allows application of tools from Markov chain theory to the analysis of belief evolution via horizontal and vertical transmission. We analyze three cases: a static network, a randomly changing network, and a homophily-based dynamic network. Whereas the former two assume network structure is independent of beliefs, the latter assumes that people tend to communicate with those who have similar beliefs. We prove under general conditions that both static and randomly changing networks converge to a single set of beliefs among all individuals, along with the rate of convergence. We prove that homophily-based network structures do not in general converge to a single set of beliefs shared by all, and prove lower bounds on the number of different limiting beliefs as a function of initial beliefs. We conclude by discussing implications for prior theories and directions for future work.  ( 2 min )
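    For the static-network case, the horizontal-transmission dynamics can be illustrated with the classic DeGroot-style update $b_{t+1} = W b_t$ for a row-stochastic matrix $W$; under the usual irreducibility and aperiodicity conditions from Markov chain theory, beliefs contract to a consensus. A toy sketch (the paper's framework, with vertical transmission and homophily, is richer than this):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W /= W.sum(axis=1, keepdims=True)  # row-stochastic: how much each person weighs others

b = rng.random(n)                  # initial beliefs
for _ in range(200):
    b = W @ b                      # one round of horizontal transmission
print(b)                           # all entries (approximately) equal: consensus reached
```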
    Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap. (arXiv:2203.13457v2 [cs.LG] UPDATED)
    Recently, contrastive learning has risen to be a promising approach for large-scale self-supervised learning. However, theoretical understanding of how it works is still unclear. In this paper, we propose a new guarantee on the downstream performance without resorting to the conditional independence assumption that is widely adopted in previous work but hardly holds in practice. Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations, thus simply aligning the positive samples (augmented views of the same sample) could make contrastive learning cluster intra-class samples together. Based on this augmentation overlap perspective, theoretically, we obtain asymptotically closed bounds for downstream performance under weaker assumptions, and empirically, we propose an unsupervised model selection metric ARC that aligns well with downstream accuracy. Our theory suggests an alternative understanding of contrastive learning: the role of aligning positive samples is more like a surrogate task than an ultimate goal, and the overlapped augmented views (i.e., the chaos) create a ladder for contrastive learning to gradually learn class-separated representations. The code for computing ARC is available at https://github.com/zhangq327/ARC.  ( 2 min )
    Topological Hidden Markov Models. (arXiv:2205.13608v1 [stat.ME])
    The hidden Markov model (HMM) is a classic modeling tool with a wide swath of applications. Its inception considered observations restricted to a finite alphabet, but it was quickly extended to multivariate continuous distributions. In this article, we further extend the HMM from mixtures of normal distributions in $d$-dimensional Euclidean space to general Gaussian measure mixtures in locally convex topological spaces. The main innovation is the use of the Onsager-Machlup functional as a proxy for the probability density function in infinite dimensional spaces. This allows for choice of a Cameron-Martin space suitable for a given application. We demonstrate the versatility of this methodology by applying it to simulated diffusion processes such as Brownian and fractional Brownian sample paths as well as the Ornstein-Uhlenbeck process. Our methodology is applied to the identification of sleep states from overnight polysomnography time series data with the aim of diagnosing Obstructive Sleep Apnea in pediatric patients. It is also applied to a series of annual cumulative snowfall curves from 1940 to 1990 in the city of Edmonton, Alberta.
    Finite mixture of skewed sub-Gaussian stable distributions. (arXiv:2205.14067v1 [stat.ME])
    We propose a finite mixture of skewed sub-Gaussian stable distributions. The maximum likelihood estimator for the parameters of the proposed finite mixture model is computed through the expectation-maximization algorithm. The proposed model contains the finite mixtures of normal and skewed normal distributions as special cases. Since the tails of the proposed model are heavier than even those of the Student's t distribution, it can be used as a powerful model for robust model-based clustering. Performance of the proposed model is demonstrated by clustering simulated data and two sets of real data.  ( 2 min )
    An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification. (arXiv:2112.13236v3 [cs.CR] UPDATED)
    Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Thus, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features by machine and deep learning models for malware classification, as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships between API calls. On the other hand, transformer-based models process sequences as a whole and learn relationships between API calls due to multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that the transformer model with one transformer block layer surpassed the widely used base architecture, LSTM. Moreover, BERT and CANINE, pre-trained transformer models, outperformed others in classifying highly imbalanced malware families according to the evaluation metrics, F1-score and AUC score. Furthermore, the proposed bagging-based random transformer forest (RTF), an ensemble of BERT or CANINE, has reached state-of-the-art evaluation scores on three out of four datasets, including a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets.
    Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations. (arXiv:2205.13571v1 [cs.LG])
    Neural networks have achieved tremendous success in a large variety of applications. However, their memory footprint and computational demand can render them impractical in application settings with limited hardware or energy resources. In this work, we propose a novel algorithm to find efficient low-rank subnetworks. Remarkably, these subnetworks are determined and adapted already during the training phase, and the overall time and memory resources required by both training and evaluating them are significantly reduced. The main idea is to restrict the weight matrices to a low-rank manifold and to update the low-rank factors rather than the full matrix during training. To derive training updates that are restricted to the prescribed manifold, we employ techniques from dynamic model order reduction for matrix differential equations. Moreover, our method automatically and dynamically adapts the ranks during training to achieve a desired approximation accuracy. The efficiency of the proposed method is demonstrated through a variety of numerical experiments on fully-connected and convolutional networks.
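    The central move, storing $W = UV^\top$ and training the factors so the full matrix is never materialized, looks like this in miniature (the rank-adaptation and model-order-reduction updates from the paper are omitted):

```python
import torch

class LowRankLinear(torch.nn.Module):
    """Linear layer with weight W = U V^T of rank r; only the factors are trained."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.U = torch.nn.Parameter(torch.randn(d_out, rank) / rank ** 0.5)
        self.V = torch.nn.Parameter(torch.randn(d_in, rank) / rank ** 0.5)
        self.b = torch.nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        return (x @ self.V) @ self.U.T + self.b  # never forms the d_out x d_in matrix

layer = LowRankLinear(784, 256, rank=16)
print(sum(p.numel() for p in layer.parameters()))  # 16*(784+256)+256 vs 784*256+256
```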
    MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models. (arXiv:2205.13869v1 [cs.LG])
    State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may suffer from suboptimality, as the imputation algorithm is unaware of the causal discovery step. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and the identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization (EM) framework. In the E-step, in cases where computing the posterior distributions of parameters in closed-form is not feasible, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler and specific formulations by virtue of the ANMs and uses a likelihood-based causal discovery algorithm with directed acyclic graph prior as an inductive bias. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
    Average Adjusted Association: Efficient Estimation with High Dimensional Confounders. (arXiv:2205.14048v1 [stat.ME])
    The log odds ratio is a common parameter to measure association between (binary) outcome and exposure variables. Much attention has been paid to its parametric but robust estimation, or its nonparametric estimation as a function of confounders. However, discussion on how to use a summary statistic by averaging the log odds ratio function is surprisingly difficult to find despite the popularity and importance of averaging in other contexts such as estimating the average treatment effect. We propose two efficient double/debiased machine learning (DML) estimators of the average log odds ratio, where the odds ratios are adjusted for observed (potentially high dimensional) confounders and are averaged over them. The estimators are built from two equivalent forms of the efficient influence function. The first estimator uses a prospective probability of the outcome conditional on the exposure and confounders; the second one employs a retrospective probability of the exposure conditional on the outcome and confounders. Our framework encompasses random sampling as well as outcome-based or exposure-based sampling. Finally, we illustrate how to apply the proposed estimators using real data.
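    For orientation, a naive plug-in version of the prospective form is easy to write down: fit $P(Y=1|T,X)$, evaluate the log odds ratio at $T=1$ versus $T=0$ for each observed $X$, and average. The paper's DML estimators add debiasing terms and cross-fitting on top of this; the data-generating process below is invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def average_adjusted_log_odds_ratio(y, t, X):
    """Naive plug-in estimate of the confounder-averaged log odds ratio
    via a "prospective" model P(Y=1 | T, X). No debiasing or
    cross-fitting, unlike the DML estimators in the paper."""
    model = LogisticRegression(max_iter=1000).fit(np.column_stack([t, X]), y)
    p1 = model.predict_proba(np.column_stack([np.ones_like(t), X]))[:, 1]
    p0 = model.predict_proba(np.column_stack([np.zeros_like(t), X]))[:, 1]
    log_or = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
    return log_or.mean()

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))                        # confounders
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))    # confounded exposure
logit = 1.0 * t + X @ np.array([0.5, -0.3, 0.2])   # true log OR is 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
print(average_adjusted_log_odds_ratio(y, t, X))    # approximately 1.0
```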
    Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms. (arXiv:2107.03863v3 [stat.ML] UPDATED)
    Describing the relationship between the variables in a study domain and modelling the data generating mechanism is a fundamental problem in many empirical sciences. Probabilistic graphical models are one common approach to tackle the problem. Learning the graphical structure for such models is computationally challenging and a fervent area of current research with a plethora of algorithms being developed. To facilitate the benchmarking of different methods, we present a novel Snakemake workflow, called Benchpress, for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms for probabilistic graphical models. Benchpress is interfaced via a simple JSON-file, which makes it accessible for all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. Benchpress currently provides an interface to a large number of state-of-the-art algorithms from libraries such as BDgraph, BiDAG, bnlearn, gCastle, GOBNILP, pcalg, r.blip, scikit-learn, TETRAD, and trilearn as well as a variety of methods for data generating models and performance evaluation. Alongside user-defined models and randomly generated datasets, the workflow also includes a number of standard datasets and graphical models from the literature, which may be included in a benchmarking study. We demonstrate the applicability of this workflow for learning Bayesian networks in five typical data scenarios. The source code and documentation are publicly available from this http URL
    Explaining Preferences with Shapley Values. (arXiv:2205.13662v1 [stat.ML])
    While preference modelling is becoming one of the pillars of machine learning, the problem of preference explanation remains challenging and underexplored. In this paper, we propose \textsc{Pref-SHAP}, a Shapley value-based model explanation framework for pairwise comparison data. We derive the appropriate value functions for preference models and further extend the framework to model and explain \emph{context specific} information, such as the surface type in a tennis game. To demonstrate the utility of \textsc{Pref-SHAP}, we apply our method to a variety of synthetic and real-world datasets and show that richer and more insightful explanations can be obtained over the baseline.
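    As background, the exact Shapley value that such frameworks build on can be computed directly when the number of features is small. The toy value function below is invented for illustration; Pref-SHAP's contribution is deriving appropriate value functions for preference models, not this generic computation.

```python
import itertools
import math

def shapley_values(value, n):
    """Exact Shapley values for a value function over feature subsets
    (frozensets of indices 0..n-1). Exponential cost; fine for small n."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in itertools.combinations(others, size):
                weight = (math.factorial(len(S))
                          * math.factorial(n - len(S) - 1)
                          / math.factorial(n))
                phi[i] += weight * (value(frozenset(S) | {i})
                                    - value(frozenset(S)))
    return phi

# Toy value function over 3 features, with an interaction between 0 and 2.
v = lambda S: 2.0 * (0 in S) + 1.0 * (1 in S) + 0.5 * (0 in S and 2 in S)
print(shapley_values(v, 3))   # efficiency: the values sum to v({0,1,2})
```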
    A Multilabel Classification Framework for Approximate Nearest Neighbor Search. (arXiv:1910.08322v4 [cs.LG] UPDATED)
    Both supervised and unsupervised machine learning algorithms have been used to learn partition-based index structures for approximate nearest neighbor (ANN) search. Existing supervised algorithms formulate the learning task as finding a partition in which the nearest neighbors of a training set point belong to the same partition element as the point itself, so that the nearest neighbor candidates can be retrieved by naive lookup or backtracking search. We formulate candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point, and interpret the partitions as partitioning classifiers for solving this task. Empirical results suggest that the natural classifier based on this interpretation leads to strictly improved performance when combined with any unsupervised or supervised partitioning strategy. We also prove a sufficient condition for consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees.
    Asymptotic Convergence Rate and Statistical Inference for Stochastic Sequential Quadratic Programming. (arXiv:2205.13687v1 [math.OC])
    We apply a stochastic sequential quadratic programming (StoSQP) algorithm to solve constrained nonlinear optimization problems, where the objective is stochastic and the constraints are deterministic. We study a fully stochastic setup, where only a single sample is available in each iteration for estimating the gradient and Hessian of the objective. We allow StoSQP to select a random stepsize $\bar{\alpha}_t$ adaptively, such that $\beta_t\leq \bar{\alpha}_t \leq \beta_t+\chi_t$, where $\beta_t$, $\chi_t=o(\beta_t)$ are prespecified deterministic sequences. We also allow StoSQP to solve the Newton system inexactly via randomized iterative solvers, e.g., with the sketch-and-project method; and we do not require the approximation error of the inexact Newton direction to vanish. For this general StoSQP framework, we establish the asymptotic convergence rate for its last iterate, with the worst-case iteration complexity as a byproduct; and we perform statistical inference. In particular, with proper decaying $\beta_t,\chi_t$, we show that: (i) the StoSQP scheme can take at most $O(1/\epsilon^4)$ iterations to achieve $\epsilon$-stationarity; (ii) asymptotically and almost surely, $\|(x_t -x^\star, \lambda_t - \lambda^\star)\| = O(\sqrt{\beta_t\log(1/\beta_t)})+O(\chi_t/\beta_t)$, where $(x_t,\lambda_t)$ is the primal-dual StoSQP iterate; (iii) the sequence $1/\sqrt{\beta_t}\cdot (x_t -x^\star, \lambda_t - \lambda^\star)$ converges to a mean zero Gaussian distribution with a nontrivial covariance matrix. Moreover, we establish the Berry-Esseen bound for $(x_t, \lambda_t)$ to measure quantitatively the convergence of its distribution function. We also provide a practical estimator for the covariance matrix, from which the confidence intervals of $(x^\star, \lambda^\star)$ can be constructed using iterates $\{(x_t,\lambda_t)\}_t$. Our theorems are validated using nonlinear problems in the CUTEst test set.
    Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes. (arXiv:2205.13589v1 [cs.LG])
    We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves a $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To our best knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
    Error Bound of Empirical $\ell_2$ Risk Minimization for Noisy Standard and Generalized Phase Retrieval Problems. (arXiv:2205.13827v1 [stat.ML])
    A noisy generalized phase retrieval (NGPR) problem refers to the problem of estimating $x_0 \in \mathbb{C}^d$ from noisy quadratic samples $\big\{x_0^*A_kx_0+\eta_k\big\}_{k=1}^n$, where $A_k$ is a Hermitian matrix and $\eta_k$ is a noise scalar. When $A_k=\alpha_k\alpha_k^*$ for some $\alpha_k\in\mathbb{C}^d$, it reduces to a standard noisy phase retrieval (NPR) problem. The main aim of this paper is to study the estimation performance of empirical $\ell_2$ risk minimization in both problems when $A_k$ in NGPR, or $\alpha_k$ in NPR, is drawn from a sub-Gaussian distribution. Under different kinds of noise patterns, we establish error bounds that imply approximate reconstruction; these results are new in the literature. In NGPR, we show the bounds are of $O\big(\frac{||\eta||}{\sqrt{n}}\big)$ and $O\big(||\eta||_\infty \sqrt{\frac{d}{n}}\big)$ for general noise, and of $O\big(\sqrt{\frac{d\log n}{n}}\big)$ and $O\big(\sqrt{\frac{d(\log n)^2}{n}}\big)$ for random noise with sub-Gaussian and sub-exponential tails respectively, where $\| \eta \|$ and $\| \eta \|_{\infty}$ are the 2-norm and sup-norm of the noise vector with entries $\eta_k$. Under heavy-tailed noise, by truncating response outliers we propose a robust estimator that possesses an error bound with a slower convergence rate. On the other hand, in NPR we obtain bounds of $O\big(\sqrt{\frac{d\log n}{n}}\big)$ and $O\big(\sqrt{\frac{d(\log n)^2}{n}}\big)$ for sub-Gaussian and sub-exponential noise respectively, which are essentially tighter than the existing bound $O\big(\frac{||\eta||_2}{\sqrt{n}}\big)$. Although NGPR, involving measurement matrices $A_k$, is more computationally demanding than NPR, involving measurement vectors $\alpha_k$, our results reveal that NGPR exhibits stronger robustness than NPR under biased and deterministic noise. Experimental results are presented to confirm and demonstrate our theoretical findings.
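    A small numpy experiment in the NGPR setting with real symmetric Gaussian matrices may make the setup concrete: generate quadratic measurements, then run gradient descent on the empirical $\ell_2$ risk. The objective is nonconvex, so convergence from a random start is not guaranteed in general, though at this sample size it typically recovers $x_0$ up to a global sign.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 400
x0 = rng.normal(size=d); x0 /= np.linalg.norm(x0)

# Real symmetric Gaussian measurement matrices A_k
G = rng.normal(size=(n, d, d))
A = (G + G.transpose(0, 2, 1)) / 2
y = np.einsum('i,nij,j->n', x0, A, x0) + 0.01 * rng.normal(size=n)

def risk_and_grad(x):
    res = np.einsum('i,nij,j->n', x, A, x) - y         # residuals
    grad = 4 * np.einsum('n,nij,j->i', res, A, x) / n  # valid for symmetric A_k
    return (res ** 2).mean(), grad

x = 0.3 * rng.normal(size=d)          # random initialization
for _ in range(3000):
    loss, g = risk_and_grad(x)
    x -= 0.02 * g

# x and -x give identical measurements, so compare up to global sign
err = min(np.linalg.norm(x - x0), np.linalg.norm(x + x0))
print(f"final loss {loss:.5f}, error up to sign {err:.3f}")
```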
  • Open

    Comparing two samples through stochastic dominance: a graphical approach. (arXiv:2203.07889v2 [stat.ML] UPDATED)
    Non-deterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples in which unpredictable outcomes are common. These measures can be modeled as random variables and compared among each other via their expected values or more sophisticated tools such as null hypothesis statistical tests. In this paper, we propose an alternative framework to visually compare two samples according to their estimated cumulative distribution functions. First, we introduce a dominance measure for two random variables that quantifies the proportion in which the cumulative distribution function of one of the random variables stochastically dominates the other one. Then, we present a graphical method that decomposes in quantiles i) the proposed dominance measure and ii) the probability that one of the random variables takes lower values than the other. For illustrative purposes, we re-evaluate the experiments of a previously published work with the proposed methodology and show that additional conclusions (missed by the other methods) can be inferred. Additionally, the software package RVCompare was created as a convenient way of applying and experimenting with the proposed framework.  ( 2 min )
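    One simple way to approximate a CDF-based dominance proportion from two samples is to compare the empirical CDFs on a shared grid. The paper's measure and its quantile decomposition (and the RVCompare package) are more refined than this rough grid-based sketch.

```python
import numpy as np

def cdf_dominance_proportion(x, y, grid_size=1000):
    """Fraction of the contested part of a common grid on which the
    empirical CDF of x lies strictly above that of y (i.e., x tends
    to take smaller values). An illustrative approximation only."""
    grid = np.linspace(min(x.min(), y.min()), max(x.max(), y.max()),
                       grid_size)
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    disagree = Fx != Fy
    return (Fx > Fy).sum() / max(disagree.sum(), 1)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 500)    # e.g., errors of algorithm A
b = rng.normal(0.5, 1.0, 500)    # algorithm B is worse on average
print(cdf_dominance_proportion(a, b))   # close to 1: F_a sits above F_b
```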
    Seeing Differently, Acting Similarly: Heterogeneously Observable Imitation Learning. (arXiv:2106.09256v3 [cs.LG] UPDATED)
    In many real-world imitation learning tasks, the demonstrator and the learner have to act under totally different observation spaces. This situation poses significant obstacles for existing imitation learning approaches, since most of them learn policies under homogeneous observation spaces. Previous studies under different observation spaces, on the other hand, rely on the strong assumption that the two observation spaces coexist during the entire learning process. However, in reality, observation coexistence is limited due to the high cost of acquiring expert observations. In this work, we study this challenging problem with limited observation coexistence under heterogeneous observations: Heterogeneously Observable Imitation Learning (HOIL). We identify two underlying issues in HOIL, i.e., the dynamics mismatch and the support mismatch, and further propose the Importance Weighting with REjection (IWRE) algorithm, based on importance weighting and learning with rejection, to solve HOIL problems. Experimental results show that IWRE can successfully solve various HOIL tasks, including the challenging tasks of transforming vision-based demonstrations into random-access-memory (RAM)-based policies in the Atari domain, even with limited visual observations.  ( 2 min )
    Faster Optimization on Sparse Graphs via Neural Reparametrization. (arXiv:2205.13624v1 [cs.LG])
    In mathematical optimization, second-order Newton's methods generally converge faster than first-order methods, but they require the inverse of the Hessian and hence are computationally expensive. However, we discover that on sparse graphs, graph neural networks (GNNs) can implement an efficient Quasi-Newton method that can speed up optimization by a factor of 10-100x. Our method, neural reparametrization, modifies the optimization parameters as the output of a GNN to reshape the optimization landscape. Using a precomputed Hessian as the propagation rule, the GNN can effectively utilize the second-order information, achieving an effect similar to adaptive gradient methods. As our method solves optimization through architecture design, it can be used in conjunction with any optimizer, such as Adam or RMSProp. We show the application of our method on scientifically relevant problems including heat diffusion, synchronization, and persistent homology.  ( 2 min )
    Notes on Generalizing the Maximum Entropy Principle to Uncertain Data. (arXiv:2109.04530v2 [cs.IT] UPDATED)
    The principle of maximum entropy is a broadly applicable technique for computing a distribution with the least amount of information possible constrained to match empirical data, for instance, feature expectations. We seek to generalize this principle to scenarios where the empirical feature expectations cannot be computed because the model variables are only partially observed, which introduces a dependency on the learned model. Generalizing the principle of latent maximum entropy, we introduce uncertain maximum entropy and describe an expectation-maximization based solution to approximately solve these problems. We show that our technique additionally generalizes the principle of maximum entropy. We additionally discuss the use of black box classifiers with our technique, which simplifies the process of utilizing sparse, large data sets.  ( 2 min )
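    For context, the fully observed principle being generalized admits a compact implementation: maximize entropy subject to matching feature expectations by gradient descent on the convex dual, i.e., fit $p(x) \propto \exp(\lambda \cdot f(x))$. The uncertain-data variant described above wraps an EM loop around this inner problem, which is not shown; the dice example is the standard textbook illustration, not from the paper.

```python
import numpy as np

def max_entropy_distribution(features, target_means, lr=0.5, iters=2000):
    """Maximum-entropy distribution over a finite alphabet matching
    feature expectations: p(x) proportional to exp(lambda . f(x)),
    solved by gradient descent on the convex dual."""
    n, k = features.shape            # n outcomes, k features
    lam = np.zeros(k)
    for _ in range(iters):
        logits = features @ lam
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = features.T @ p - target_means   # E_p[f] minus target
        lam -= lr * grad
    return p

# Classic example: a 6-faced die constrained to have mean pip count 4.5.
faces = np.arange(1, 7, dtype=float).reshape(-1, 1)   # one feature
p = max_entropy_distribution(faces, np.array([4.5]))
print(p, (p * faces.ravel()).sum())   # expectation matches 4.5
```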
    Characterizing Parametric and Convergence Stability in Nonconvex and Nonsmooth Optimizations: A Geometric Approach. (arXiv:2204.01643v2 [cs.GT] UPDATED)
    We consider stability issues in minimizing a continuous (possibly parameterized, nonconvex, and nonsmooth) real-valued function $f$. We call a point stationary if all its possible directional derivatives are nonnegative. In this work, we focus on two notions of stability on stationary points of $f$: parametric stability and convergence stability. Parametric considerations are widely studied in various fields, including smoothed analysis, numerical stability, condition numbers, and sensitivity analysis for linear programming. Parametric stability asks whether minor perturbations on parameters lead to dramatic changes in the position and $f$ value of a stationary point. Meanwhile, convergence stability indicates a non-escapable solution: any point sequence iteratively produced by an optimization algorithm cannot escape from a neighborhood of a stationary point but gets close to it, in the sense that such stationary points are stable to the precision parameter and algorithmic numerical errors. It turns out that these notions have deep connections to geometry theory. We show that parametric stability is linked to deformations of graphs of functions. On the other hand, convergence stability is concerned with area partitioning of the function domain. Utilizing these connections, we prove quite tight conditions for these two stability notions for a wide range of functions and optimization algorithms with small enough step sizes and precision parameters. These conditions are subtle in the sense that a slightly weaker function requirement runs counter to basic intuition and leads to wrong conclusions. We present three applications of this theory. These applications shed light on Nash equilibrium computation, nonconvex and nonsmooth optimization, as well as the new optimization methodology of deep neural networks.  ( 3 min )
    Waiting but not Aging: Optimizing Information Freshness Under the Pull Model. (arXiv:1912.08722v4 [cs.NI] UPDATED)
    The Age-of-Information (AoI) is an important metric for investigating the timeliness performance of information-update systems. In this paper, we study the AoI minimization problem under a new Pull model with replication schemes, where a user proactively sends a replicated request to multiple servers to "pull" the information of interest. Interestingly, we find that under this new Pull model, replication schemes capture a novel tradeoff between different values of the AoI across the servers (due to the random updating processes) and different response times across the servers, which can be exploited to minimize the expected AoI at the user's side. Specifically, assuming a Poisson updating process for the servers and exponentially distributed response times, we derive a closed-form formula for computing the expected AoI and obtain the optimal number of responses to wait for to minimize the expected AoI. Then, we extend our analysis to the setting where the user aims to maximize the AoI-based utility, which represents the user's satisfaction level with respect to the freshness of the received information. Furthermore, we consider a more realistic scenario where the user has no prior knowledge of the system. In this case, we reformulate the utility maximization problem as a stochastic Multi-Armed Bandit problem with side observations and leverage a special linear structure of the side observations to design learning algorithms with improved performance guarantees. Finally, we conduct extensive simulations to elucidate our theoretical results and compare the performance of different algorithms. Our findings reveal that under the Pull model, waiting does not necessarily lead to aging; waiting for more than one response can often significantly reduce the AoI and improve the AoI-based utility in most scenarios.  ( 3 min )
    C$^2$SP-Net: Joint Compression and Classification Network for Epilepsy Seizure Prediction. (arXiv:2110.13674v2 [cs.LG] UPDATED)
    Recent developments in brain-machine interface technology have made seizure prediction possible. However, the communication of large volumes of electrophysiological signals between sensors and processing apparatus, and the related computation, have become two major bottlenecks for seizure prediction systems due to the constrained bandwidth and limited computation resources, especially for wearable and implantable medical devices. Although compressive sensing (CS) can be adopted to compress the signals and reduce the communication bandwidth requirement, it needs a complex reconstruction procedure before the signal can be used for seizure prediction. In this paper, we propose C$^2$SP-Net to jointly solve compression, prediction, and reconstruction with a single neural network. A plug-and-play in-sensor compression matrix is constructed to reduce the transmission bandwidth requirement. The compressed signal can be used for seizure prediction without additional reconstruction steps. Reconstruction of the original signal can also be carried out in high fidelity. Prediction accuracy, sensitivity, false prediction rate, and reconstruction quality of the proposed framework are evaluated under various compression ratios. The experimental results illustrate that our model outperforms competitive state-of-the-art baselines by a large margin in prediction accuracy. In particular, our proposed method incurs an average loss of 0.35% in prediction accuracy with a compression ratio ranging from 1/2 to 1/16.  ( 2 min )
    Evaluating the Robustness of Deep Reinforcement Learning for Autonomous and Adversarial Policies in a Multi-agent Urban Driving Environment. (arXiv:2112.11947v2 [cs.AI] UPDATED)
    Deep reinforcement learning is actively used for training autonomous and adversarial car policies in simulated driving environments. Due to the wide availability of various reinforcement learning algorithms and the lack of their systematic comparison across different driving scenarios, we are unsure which ones are more effective for training and testing autonomous car software in single-agent as well as multi-agent driving environments. A benchmarking framework for comparing deep reinforcement learning algorithms in vision-based autonomous driving would open up possibilities for training better autonomous car driving policies. Furthermore, autonomous cars trained with deep reinforcement learning-based algorithms are known to be vulnerable to adversarial attacks. To guard against adversarial attacks, we can train autonomous cars on adversarial driving policies. However, we lack knowledge of which deep reinforcement learning algorithms would act as good adversarial agents able to effectively test autonomous cars. To address these challenges, we provide an open and reusable benchmarking framework for systematic evaluation and comparative analysis of deep reinforcement learning algorithms for autonomous and adversarial driving in single- and multi-agent environments. Using the framework, we perform a comparative study of five discrete and two continuous action space deep reinforcement learning algorithms. We run the experiments in vision-only, high-fidelity simulated urban driving environments. The results indicate that only some of the deep reinforcement learning algorithms perform consistently better across single- and multi-agent scenarios when trained in a multi-agent-only setting.  ( 2 min )
    PSL is Dead. Long Live PSL. (arXiv:2205.14136v1 [cs.LG])
    Property Specification Language (PSL) is a form of temporal logic that has been mainly used in discrete domains (e.g. formal hardware verification). In this paper, we show that by merging machine learning techniques with PSL monitors, we can extend PSL to work on continuous domains. We apply this technique in machine learning-based anomaly detection to analyze scenarios of real-time streaming events from continuous variables in order to detect abnormal behaviors of a system. By using machine learning with formal models, we leverage the strengths of both machine learning methods and formal semantics of time. On one hand, machine learning techniques can produce distributions on continuous variables, where abnormalities can be captured as deviations from the distributions. On the other hand, formal methods can characterize discrete temporal behaviors and relations that cannot be easily learned by machine learning techniques. Interestingly, the anomalies detected by machine learning and the underlying time representation used are discrete events. We implemented a temporal monitoring package (TEF) that operates in conjunction with normal data science packages for anomaly detection machine learning systems, and we show that TEF can be used to perform accurate interpretation of temporal correlation between events.  ( 2 min )
    A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning. (arXiv:2112.15400v2 [cs.LG] UPDATED)
    Gradient-based Meta-RL (GMRL) refers to methods that maintain two-level optimisation procedures wherein the outer-loop meta-learner guides the inner-loop gradient-based reinforcement learner to achieve fast adaptations. In this paper, we develop a unified framework that describes variations of GMRL algorithms and points out that existing stochastic meta-gradient estimators adopted by GMRL are actually \textbf{biased}. Such meta-gradient bias comes from two sources: 1) the compositional bias incurred by the two-level problem structure, which has an upper bound of $\mathcal{O}\big(K\alpha^{K}\hat{\sigma}_{\text{In}}|\tau|^{-0.5}\big)$ \emph{w.r.t.} inner-loop update step $K$, learning rate $\alpha$, estimate variance $\hat{\sigma}^{2}_{\text{In}}$ and sample size $|\tau|$, and 2) the multi-step Hessian estimation bias $\hat{\Delta}_{H}$ due to the use of autodiff, which has a polynomial impact $\mathcal{O}\big((K-1)(\hat{\Delta}_{H})^{K-1}\big)$ on the meta-gradient bias. We study tabular MDPs empirically and offer quantitative evidence that supports our theoretical findings on existing stochastic meta-gradient estimators. Furthermore, we conduct experiments on Iterated Prisoner's Dilemma and Atari games to show how other methods such as off-policy learning and low-bias estimators can help fix the gradient bias for GMRL algorithms in general.  ( 2 min )
    Spelunking the Deep: Guaranteed Queries on General Neural Implicit Surfaces via Range Analysis. (arXiv:2202.02444v2 [cs.CV] UPDATED)
    Neural implicit representations, which encode a surface as the level set of a neural network applied to spatial coordinates, have proven to be remarkably effective for optimizing, compressing, and generating 3D geometry. Although these representations are easy to fit, it is not clear how to best evaluate geometric queries on the shape, such as intersecting against a ray or finding a closest point. The predominant approach is to encourage the network to have a signed distance property. However, this property typically holds only approximately, leading to robustness issues, and holds only at the conclusion of training, inhibiting the use of queries in loss functions. Instead, this work presents a new approach to perform queries directly on general neural implicit functions for a wide range of existing architectures. Our key tool is the application of range analysis to neural networks, using automatic arithmetic rules to bound the output of a network over a region; we conduct a study of range analysis on neural networks, and identify variants of affine arithmetic which are highly effective. We use the resulting bounds to develop geometric queries including ray casting, intersection testing, constructing spatial hierarchies, fast mesh extraction, closest-point evaluation, evaluating bulk properties, and more. Our queries can be efficiently evaluated on GPUs, and offer concrete accuracy guarantees even on randomly-initialized networks, enabling their use in training objectives and beyond. We also show a preliminary application to inverse rendering.  ( 2 min )
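    A minimal version of the core tool, range analysis over a box, is plain interval arithmetic pushed through an MLP. The paper finds that affine-arithmetic variants give much tighter bounds than the naive interval version sketched here; all weights below are random and illustrative.

```python
import numpy as np

def interval_bounds(lo, hi, weights, biases):
    """Propagate an input box [lo, hi] through a ReLU MLP with standard
    interval arithmetic. Returns guaranteed (if loose) bounds on the
    network output over the box."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        center, radius = (lo + hi) / 2, (hi - lo) / 2
        # an affine map sends a box to the box: W c + b, plus/minus |W| r
        new_center = W @ center + b
        new_radius = np.abs(W) @ radius
        lo, hi = new_center - new_radius, new_center + new_radius
        if i < len(weights) - 1:           # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

rng = np.random.default_rng(0)
sizes = [3, 32, 32, 1]                     # e.g., an implicit surface net
Ws = [rng.normal(0, 0.3, (m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [rng.normal(0, 0.1, m) for m in sizes[1:]]

lo, hi = interval_bounds(np.array([-0.1] * 3), np.array([0.1] * 3), Ws, bs)
print(lo, hi)   # if 0 lies outside [lo, hi], the box contains no surface
```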
    A Two-Stage Federated Transfer Learning Framework in Medical Images Classification on Limited Data: A COVID-19 Case Study. (arXiv:2203.12803v2 [eess.IV] UPDATED)
    The COVID-19 pandemic has spread rapidly and caused a shortage of global medical resources, so the efficiency of COVID-19 diagnosis has become highly significant. As deep learning and convolutional neural networks (CNNs) have been widely utilized and verified in analyzing medical images, they have become a powerful tool for computer-assisted diagnosis. However, there are two significant challenges in medical image classification with deep learning and neural networks. One is the difficulty of acquiring enough samples, which may lead to model overfitting. The other stems mainly from privacy concerns, since medical records are often deemed patients' private information and are protected by laws such as GDPR and HIPAA. Federated learning ensures that model training is decentralized across different devices and that no data is shared among them, which guarantees privacy. However, with data located on different devices, the accessible data of each device can be limited. Since transfer learning has been verified to deal well with limited data, in this paper we implement federated learning and transfer learning techniques using CNNs to classify COVID-19 from lung CT scans. We also explore the impact of the dataset distribution on the client side in federated learning and the number of epochs a model is trained for. Finally, we obtain very high performance with federated learning, demonstrating our success in balancing accuracy and privacy.  ( 3 min )
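    For reference, the aggregation step at the heart of such a pipeline can be as simple as FedAvg-style weighted parameter averaging, with transfer learning entering as the shared pretrained initialization each client starts from. The shapes and client sizes below are invented for illustration.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated averaging: aggregate client model parameters weighted
    by local dataset size; the raw data never leaves the clients."""
    total = sum(client_sizes)
    return [
        sum(n / total * w[k] for w, n in zip(client_weights, client_sizes))
        for k in range(len(client_weights[0]))
    ]

# Toy round: 3 clients with differently sized local datasets, each
# returning updated parameters for a model with two weight arrays.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(4, 2)), rng.normal(size=2)] for _ in range(3)]
sizes = [120, 80, 300]            # e.g., number of local CT slices
global_model = fedavg(clients, sizes)
print(global_model[0].shape, global_model[1].shape)
```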
    A Comprehensive Survey on Radio Frequency (RF) Fingerprinting: Traditional Approaches, Deep Learning, and Open Challenges. (arXiv:2201.00680v2 [cs.LG] UPDATED)
    Fifth generation (5G) networks and beyond envision massive Internet of Things (IoT) rollouts to support disruptive applications such as extended reality (XR), augmented/virtual reality (AR/VR), industrial automation, autonomous driving, and smart everything, which together bring massive and diverse IoT devices occupying the radio frequency (RF) spectrum. Along with spectrum crunch and throughput challenges, such a massive scale of wireless devices exposes unprecedented threat surfaces. RF fingerprinting is heralded as a candidate technology that can be combined with cryptographic and zero-trust security measures to ensure data privacy, confidentiality, and integrity in wireless networks. Motivated by the relevance of this subject in future communication networks, in this work we present a comprehensive survey of RF fingerprinting approaches, ranging from a traditional view to the most recent deep learning (DL) based algorithms. Existing surveys have mostly focused on a constrained presentation of wireless fingerprinting approaches, and many aspects remain untold. In this work, we mitigate this by addressing every aspect - background on signal intelligence (SIGINT), applications, relevant DL algorithms, a systematic literature review of RF fingerprinting techniques spanning the past two decades, discussion on datasets, and potential research avenues - necessary to elucidate this topic to the reader in an encyclopedic manner.  ( 2 min )
    Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. (arXiv:2204.00598v2 [cs.CV] UPDATED)
    Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.
    Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos. (arXiv:2205.14065v1 [cs.CV])
    Unsupervised object-centric learning aims to represent the modular, compositional, and causal structure of a scene as a set of object representations and thereby promises to resolve many critical limitations of traditional single-vector representations such as poor systematic generalization. Although there have been many remarkable advances in recent years, one of the most critical problems in this direction has been that previous methods work only with simple and synthetic scenes but not with complex and naturalistic images or videos. In this paper, we propose STEVE, an unsupervised model for object-centric learning in videos. Our proposed model makes a significant advancement by demonstrating its effectiveness on various complex and naturalistic videos unprecedented in this line of research. Interestingly, this is achieved by neither adding complexity to the model architecture nor introducing a new objective or weak supervision. Rather, it is achieved by a surprisingly simple architecture that uses a transformer-based image decoder conditioned on slots and the learning objective is simply to reconstruct the observation. Our experiment results on various complex and naturalistic videos show significant improvements compared to the previous state-of-the-art.
    FairCanary: Rapid Continuous Explainable Fairness. (arXiv:2106.07057v2 [cs.LG] UPDATED)
    Systems that offer continuous model monitoring have emerged in response to (1) well-documented failures of deployed Machine Learning (ML) and Artificial Intelligence (AI) models and (2) new regulatory requirements impacting these models. Existing monitoring systems continuously track the performance of deployed ML models and compute feature importance (a.k.a. explanations) for each prediction to help developers identify the root causes of emergent model performance problems. We present Quantile Demographic Drift (QDD), a novel model bias quantification metric that uses quantile binning to measure differences in the overall prediction distributions over subgroups. QDD is ideal for continuous monitoring scenarios, does not suffer from the statistical limitations of conventional threshold-based bias metrics, and does not require outcome labels (which may not be available at runtime). We incorporate QDD into a continuous model monitoring system, called FairCanary, that reuses existing explanations computed for each individual prediction to quickly compute explanations for the QDD bias metrics. This optimization makes FairCanary an order of magnitude faster than previous work that has tried to generate feature-level bias explanations.
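    A rough sketch of a quantile-binned drift score in this spirit: compare two subgroups' prediction-score distributions quantile by quantile. The exact QDD definition in the paper may differ in its details, and the beta-distributed scores below are purely illustrative.

```python
import numpy as np

def quantile_distribution_difference(scores_a, scores_b, n_bins=10):
    """Per-quantile gap between two subgroups' prediction-score
    distributions: a sketch of a quantile-binned bias metric."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]          # inner quantiles
    return np.quantile(scores_a, qs) - np.quantile(scores_b, qs)

rng = np.random.default_rng(0)
group_a = rng.beta(2.0, 5.0, 2000)    # model scores for subgroup A
group_b = rng.beta(2.5, 5.0, 2000)    # slightly shifted for subgroup B
gaps = quantile_distribution_difference(group_a, group_b)
print(np.round(gaps, 3))              # nonzero gaps flag potential bias
```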
    Supervised Training of Siamese Spiking Neural Networks with Earth Mover's Distance. (arXiv:2203.13207v2 [cs.NE] UPDATED)
    This study adapts the highly-versatile siamese neural network model to the event data domain. We introduce a supervised training framework for optimizing Earth Mover's Distance (EMD) between spike trains with spiking neural networks (SNN). We train this model on images of the MNIST dataset converted into the spiking domain with novel conversion schemes. The quality of the siamese embeddings of input images was evaluated by measuring the classifier performance for different dataset coding types. The models achieved performance similar to existing SNN-based approaches (F1-score of up to 0.9386) while using only about 15% of hidden layer neurons to classify each example. Furthermore, models which did not employ a sparse neural code were about 45% slower than their sparse counterparts. These properties make the model suitable for low energy consumption and low prediction latency applications.
    VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. (arXiv:2111.02358v2 [cs.CV] UPDATED)
    We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval. The code and pretrained models are available at https://aka.ms/vlmo.
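    A minimal PyTorch sketch of the MoME idea as described: a shared self-attention layer followed by per-modality feed-forward experts. Dimensions, expert names, and routing below are illustrative simplifications, not the actual VLMo architecture.

```python
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    """Sketch of a Mixture-of-Modality-Experts block: self-attention is
    shared across modalities, while each modality routes through its
    own feed-forward expert."""
    def __init__(self, d=128, heads=4, experts=("vision", "language", "vl")):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                             nn.Linear(4 * d, d))
            for m in experts
        })
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, modality):
        h = self.n1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # shared attention
        x = x + self.ffn[modality](self.n2(x))             # modality expert
        return x

block = MoMEBlock()
img_tokens = torch.randn(2, 16, 128)
txt_tokens = torch.randn(2, 16, 128)
print(block(img_tokens, "vision").shape, block(txt_tokens, "language").shape)
```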
    ES-GNN: Generalizing Graph Neural Networks Beyond Homophily with Edge Splitting. (arXiv:2205.13700v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved enormous success in tackling analytical problems on graph data. Most GNNs interpret nearly all node connections as inductive bias with feature smoothness, and implicitly assume strong homophily on the observed graph. However, real-world networks are not always homophilic, but sometimes exhibit heterophilic patterns where adjacent nodes share dissimilar attributes and distinct labels. Therefore, GNNs that smooth node proximity holistically may aggregate inconsistent information arising from both task-relevant and irrelevant connections. In this paper, we propose a novel edge splitting GNN (ES-GNN) framework, which generalizes GNNs beyond homophily by jointly partitioning network topology and disentangling node features. Specifically, the proposed framework employs an interpretable operation to adaptively split the set of edges of the original graph into two exclusive sets indicating respectively the task-relevant and irrelevant relations among nodes. The node features are then aggregated separately on these two partial edge sets to produce disentangled representations, based on which a more accurate edge splitting can be attained later. Theoretically, we show that our ES-GNN can be regarded as a solution to a graph denoising problem with a disentangled smoothness assumption, which further illustrates our motivations and interprets the improved generalization. Extensive experiments over 8 benchmark datasets and 1 synthetic dataset demonstrate that ES-GNN not only outperforms state-of-the-art methods (including 8 GNN baselines), but is also more robust to adversarial graphs and alleviates the over-smoothing problem.
    On the Sample Complexity of Decentralized Linear Quadratic Regulator with Partially Nested Information Structure. (arXiv:2110.07112v2 [math.OC] UPDATED)
    We study the problem of control policy design for decentralized state-feedback linear quadratic control with a partially nested information structure, when the system model is unknown. We propose a model-based learning solution, which consists of two steps. First, we estimate the unknown system model from a single system trajectory of finite length, using least squares estimation. Next, based on the estimated system model, we design a control policy that satisfies the desired information structure. We show that the suboptimality gap between our control policy and the optimal decentralized control policy (designed using accurate knowledge of the system model) scales linearly with the estimation error of the system model. Using this result, we provide an end-to-end sample complexity result for learning decentralized controllers for a linear quadratic control problem with a partially nested information structure.
    Meta-Learning Adversarial Bandits. (arXiv:2205.14128v1 [cs.LG])
    We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure. As the first to target the adversarial setting, we design a unified meta-algorithm that yields setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-algorithm tunes the initialization, step-size, and entropy parameter of the Tsallis-entropy generalization of the well-known Exp3 method, with the task-averaged regret provably improving if the entropy of the distribution over estimated optima-in-hindsight is small. For BLO, we learn the initialization, step-size, and boundary-offset of online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with a measure induced by these functions on the interior of the action space. Our adaptive guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-convex sequence of affine functions of Bregman divergences that upper-bound the regret of OMD.  ( 2 min )
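    For readers unfamiliar with the base method, here is plain Exp3 in numpy. The meta-algorithm described above tunes the initialization, step size, and entropy parameter of a Tsallis-entropy generalization of this routine across tasks, which is beyond this sketch; the toy reward function is invented.

```python
import numpy as np

def exp3(n_arms, T, reward_fn, eta=0.1, seed=0):
    """Plain Exp3 for adversarial multi-armed bandits: exponential
    weights over importance-weighted reward estimates."""
    rng = np.random.default_rng(seed)
    log_w = np.zeros(n_arms)       # the initialization a meta-learner could tune
    total = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        arm = rng.choice(n_arms, p=p)
        r = reward_fn(t, arm)      # rewards assumed to lie in [0, 1]
        total += r
        log_w[arm] += eta * r / p[arm]   # unbiased importance-weighted estimate
    return total

# Toy nonstationary instance: the best arm switches mid-horizon.
reward = lambda t, a: float(a == (0 if t < 500 else 2))
print(exp3(n_arms=3, T=1000, reward_fn=reward))
```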
    Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications. (arXiv:2205.13963v1 [cs.DC])
    This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics.  ( 2 min )
    Surrogate modeling for Bayesian optimization beyond a single Gaussian process. (arXiv:2205.14090v1 [stat.ML])
    Bayesian optimization (BO) has well-documented merits for optimizing black-box functions with an expensive evaluation cost. Such functions emerge in applications as diverse as hyperparameter tuning, drug discovery, and robotics. BO hinges on a Bayesian surrogate model to sequentially select query points so as to balance exploration with exploitation of the search space. Most existing works rely on a single Gaussian process (GP) based surrogate model, where the kernel function form is typically preselected using domain knowledge. To bypass such a design process, this paper leverages an ensemble (E) of GPs to adaptively select the surrogate model fit on-the-fly, yielding a GP mixture posterior with enhanced expressiveness for the sought function. Acquisition of the next evaluation input using this EGP-based function posterior is then enabled by Thompson sampling (TS) that requires no additional design parameters. To endow function sampling with scalability, random feature-based kernel approximation is leveraged per GP model. The novel EGP-TS readily accommodates parallel operation. To further establish convergence of the proposed EGP-TS to the global optimum, analysis is conducted based on the notion of Bayesian regret for both sequential and parallel settings. Tests on synthetic functions and real-world applications showcase the merits of the proposed method.
    What Dense Graph Do You Need for Self-Attention? (arXiv:2205.14014v1 [cs.LG])
    Transformers have made progress on miscellaneous tasks, but suffer from quadratic computational and memory complexity. Recent works propose sparse Transformers with attention on sparse graphs to reduce complexity while retaining strong performance. While effective, the crucial question of how dense a graph needs to be to perform well is not fully explored. In this paper, we propose Normalized Information Payload (NIP), a graph scoring function measuring information transfer on a graph, which provides an analysis tool for trade-offs between performance and complexity. Guided by this theoretical analysis, we present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results than the vanilla Transformer while yielding $O(N\log N)$ complexity with sequence length $N$. Experiments on tasks requiring various sequence lengths validate our graph scoring function.  ( 2 min )
    Does Momentum Change the Implicit Regularization on Separable Data?. (arXiv:2110.03891v2 [cs.LG] UPDATED)
    The momentum acceleration technique is widely adopted in many optimization algorithms. However, there is no theoretical answer to how momentum affects the generalization performance of these optimization algorithms. This paper studies this problem by analyzing the implicit regularization of momentum-based optimization. We prove that, on the linear classification problem with separable data and exponential-tailed loss, gradient descent with momentum (GDM) converges to the L2 max-margin solution, which is the same as vanilla gradient descent. This means gradient descent with momentum acceleration still converges to a low-complexity model, which guarantees its generalization. We then analyze the stochastic and adaptive variants of GDM (i.e., SGDM and deterministic Adam) and show that they also converge to the L2 max-margin solution. Technically, to overcome the difficulty of error accumulation in analyzing the momentum, we construct new potential functions to analyze the gap between the model parameter and the max-margin solution. Numerical experiments are conducted and support our theoretical results.  ( 2 min )
    Dynamic Domain Generalization. (arXiv:2205.13913v1 [cs.LG])
    Domain generalization (DG) is a fundamental yet very challenging research topic in machine learning. Existing approaches mainly focus on learning domain-invariant features from limited source domains in a static model. Unfortunately, there is a lack of a training-free mechanism to adjust the model when it is generalized to agnostic target domains. To tackle this problem, we develop a brand-new DG variant, namely Dynamic Domain Generalization (DDG), in which the model learns to twist the network parameters to adapt to data from different domains. Specifically, we leverage a meta-adjuster to twist the network parameters based on the static model with respect to different data from different domains. In this way, the static model is optimized to learn domain-shared features, while the meta-adjuster is designed to learn domain-specific features. To enable this process, DomainMix is exploited to simulate data from diverse domains while teaching the meta-adjuster to adapt to upcoming agnostic target domains. This learning mechanism urges the model to generalize to different agnostic target domains by adjusting itself without training. Extensive experiments demonstrate the effectiveness of our proposed method. Code is available at: https://github.com/MetaVisionLab/DDG  ( 2 min )
    Representing Polymers as Periodic Graphs with Learned Descriptors for Accurate Polymer Property Predictions. (arXiv:2205.13757v1 [cond-mat.mtrl-sci])
    One of the grand challenges of utilizing machine learning for the discovery of innovative new polymers lies in the difficulty of accurately representing the complex structures of polymeric materials. Although a wide array of hand-designed polymer representations have been explored, there has yet to be an ideal solution for how to capture the periodicity of polymer structures, and how to develop polymer descriptors without the need for human feature design. In this work, we tackle these problems through the development of our periodic polymer graph representation. Our pipeline for polymer property predictions comprises our polymer graph representation, which naturally accounts for the periodicity of polymers, followed by a message-passing neural network (MPNN) that leverages the power of graph deep learning to automatically learn chemically-relevant polymer descriptors. Across a diverse dataset of 10 polymer properties, we find that this polymer graph representation consistently outperforms hand-designed representations with a 20% average reduction in prediction error. Our results illustrate how the incorporation of chemical intuition through directly encoding periodicity into our polymer graph representation leads to a considerable improvement in the accuracy and reliability of polymer property predictions. We also demonstrate how combining polymer graph representations with message-passing neural network architectures can automatically extract meaningful polymer features that are consistent with human intuition, while outperforming human-derived features. This work highlights the advancement in predictive capability that is possible if using chemical descriptors that are specifically optimized for capturing the unique chemical structure of polymers.
    Prototype Based Classification from Hierarchy to Fairness. (arXiv:2205.13997v1 [cs.LG])
    Artificial neural nets can represent and classify many types of data but are often tailored to particular applications -- e.g., for "fair" or "hierarchical" classification. Once an architecture has been selected, it is often difficult for humans to adjust models for a new task; for example, a hierarchical classifier cannot be easily transformed into a fair classifier that shields a protected field. Our contribution in this work is a new neural network architecture, the concept subspace network (CSN), which generalizes existing specialized classifiers to produce a unified model capable of learning a spectrum of multi-concept relationships. We demonstrate that CSNs reproduce state-of-the-art results in fair classification when enforcing concept independence, may be transformed into hierarchical classifiers, or even reconcile fairness and hierarchy within a single classifier. The CSN is inspired by existing prototype-based classifiers that promote interpretability.  ( 2 min )
    Double Deep Q Networks for Sensor Management in Space Situational Awareness. (arXiv:2205.14041v1 [cs.LG])
    We present a novel Double Deep Q Network (DDQN) application to a sensor management problem in space situational awareness (SSA). Frequent launches of satellites into Earth orbit pose a significant sensor management challenge, whereby a limited number of sensors are required to detect and track an increasing number of objects. In this paper, we demonstrate the use of reinforcement learning to develop a sensor management policy for SSA. We simulate a controllable Earth-based telescope, which is trained to maximise the number of satellites tracked using an extended Kalman filter. The estimated state covariance matrices for satellites observed under the DDQN policy are greatly reduced compared to those generated by an alternate (random) policy. This work provides the basis for further advancements and motivates the use of reinforcement learning for SSA.
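    The defining ingredient, the Double DQN target, is compact enough to show directly: the online network selects the next action and the target network evaluates it. The toy networks and shapes below are illustrative and unrelated to the SSA simulator in the paper.

```python
import torch

def double_dqn_targets(online_q, target_q, rewards, next_states, dones,
                       gamma=0.99):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it, which reduces the overestimation
    bias of vanilla Q-learning targets."""
    with torch.no_grad():
        next_actions = online_q(next_states).argmax(dim=1, keepdim=True)
        next_values = target_q(next_states).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_values

# Toy check: linear "networks" over a 4-dim state with 3 actions.
online, target = torch.nn.Linear(4, 3), torch.nn.Linear(4, 3)
s_next = torch.randn(8, 4)
r, done = torch.rand(8), torch.zeros(8)
print(double_dqn_targets(online, target, r, s_next, done).shape)  # (8,)
```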
    Robust Counterfactual Explanations for Random Forests. (arXiv:2205.14116v1 [cs.LG])
    Counterfactual explanations describe how to modify a feature vector in order to flip the outcome of a trained classifier. Several heuristic and optimal methods have been proposed to generate these explanations. However, the robustness of counterfactual explanations when the classifier is re-trained has yet to be studied. Our goal is to obtain counterfactual explanations for random forests that are robust to algorithmic uncertainty. We study the link between the robustness of ensemble models and the robustness of base learners and frame the generation of robust counterfactual explanations as a chance-constrained optimization problem. We develop a practical method with good empirical performance and provide finite-sample and asymptotic guarantees for simple random forests of stumps. We show that existing methods give surprisingly low robustness: the validity of naive counterfactuals is below $50\%$ on most data sets and can fall to $20\%$ on large problem instances with many features. Even with high plausibility, counterfactual explanations often exhibit low robustness to algorithmic uncertainty. In contrast, our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations. Furthermore, we highlight the connection between the robustness of counterfactual explanations and the predictive importance of features.  ( 2 min )
    Towards a Unified Framework for Uncertainty-aware Nonlinear Variable Selection with Theoretical Guarantees. (arXiv:2204.07293v2 [stat.ML] UPDATED)
    We develop a simple and unified framework for nonlinear variable selection that incorporates uncertainty in the prediction function and is compatible with a wide range of machine learning models (e.g., tree ensembles, kernel methods, and neural networks). In particular, for a learned nonlinear model $f(\mathbf{x})$, we consider quantifying the importance of an input variable $\mathbf{x}^j$ using the integrated partial derivative $\Psi_j = \Vert \frac{\partial}{\partial \mathbf{x}^j} f(\mathbf{x})\Vert^2_{P_\mathcal{X}}$. We then (1) provide a principled approach for quantifying variable selection uncertainty by deriving its posterior distribution, and (2) show that the approach is generalizable even to non-differentiable models such as tree ensembles. Rigorous Bayesian nonparametric theorems are derived to guarantee the posterior consistency and asymptotic uncertainty of the proposed approach. Extensive simulations and experiments on healthcare benchmark datasets confirm that the proposed algorithm outperforms existing classic and recent variable selection methods.  ( 2 min )
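    As a minimal sketch of the importance measure, assuming a differentiable PyTorch model with scalar outputs, $\Psi_j$ can be estimated by Monte-Carlo averaging squared partial derivatives over the data (the helper name is ours; the paper's posterior-based uncertainty quantification is not shown).

        import torch

        def integrated_partial_derivatives(f, X):
            # Estimate Psi_j ~= mean over rows of (df/dx^j)^2, one value per column.
            X = X.clone().requires_grad_(True)
            y = f(X).sum()  # summing lets one backward pass yield every row's gradient
            (grads,) = torch.autograd.grad(y, X)
            return (grads ** 2).mean(dim=0)

        # usage: psi = integrated_partial_derivatives(model, X_train)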
    Semantic Exploration from Language Abstractions and Pretrained Representations. (arXiv:2204.05080v2 [cs.LG] UPDATED)
    Effective exploration is a challenge in reinforcement learning (RL). Novelty-based exploration methods can suffer in high-dimensional state spaces, such as continuous partially-observable 3D environments. We address this challenge by defining novelty using semantically meaningful state abstractions, which can be found in learned representations shaped by natural language. In particular, we evaluate vision-language representations, pretrained on natural image captioning datasets. We show that these pretrained representations drive meaningful, task-relevant exploration and improve performance on 3D simulated environments. We also characterize why and how language provides useful abstractions for exploration by considering the impacts of using representations from a pretrained model, a language oracle, and several ablations. We demonstrate the benefits of our approach in two very different task domains -- one that stresses the identification and manipulation of everyday objects, and one that requires navigational exploration in an expansive world -- as well as two popular deep RL algorithms: Impala and R2D2. Our results suggest that using language-shaped representations could improve exploration for various algorithms and agents in challenging environments.  ( 2 min )
    Big-means: Less is More for K-means Clustering. (arXiv:2204.07485v2 [cs.LG] UPDATED)
    K-means clustering plays a vital role in data mining. However, its performance drastically drops when applied to huge amounts of data. We propose a new heuristic that is built on the basis of regular K-means for faster and more accurate big data clustering using the "less is more" and decomposition approaches. The main advantage of the proposed algorithm is that it naturally turns the K-means local search into a global one through the process of decomposition of the minimum sum-of-squares clustering (MSSC) problem. On one hand, decomposition of the MSSC problem into smaller subproblems reduces the computational complexity and allows for their parallel processing. On the other hand, the MSSC decomposition provides a new method for the natural data-driven shaking of the incumbent solution while introducing a new neighborhood structure for the solution of the MSSC problem. The proposed algorithm is scalable, fast, and accurate. The scalability of the algorithm can be easily adjusted by choosing the appropriate number of subproblems and their size. In our experiments it outperforms all recent state-of-the-art algorithms for the MSSC in both time and solution quality.  ( 2 min )
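    A rough sketch of the decomposition idea as we read the abstract (not the authors' exact procedure): regular K-means is run on small random subsamples, warm-started from the incumbent centers, and the best-scoring centers are kept.

        import numpy as np
        from sklearn.cluster import KMeans

        def big_means(X, k, sample_size=10_000, n_iters=20, seed=0):
            rng = np.random.default_rng(seed)
            centers, best = None, np.inf
            for _ in range(n_iters):
                idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
                # Warm-start each subproblem from the incumbent solution.
                init = centers if centers is not None else "k-means++"
                km = KMeans(n_clusters=k, init=init, n_init=1).fit(X[idx])
                if km.inertia_ < best:  # objectives comparable across equal-size samples
                    centers, best = km.cluster_centers_, km.inertia_
            return centers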
    TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets. (arXiv:2204.07615v2 [cs.LG] UPDATED)
    The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.  ( 2 min )
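    A sketch of the rejection step under our reading of the abstract (`controller_sample` and `resource_cost` are hypothetical stand-ins for the RL controller and the cost model):

        def sample_feasible(controller_sample, resource_cost, budget, max_tries=1000):
            # Architectures violating the constraint are discarded immediately,
            # with no training and no policy update; the paper's Monte-Carlo
            # correction then re-weights the policy gradient for the rejected mass.
            for n_rejected in range(max_tries):
                arch = controller_sample()
                if resource_cost(arch) <= budget:
                    return arch, n_rejected
            raise RuntimeError("no feasible architecture found within max_tries")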
    Fairness and Welfare Quantification for Regret in Multi-Armed Bandits. (arXiv:2205.13930v1 [cs.LG])
    We extend the notion of regret with a welfarist perspective. Focussing on the classic multi-armed bandit (MAB) framework, the current work quantifies the performance of bandit algorithms by applying a fundamental welfare function, namely the Nash social welfare (NSW) function. This corresponds to equating the algorithm's performance to the geometric mean of its expected rewards and leads us to the study of Nash regret, defined as the difference between the -- a priori unknown -- optimal mean (among the arms) and the algorithm's performance. Since NSW is known to satisfy fairness axioms, our approach complements the utilitarian considerations of average (cumulative) regret, wherein the algorithm is evaluated via the arithmetic mean of its expected rewards. This work develops an algorithm that, given the horizon of play $T$, achieves a Nash regret of $O \left( \sqrt{\frac{{k \log T}}{T}} \right)$, where $k$ denotes the number of arms in the MAB instance. Since, for any algorithm, the Nash regret is at least as much as its average regret (by the AM-GM inequality), the known lower bound on average regret holds for Nash regret as well. Therefore, our Nash regret guarantee is essentially tight. In addition, we develop an anytime algorithm with a Nash regret guarantee of $O \left( \sqrt{\frac{{k\log T}}{T}} \log T \right)$.  ( 2 min )
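    As a toy numeric illustration of the definition (the algorithm itself is not shown):

        import numpy as np

        def nash_regret(arm_means, expected_rewards_per_round):
            # Nash regret = best arm's mean minus the geometric mean (the NSW)
            # of the algorithm's expected rewards over the horizon.
            geo_mean = np.exp(np.mean(np.log(expected_rewards_per_round)))
            return max(arm_means) - geo_mean

        # e.g., exploring a 0.5-mean arm for 2 of 10 rounds, then playing the 0.9 arm:
        print(nash_regret([0.9, 0.5], [0.5] * 2 + [0.9] * 8))  # ~0.10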
    ProtoFSSL: Federated Semi-Supervised Learning with Prototype-based Consistency Regularization. (arXiv:2205.13921v1 [cs.LG])
    With the increasing computing power of edge devices, Federated Learning (FL) emerges to enable model training without privacy concerns. The majority of existing studies assume the data are fully labeled on the client side. In practice, however, the amount of labeled data is often limited. Recently, federated semi-supervised learning (FSSL) has been explored as a way to effectively utilize unlabeled data during training. In this work, we propose ProtoFSSL, a novel FSSL approach based on prototypical networks. In ProtoFSSL, clients share knowledge with each other via lightweight prototypes, which prevents the local models from diverging. For computing loss on unlabeled data, each client creates accurate pseudo-labels based on shared prototypes. Jointly with labeled data, the pseudo-labels provide training signals for local prototypes. Compared to an FSSL approach based on weight sharing, the prototype-based inter-client knowledge sharing significantly reduces both communication and computation costs, enabling more frequent knowledge sharing between more clients for better accuracy. In multiple datasets, ProtoFSSL results in higher accuracy compared to the recent FSSL methods with and without knowledge sharing, such as FixMatch, FedRGD, and FedMatch. On the SVHN dataset, ProtoFSSL performs comparably to fully supervised FL methods.  ( 2 min )
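    A sketch of the prototype-based pseudo-labelling step, assuming embeddings and class prototypes as PyTorch tensors (the confidence threshold is a hypothetical knob, not from the paper):

        import torch.nn.functional as F

        def prototype_pseudo_labels(embeddings, prototypes, threshold=0.8):
            # Cosine similarity of each unlabeled embedding to each shared prototype.
            sims = F.normalize(embeddings, dim=1) @ F.normalize(prototypes, dim=1).T
            conf, labels = sims.softmax(dim=1).max(dim=1)
            mask = conf > threshold  # keep only confident pseudo-labels
            return labels[mask], mask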
    GALAXY: Graph-based Active Learning at the Extreme. (arXiv:2202.01402v2 [cs.LG] UPDATED)
    Active learning is a label-efficient approach to train highly effective models while interactively selecting only small subsets of unlabelled data for labelling and training. In "open world" settings, the classes of interest can make up a small fraction of the overall dataset -- most of the data may be viewed as an out-of-distribution or irrelevant class. This leads to extreme class-imbalance, and our theory and methods focus on this core issue. We propose a new strategy for active learning called GALAXY (Graph-based Active Learning At the eXtrEme), which blends ideas from graph-based active learning and deep learning. GALAXY automatically and adaptively selects more class-balanced examples for labeling than most other methods for active learning. Our theory shows that GALAXY performs a refined form of uncertainty sampling that gathers a much more class-balanced dataset than vanilla uncertainty sampling. Experimentally, we demonstrate GALAXY's superiority over existing state-of-the-art deep active learning algorithms in unbalanced vision classification settings generated from popular datasets.  ( 2 min )
    Rethinking ValueDice: Does It Really Improve Performance?. (arXiv:2202.02468v2 [cs.LG] UPDATED)
    Since the introduction of GAIL, adversarial imitation learning (AIL) methods have attracted substantial research interest. Among these methods, ValueDice has achieved significant improvements: it beats the classical approach Behavioral Cloning (BC) under the offline setting, and it requires fewer interactions than GAIL under the online setting. Do these improvements stem from more advanced algorithm designs? We answer this question with the following conclusions. First, we show that ValueDice can reduce to BC under the offline setting. Second, we verify that overfitting exists and regularization matters in the low-data regime. Specifically, we demonstrate that with weight decay, BC also nearly matches the expert performance, as ValueDice does. The first two claims explain the superior offline performance of ValueDice. Third, we establish that ValueDice does not work when the expert trajectory is subsampled. Instead, the reported success of ValueDice holds when the expert trajectory is complete, in which case ValueDice is closely related to BC, which, as noted, performs well. Finally, we discuss the implications of our research for imitation learning studies beyond ValueDice.  ( 2 min )
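    For reference, the regularized baseline that the paper shows nearly matches ValueDice offline is just behavioral cloning with weight decay; a minimal continuous-action sketch (all hyperparameters illustrative):

        import torch
        import torch.nn.functional as F

        def behavioral_cloning(policy, expert_states, expert_actions,
                               weight_decay=1e-4, lr=1e-3, epochs=100):
            # Weight decay is the regularizer that closes most of the gap to
            # ValueDice in the low-data offline regime.
            opt = torch.optim.Adam(policy.parameters(), lr=lr,
                                   weight_decay=weight_decay)
            for _ in range(epochs):
                loss = F.mse_loss(policy(expert_states), expert_actions)
                opt.zero_grad()
                loss.backward()
                opt.step()
            return policy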
    Minimax Regret for Cascading Bandits. (arXiv:2203.12577v2 [cs.LG] UPDATED)
    Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where the variance-aware algorithms only improve constants. In light of this and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show its regret strictly improves the state-of-the-art.  ( 2 min )
    Intelligent Transportation Systems' Orchestration: Lessons Learned & Potential Opportunities. (arXiv:2205.14040v1 [cs.NI])
    The growing global deployment of 5G networks has accelerated the digital transformation of businesses and services, creating a need for new communication technologies that can sustain this transformation. 6G is being proposed as the set of technologies and architectures that will achieve this target. Among the main use cases that have emerged for 5G networks and will continue to play a pivotal role in 6G networks is that of Intelligent Transportation Systems (ITSs). With all the projected benefits of developing and deploying efficient and effective ITSs comes a group of unique challenges that need to be addressed. One prominent challenge is ITS orchestration, due to the various supporting technologies and heterogeneous networks used to offer the desired ITS applications/services. To that end, this paper focuses on the ITS orchestration challenge in detail by highlighting the related previous works from the literature and listing the lessons learned from current ITS deployment orchestration efforts. It also presents multiple potential data-driven research opportunities in which paradigms such as reinforcement learning and federated learning can be deployed to offer effective and efficient ITS orchestration.
    TimeREISE: Time-series Randomized Evolving Input Sample Explanation. (arXiv:2202.07952v2 [cs.LG] UPDATED)
    Deep neural networks are among the most successful classifiers across different domains. However, their limited interpretability restricts their use in safety-critical contexts. The research field of explainable artificial intelligence addresses this problem, yet most interpretability methods are designed for the image modality. This paper introduces TimeREISE, a model-agnostic attribution method designed specifically for time series classification. The method shows superior performance compared to existing approaches on several well-established metrics. TimeREISE is applicable to any time series classification network, its runtime does not scale linearly with the input shape, and it does not rely on prior data knowledge.  ( 2 min )
    Lifting the Information Ratio: An Information-Theoretic Analysis of Thompson Sampling for Contextual Bandits. (arXiv:2205.13924v1 [cs.LG])
    We study the Bayesian regret of the renowned Thompson Sampling algorithm in contextual bandits with binary losses and adversarially-selected contexts. We adapt the information-theoretic perspective of Russo and Van Roy [2016] to the contextual setting by introducing a new concept of information ratio based on the mutual information between the unknown model parameter and the observed loss. This allows us to bound the regret in terms of the entropy of the prior distribution through a remarkably simple proof, and with no structural assumptions on the likelihood or the prior. The extension to priors with infinite entropy only requires a Lipschitz assumption on the log-likelihood. An interesting special case is that of logistic bandits with $d$-dimensional parameters, $K$ actions, and Lipschitz logits, for which we provide a $\widetilde{O}(\sqrt{dKT})$ regret upper-bound that does not depend on the smallest slope of the sigmoid link function.
    AANG: Automating Auxiliary Learning. (arXiv:2205.14082v1 [cs.LG])
    When faced with data-starved or highly complex end-tasks, it is commonplace for machine learning practitioners to introduce auxiliary objectives as supplementary learning signals. Whilst much work has been done to formulate useful auxiliary objectives, their construction is still an art which proceeds by slow and tedious hand-design. Intuitions about how and when these objectives improve end-task performance have also had limited theoretical backing. In this work, we present an approach for automatically generating a suite of auxiliary objectives. We achieve this by deconstructing existing objectives within a novel unified taxonomy, identifying connections between them, and generating new ones based on the uncovered structure. Next, we theoretically formalize widely-held intuitions about how auxiliary learning improves generalization of the end-task. This leads us to a principled and efficient algorithm for searching the space of generated objectives to find those most useful to a specified end-task. With natural language processing (NLP) as our domain of study, we empirically verify that our automated auxiliary learning pipeline leads to strong improvements over competitive baselines across continued training experiments on a pre-trained model on 5 NLP end-tasks.
    MyoSuite -- A contact-rich simulation suite for musculoskeletal motor control. (arXiv:2205.13600v1 [cs.RO])
    Embodied agents in continuous control domains have had limited exposure to tasks that allow exploring the musculoskeletal properties enabling agile and nimble behaviors in biological beings. The sophistication behind neuro-musculoskeletal control can pose new challenges for the motor learning community. At the same time, agents that solve complex neural control problems can have impact in fields such as neuro-rehabilitation and collaborative robotics. Human biomechanics underlies complex multi-joint, multi-actuator musculoskeletal systems. The sensory-motor system relies on a range of contact-rich sensory and proprioceptive inputs that define and condition the muscle actuation required to exhibit intelligent behaviors in the physical world. Current frameworks for musculoskeletal control do not support the physiological sophistication of musculoskeletal systems along with physical-world interaction capabilities. In addition, they are neither embedded in complex and skillful motor tasks nor computationally efficient and scalable enough to study large-scale learning paradigms. Here, we present MyoSuite -- a suite of physiologically accurate biomechanical models of the elbow, wrist, and hand, with physical contact capabilities, which allow learning of complex and skillful contact-rich real-world tasks. We provide diverse motor-control challenges: from simple postural control to skilled hand-object interactions such as turning a key, twirling a pen, or rotating two balls in one hand. By supporting physiological alterations in musculoskeletal geometry (tendon transfer), assistive devices (exoskeleton assistance), and muscle contraction dynamics (muscle fatigue, sarcopenia), we present real-life tasks with temporal changes, thereby exposing the realistic non-stationary conditions that most continuous control benchmarks lack.
    Fast variable selection makes scalable Gaussian process BSS-ANOVA a speedy and accurate choice for tabular and time series regression. (arXiv:2205.13676v1 [cs.LG])
    Gaussian processes (GPs) are non-parametric regression engines with a long history. They are often overlooked in modern machine learning contexts because of scalability issues: regression with traditional GP kernels is $\mathcal{O}(N^3)$, where $N$ is the size of the dataset. One of a number of scalable GP approaches is the Karhunen-Lo\'eve (KL) decomposed kernel BSS-ANOVA, developed in 2009. It is $\mathcal{O}(NP)$ in training and $\mathcal{O}(P)$ per point in prediction, where $P$ is the number of terms in the ANOVA / KL expansion. A new method of forward variable selection quickly and effectively limits the number of terms, yielding a method whose accuracy, training time, and inference time are competitive on large tabular datasets. The new algorithm balances model fidelity with model complexity using Bayesian and Akaike information criteria (BIC/AIC). The inference speed and accuracy make the method especially useful for modeling dynamic systems in a model-free manner: the derivative in a dynamic system is modeled as a static problem, and the learned dynamics are then integrated using a high-order scheme. The methods are demonstrated on a `Susceptible, Infected, Recovered' (SIR) toy problem, with the transmissibility used as a forcing function, along with the `Cascaded Tanks' benchmark dataset. Comparisons on the static prediction of derivatives are made with a Random Forest and a Residual Neural Network, while for the time-series prediction comparisons are made with LSTM and GRU recurrent neural networks. The GP outperforms the other methods in all modeling tasks on accuracy, while (in the case of the neural networks) performing many orders of magnitude fewer calculations. For the SIR test, which involved prediction for a set of forcing functions qualitatively different from those appearing in the training set, the GP captured the correct dynamics while the neural networks failed to do so.
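    A sketch of the model-free dynamics recipe described above, assuming any fitted regressor with a scikit-learn-style `predict` (the GP itself is not reproduced); a high-order SciPy scheme integrates the learned derivative:

        import numpy as np
        from scipy.integrate import solve_ivp

        def rollout(deriv_model, x0, t_span, forcing):
            # The static regressor predicts dx/dt from the current state and the
            # forcing input; DOP853 is an 8th-order explicit Runge-Kutta scheme.
            def rhs(t, x):
                features = np.concatenate([x, [forcing(t)]])[None, :]
                return deriv_model.predict(features).ravel()
            return solve_ivp(rhs, t_span, x0, method="DOP853", dense_output=True)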
    Learning Dense Reward with Temporal Variant Self-Supervision. (arXiv:2205.10431v2 [cs.LG] UPDATED)
    Rewards play an essential role in reinforcement learning. In contrast to rule-based game environments with well-defined reward functions, complex real-world robotic applications, such as contact-rich manipulation, lack explicit and informative descriptions that can directly be used as a reward. Previous work has shown that it is possible to algorithmically extract dense rewards directly from multimodal observations. In this paper, we extend this effort by proposing a more efficient and robust way of sampling and learning. In particular, our sampling approach utilizes temporal variance to simulate the fluctuating state and action distribution of a manipulation task. We then propose a network architecture for self-supervised learning to better incorporate temporal information in latent representations. We tested our approach in two experimental setups, namely joint-assembly and door-opening. Preliminary results show that our approach is effective and efficient in learning dense rewards, and the learned rewards lead to faster convergence than baselines.
    Classification of Long Sequential Data using Circular Dilated Convolutional Neural Networks. (arXiv:2201.02143v2 [cs.LG] UPDATED)
    Classification of long sequential data is an important Machine Learning task and appears in many application scenarios. Recurrent Neural Networks, Transformers, and Convolutional Neural Networks are three major techniques for learning from sequential data. Among these methods, Temporal Convolutional Networks (TCNs) which are scalable to very long sequences have achieved remarkable progress in time series regression. However, the performance of TCNs for sequence classification is not satisfactory because they use a skewed connection protocol and output classes at the last position. Such asymmetry restricts their performance for classification which depends on the whole sequence. In this work, we propose a symmetric multi-scale architecture called Circular Dilated Convolutional Neural Network (CDIL-CNN), where every position has an equal chance to receive information from other positions at the previous layers. Our model gives classification logits in all positions, and we can apply a simple ensemble learning to achieve a better decision. We have tested CDIL-CNN on various long sequential datasets. The experimental results show that our method has superior performance over many state-of-the-art approaches.
    Quark: Controllable Text Generation with Reinforced Unlearning. (arXiv:2205.13636v1 [cs.CL])
    Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may contain offensive or toxic language, contain significant repetition, or be of a different sentiment than desired by the user. We consider the task of unlearning these misalignments by fine-tuning the language model on signals of what not to do. We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property, while not straying too far from the original model. Quark alternates between (i) collecting samples with the current language model, (ii) sorting them into quantiles based on reward, with each quantile identified by a reward token prepended to the language model's input, and (iii) using a standard language modeling loss on samples from each quantile conditioned on its reward token, while remaining nearby the original language model via a KL-divergence penalty. By conditioning on a high-reward token at generation time, the model generates text that exhibits less of the unwanted property. For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods like PPO (Schulman et al. 2017), while relying only on standard language modeling primitives.  ( 2 min )
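    The quantile-token step (ii) can be sketched in a few lines; the token strings here are illustrative, not the paper's vocabulary:

        import numpy as np

        def reward_tokens(rewards, n_quantiles=5):
            # Sort samples into reward quantiles; each quantile gets a token that
            # is prepended to the language model's input during training.
            edges = np.quantile(rewards, np.linspace(0, 1, n_quantiles + 1)[1:-1])
            bins = np.digitize(rewards, edges)  # 0 = lowest-reward quantile
            return [f"<rk_{b}>" for b in bins]

        # At generation time, conditioning on the highest-reward token steers
        # the model away from the unwanted property.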
    U-NO: U-shaped Neural Operators. (arXiv:2204.11127v2 [cs.LG] UPDATED)
    Neural operators generalize classical neural networks to maps between infinite-dimensional spaces, e.g., function spaces. Prior works on neural operators proposed a series of novel architectures to learn such maps and demonstrated unprecedented success in learning solution operators of partial differential equations. Due to their close proximity to fully connected architectures, these models mainly suffer from high memory usage and are generally limited to shallow architectures. In this paper, we propose U-shaped Neural Operator (U-NO), a U-shaped memory-enhanced architecture that allows for deeper neural operators. U-NOs exploit the problem structure in function predictions and demonstrate fast training, data efficiency, and robustness with respect to hyperparameter choices. We study the performance of U-NO on PDE benchmarks, namely, Darcy's flow law and the Navier-Stokes equations. We show that U-NO results in an average of 14% and 34% prediction improvement on Darcy's flow and turbulent Navier-Stokes equations, respectively, over the state of the art. On the Navier-Stokes 3D spatio-temporal operator learning task, we show U-NO provides a 40% improvement over state-of-the-art methods.
    Auditing Differential Privacy in High Dimensions with the Kernel Quantum R\'enyi Divergence. (arXiv:2205.13941v1 [cs.LG])
    Differential privacy (DP) is the de facto standard for private data release and private machine learning. Auditing black-box DP algorithms and mechanisms to certify whether they satisfy a certain DP guarantee is challenging, especially in high dimension. We propose relaxations of differential privacy based on new divergences on probability distributions: the kernel R\'enyi divergence and its regularized version. We show that the regularized kernel R\'enyi divergence can be estimated from samples even in high dimensions, giving rise to auditing procedures for $\varepsilon$-DP, $(\varepsilon,\delta)$-DP and $(\alpha,\varepsilon)$-R\'enyi DP.
    Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power. (arXiv:2205.13863v1 [cs.LG])
    It is well-known that modern neural networks are vulnerable to adversarial examples. To mitigate this problem, a series of robust learning algorithms have been proposed. However, although the robust training error can be near zero via some methods, all existing algorithms lead to a high robust generalization error. In this paper, we provide a theoretical understanding of this puzzling phenomenon from the perspective of expressive power for deep neural networks. Specifically, for binary classification problems with well-separated data, we show that, for ReLU networks, while mild over-parameterization is sufficient for high robust training accuracy, there exists a constant robust generalization gap unless the size of the neural network is exponential in the data dimension $d$. Even if the data is linearly separable, which means achieving low clean generalization error is easy, we can still prove an $\exp({\Omega}(d))$ lower bound for robust generalization. Moreover, we establish an improved upper bound of $\exp({\mathcal{O}}(k))$ for the network size to achieve low robust generalization error when the data lies on a manifold with intrinsic dimension $k$ ($k \ll d$). Nonetheless, we also have a lower bound that grows exponentially with respect to $k$ -- the curse of dimensionality is inevitable. By demonstrating an exponential separation between the network size for achieving low robust training and generalization error, our results reveal that the hardness of robust generalization may stem from the expressive power of practical models.
    Automated Dynamic Algorithm Configuration. (arXiv:2205.13881v1 [cs.AI])
    The performance of an algorithm often critically depends on its parameter configuration. While a variety of automated algorithm configuration methods have been proposed to relieve users from the tedious and error-prone task of manually tuning parameters, there is still a lot of untapped potential as the learned configuration is static, i.e., parameter settings remain fixed throughout the run. However, it has been shown that some algorithm parameters are best adjusted dynamically during execution, e.g., to adapt to the current part of the optimization landscape. Thus far, this is most commonly achieved through hand-crafted heuristics. A promising recent alternative is to automatically learn such dynamic parameter adaptation policies from data. In this article, we give the first comprehensive account of this new field of automated dynamic algorithm configuration (DAC), present a series of recent advances, and provide a solid foundation for future research in this field. Specifically, we (i) situate DAC in the broader historical context of AI research; (ii) formalize DAC as a computational problem; (iii) identify the methods used in prior-art to tackle this problem; (iv) conduct empirical case studies for using DAC in evolutionary optimization, AI planning, and machine learning.
    Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks. (arXiv:2109.08844v2 [stat.ML] UPDATED)
    We study the problem of estimating an unknown function from noisy data using shallow ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the second-order Radon-domain bounded variation space. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. This minimax rate is immune to the curse of dimensionality. We quantify an explicit gap between neural networks and linear methods (which include kernel methods) by deriving a linear minimax lower bound for the estimation problem, showing that linear methods necessarily suffer the curse of dimensionality in this function space. As a result, this paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.
    Neural Basis Models for Interpretability. (arXiv:2205.14120v1 [cs.LG])
    Due to the widespread use of complex machine learning models in real-world applications, it is becoming critical to explain model predictions. However, these models are typically black-box deep neural networks, explained post-hoc via methods with known faithfulness limitations. Generalized Additive Models (GAMs) are an inherently interpretable class of models that address this limitation by learning a non-linear shape function for each feature separately, followed by a linear model on top. However, these models are typically difficult to train, require numerous parameters, and are hard to scale. We propose an entirely new subfamily of GAMs that utilizes basis decomposition of shape functions. A small number of basis functions are shared among all features, and are learned jointly for a given task, thus making our model scale much better to large-scale data with high-dimensional features, especially when features are sparse. We propose an architecture denoted as the Neural Basis Model (NBM) which uses a single neural network to learn these bases. On a variety of tabular and image datasets, we demonstrate that for interpretable machine learning, NBMs are the state-of-the-art in accuracy, model size, and throughput, and can easily model all higher-order feature interactions.
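    A minimal sketch of the basis-decomposition idea (sizes illustrative, not the paper's): one small network maps each scalar feature to B shared basis values, and each feature owns only a B-vector of coefficients.

        import torch
        from torch import nn

        class NeuralBasisModel(nn.Module):
            def __init__(self, n_features, n_bases=16, hidden=64, n_classes=1):
                super().__init__()
                # One shared network computes the B bases for every feature.
                self.basis = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                           nn.Linear(hidden, n_bases))
                self.coef = nn.Parameter(torch.zeros(n_features, n_bases, n_classes))

            def forward(self, x):                # x: (batch, n_features)
                b = self.basis(x.unsqueeze(-1))  # (batch, n_features, n_bases)
                shape = torch.einsum("nfb,fbc->nfc", b, self.coef)  # per-feature shapes
                return shape.sum(dim=1)          # additive: sum over features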
    Hybrid training of optical neural networks. (arXiv:2203.11207v2 [cs.LG] UPDATED)
    Optical neural networks are emerging as a promising type of machine learning hardware capable of energy-efficient, parallel computation. Today's optical neural networks are mainly developed to perform optical inference after in silico training on digital simulators. However, various physical imperfections that cannot be accurately modelled may lead to the notorious reality gap between the digital simulator and the physical system. To address this challenge, we demonstrate hybrid training of optical neural networks where the weight matrix is trained with neuron activation functions computed optically via forward propagation through the network. We examine the efficacy of hybrid training with three different networks: an optical linear classifier, a hybrid opto-electronic network, and a complex-valued optical network. We perform a comparative study to in silico training, and our results show that hybrid training is robust against different kinds of static noise. Our platform-agnostic hybrid training scheme can be applied to a wide variety of optical neural networks, and this work paves the way towards advanced all-optical training in machine intelligence.  ( 2 min )
    Generalization Bounds for Gradient Methods via Discrete and Continuous Prior. (arXiv:2205.13799v1 [cs.LG])
    Proving algorithm-dependent generalization error bounds for gradient-type optimization methods has attracted significant attention recently in learning theory. However, most existing trajectory-based analyses require either restrictive assumptions on the learning rate (e.g., fast decreasing learning rate), or continuous injected noise (such as the Gaussian noise in Langevin dynamics). In this paper, we introduce a new discrete data-dependent prior to the PAC-Bayesian framework, and prove a high probability generalization bound of order $O(\frac{1}{n}\cdot \sum_{t=1}^T(\gamma_t/\varepsilon_t)^2\left\|{\mathbf{g}_t}\right\|^2)$ for Floored GD (i.e. a version of gradient descent with precision level $\varepsilon_t$), where $n$ is the number of training samples, $\gamma_t$ is the learning rate at step $t$, and $\mathbf{g}_t$ is roughly the difference between the gradient computed using all samples and that computed using only prior samples. $\left\|{\mathbf{g}_t}\right\|$ is upper bounded by, and typically much smaller than, the gradient norm $\left\|{\nabla f(W_t)}\right\|$. We remark that our bound holds for nonconvex and nonsmooth scenarios. Moreover, our theoretical results provide numerically favorable upper bounds of testing errors (e.g., $0.037$ on MNIST). Using a similar technique, we can also obtain new generalization bounds for certain variants of SGD. Furthermore, we study the generalization bounds for gradient Langevin Dynamics (GLD). Using the same framework with a carefully constructed continuous prior, we show a new high probability generalization bound of order $O(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^T(\gamma_t/\sigma_t)^2)$ for GLD. The new $1/n^2$ rate is due to the concentration of the difference between the gradient of training samples and that of the prior.
    Reinforcement Learning Approach for Mapping Applications to Dataflow-Based Coarse-Grained Reconfigurable Array. (arXiv:2205.13675v1 [cs.AR])
    The Streaming Engine (SE) is a Coarse-Grained Reconfigurable Array which provides programming flexibility and high-performance with energy efficiency. An application program to be executed on the SE is represented as a combination of Synchronous Data Flow (SDF) graphs, where every instruction is represented as a node. Each node needs to be mapped to the right slot and array in the SE to ensure the correct execution of the program. This creates an optimization problem with a vast and sparse search space for which finding a mapping manually is impractical because it requires expertise and knowledge of the SE micro-architecture. In this work we propose a Reinforcement Learning framework with Global Graph Attention (GGA) module and output masking of invalid placements to find and optimize instruction schedules. We use Proximal Policy Optimization in order to train a model which places operations into the SE tiles based on a reward function that models the SE device and its constraints. The GGA module consists of a graph neural network and an attention module. The graph neural network creates embeddings of the SDFs and the attention block is used to model sequential operation placement. We show results on how certain workloads are mapped to the SE and the factors affecting mapping quality. We find that the addition of GGA, on average, finds 10% better instruction schedules in terms of total clock cycles taken and masking improves reward obtained by 20%.  ( 2 min )
    EvenNet: Ignoring Odd-Hop Neighbors Improves Robustness of Graph Neural Networks. (arXiv:2205.13892v1 [cs.LG])
    Graph Neural Networks (GNNs) have received extensive research attention for their promising performance in graph machine learning. Despite their extraordinary predictive accuracy, existing approaches, such as GCN and GPRGNN, are not robust in the face of homophily changes on test graphs, rendering these models vulnerable to graph structural attacks and with limited capacity in generalizing to graphs of varied homophily levels. Although many methods have been proposed to improve the robustness of GNN models, most of these techniques are restricted to the spatial domain and employ complicated defense mechanisms, such as learning new graph structures or calculating edge attentions. In this paper, we study the problem of designing simple and robust GNN models in the spectral domain. We propose EvenNet, a spectral GNN corresponding to an even-polynomial graph filter. Based on our theoretical analysis in both spatial and spectral domains, we demonstrate that EvenNet outperforms full-order models in generalizing across homophilic and heterophilic graphs, implying that ignoring odd-hop neighbors improves the robustness of GNNs. We conduct experiments on both synthetic and real-world datasets to demonstrate the effectiveness of EvenNet. Notably, EvenNet outperforms existing defense models against structural attacks without introducing additional computational costs and maintains competitiveness in traditional node classification tasks on homophilic and heterophilic graphs.
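    The even-polynomial filter can be sketched directly: only even powers of the normalized adjacency propagate information, so odd-hop neighbors never contribute (the coefficients are learnable in the actual model; fixed here for illustration).

        def even_filter(A_norm, X, thetas):
            # Output sum_i thetas[i] * A^{2i} X -- even-hop propagation only.
            # A_norm: normalized adjacency matrix, X: node features (as tensors/arrays).
            out, H = thetas[0] * X, X
            A2 = A_norm @ A_norm  # one even-hop step
            for theta in thetas[1:]:
                H = A2 @ H
                out = out + theta * H
            return out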
    Scalable Interpretability via Polynomials. (arXiv:2205.14108v1 [cs.LG])
    Generalized Additive Models (GAMs) have quickly become the leading choice for fully-interpretable machine learning. However, unlike uninterpretable methods such as DNNs, they lack expressive power and easy scalability, and are hence not a feasible alternative for real-world tasks. We present a new class of GAMs that use tensor rank decompositions of polynomials to learn powerful, $\textit{fully-interpretable}$ models. Our approach, titled Scalable Polynomial Additive Models (SPAM), is effortlessly scalable and models $\textit{all}$ higher-order feature interactions without a combinatorial parameter explosion. SPAM outperforms all current interpretable approaches, and matches DNN/XGBoost performance on a series of real-world benchmarks with up to hundreds of thousands of features. We demonstrate via human subject evaluations that SPAMs are more interpretable in practice, and are hence an effortless replacement for DNNs for creating interpretable and high-performance systems suitable for large-scale machine learning.
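    A degree-2 instance of the idea, as a simplified sketch (not the authors' exact parameterization): a rank-R factorization represents the full pairwise-interaction matrix, so all O(d^2) interactions cost only O(dR) parameters.

        import torch
        from torch import nn

        class LowRankQuadratic(nn.Module):
            def __init__(self, n_features, rank=8):
                super().__init__()
                self.linear = nn.Linear(n_features, 1)
                self.U = nn.Parameter(torch.randn(n_features, rank) * 0.01)

            def forward(self, x):
                # sum_r (x @ u_r)^2 == x^T (U U^T) x: a rank-R interaction term
                # whose polynomial coefficients remain inspectable.
                return self.linear(x) + (x @ self.U).pow(2).sum(dim=1, keepdim=True)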
    Exploring Transformer Backbones for Heterogeneous Treatment Effect Estimation. (arXiv:2202.01336v4 [cs.LG] UPDATED)
    Neural networks (NNs) are often leveraged to represent structural similarities of potential outcomes (POs) of different treatment groups to obtain better finite-sample estimates of treatment effects. However, despite their wide use, existing works handcraft treatment-specific (sub)network architectures for representing various POs, which limit their applicability and generalizability. To remedy these issues, we develop a framework called Transformers as Treatment Effect Estimators (TransTEE) where attention layers govern interactions among treatments and covariates to exploit structural similarities of POs for confounding control. Using this framework, through extensive experiments, we show that TransTEE can: (1) serve as a general-purpose treatment effect estimator to significantly outperform competitive baselines on a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable both when covariates are tabular and when they consist of structural data (e.g., texts, graphs); (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, interpretability in covariate adjustment, and real-world utility in debugging pre-trained language models.
    Distributionally Robust Bayesian Optimization with $\phi$-divergences. (arXiv:2203.02128v2 [cs.LG] UPDATED)
    The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet only a limited number of works are dedicated to this direction. In particular, there is the work of Kirschner et al. (2020), which bridges the existing literature of Distributionally Robust Optimization (DRO) by casting the BO problem from the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings such as finite contexts assumptions, leaving behind the main question: can one devise a computationally tractable algorithm for solving this DRO-BO problem? In this work, we tackle this question to a large degree of generality by considering robustness against data-shift in $\phi$-divergences, which subsumes many popular choices, such as the $\chi^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results.
    Efficient Forecasting of Large Scale Hierarchical Time Series via Multilevel Clustering. (arXiv:2205.14104v1 [cs.LG])
    We propose a novel approach to the problem of clustering hierarchically aggregated time-series data, which has remained an understudied problem though it has several commercial applications. We first group time series at each aggregated level, while simultaneously leveraging local and global information. The proposed method can cluster hierarchical time series (HTS) with different lengths and structures. For common two-level hierarchies, we employ a combined objective for local and global clustering over spaces of discrete probability measures, using Wasserstein distance coupled with Soft-DTW divergence. For multi-level hierarchies, we present a bottom-up procedure that progressively leverages lower-level information for higher-level clustering. Our final goal is to improve both the accuracy and speed of forecasts for a larger number of HTS needed for a real-world application. To attain this goal, each time series is first assigned the forecast for its cluster representative, which can be considered as a "shrinkage prior" for the set of time series it represents. Then this base forecast can be quickly fine-tuned to adjust to the specifics of that time series. We empirically show that our method substantially improves performance in terms of both speed and accuracy for large-scale forecasting tasks involving many HTS.
    A Multilabel Classification Framework for Approximate Nearest Neighbor Search. (arXiv:1910.08322v4 [cs.LG] UPDATED)
    Both supervised and unsupervised machine learning algorithms have been used to learn partition-based index structures for approximate nearest neighbor (ANN) search. Existing supervised algorithms formulate the learning task as finding a partition in which the nearest neighbors of a training set point belong to the same partition element as the point itself, so that the nearest neighbor candidates can be retrieved by naive lookup or backtracking search. We formulate candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point, and interpret the partitions as partitioning classifiers for solving this task. Empirical results suggest that the natural classifier based on this interpretation leads to strictly improved performance when combined with any unsupervised or supervised partitioning strategy. We also prove a sufficient condition for consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees.
    X-ViT: High Performance Linear Vision Transformer without Softmax. (arXiv:2205.13805v1 [cs.CV])
    Vision transformers have become one of the most important models for computer vision tasks. Although they outperform prior works, they require heavy computational resources on a scale that is quadratic to the number of tokens, $N$. This is a major drawback of the traditional self-attention (SA) algorithm. Here, we propose the X-ViT, ViT with a novel SA mechanism that has linear complexity. The main approach of this work is to eliminate nonlinearity from the original SA. We factorize the matrix multiplication of the SA mechanism without complicated linear approximation. By modifying only a few lines of code from the original SA, the proposed models outperform most transformer-based models on image classification and dense prediction tasks on most capacity regimes.
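    The complexity claim rests on an associativity trick that is easy to sketch once the softmax is gone (normalization omitted; the actual X-ViT mechanism differs in detail):

        def linear_attention(Q, K, V):
            # Q, K, V: torch tensors of shape (batch, heads, N, d). Without the
            # softmax, (Q K^T) V can be regrouped as Q (K^T V); the d-by-d context
            # matrix replaces the N-by-N token-token matrix, so cost is linear in N.
            context = K.transpose(-2, -1) @ V  # O(N d^2)
            return Q @ context                 # O(N d^2), never forms N x N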
    Bayesian Robust Graph Contrastive Learning. (arXiv:2205.14109v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely used to learn node representations and with outstanding performance on various tasks such as node classification. However, noise, which inevitably exists in real-world graph data, would considerably degrade the performance of GNNs as the noise is easily propagated via the graph structure. In this work, we propose a novel and robust method, Bayesian Robust Graph Contrastive Learning (BRGCL), which trains a GNN encoder to learn robust node representations. The BRGCL encoder is a completely unsupervised encoder. Two steps are iteratively executed at each epoch of training the BRGCL encoder: (1) estimating confident nodes and computing robust cluster prototypes of node representations through a novel Bayesian nonparametric method; (2) prototypical contrastive learning between the node representations and the robust cluster prototypes. Experiments on public and large-scale benchmarks demonstrate the superior performance of BRGCL and the robustness of the learned node representations. The code of BRGCL is available at \url{https://github.com/BRGCL-code/BRGCL-code}.
    Safe Value Functions. (arXiv:2105.12204v3 [eess.SY] UPDATED)
    Safety constraints and optimality are important, but sometimes conflicting criteria for controllers. Although these criteria are often solved separately with different tools to maintain formal guarantees, it is also common practice in reinforcement learning to simply modify reward functions by penalizing failures, with the penalty treated as a mere heuristic. We rigorously examine the relationship of both safety and optimality to penalties, and formalize sufficient conditions for safe value functions: value functions that are both optimal for a given task, and enforce safety constraints. We reveal this structure by examining when rewards preserve viability under optimal control, and show that there always exists a finite penalty that induces a safe value function. This penalty is not unique, but upper-unbounded: larger penalties do not harm optimality. Although it is often not possible to compute the minimum required penalty, we reveal clear structure of how the penalty, rewards, discount factor, and dynamics interact. This insight suggests practical, theory-guided heuristics to design reward functions for control problems where safety is important.
    Approximating the Manifold Structure of Attributed Incentive Salience from Large Scale Behavioural Data. A Representation Learning Approach Based on Artificial Neural Networks. (arXiv:2108.01724v2 [cs.LG] UPDATED)
    Incentive salience attribution can be understood as a psychobiological mechanism ascribing relevance to potentially rewarding objects and actions. Despite being an important component of the motivational process guiding our everyday behaviour, its study in naturalistic contexts is not straightforward. Here we propose a methodology based on artificial neural networks (ANNs) for approximating latent states produced by this process in situations where large volumes of behavioural data are available but no experimental control is possible. Leveraging knowledge derived from theoretical and computational accounts of incentive salience attribution we designed an ANN for estimating duration and intensity of future interactions between individuals and a series of video games in a large-scale ($N> 3 \times 10^6$) longitudinal dataset. We found video games to be the ideal context for developing such methodology due to their reliance on reward mechanics and their ability to provide ecologically robust behavioural measures at scale. When compared to competing approaches our methodology produces representations that are better suited for predicting the intensity of future behaviour and approximating some functional properties of attributed incentive salience. We discuss our findings with reference to the adopted theoretical and computational frameworks and suggest how our methodology could be an initial step for estimating attributed incentive salience in large scale behavioural studies.
    A Model Predictive Control Functional Continuous Time Bayesian Network for Self-Management of Multiple Chronic Conditions. (arXiv:2205.13639v1 [cs.LG])
    Multiple chronic conditions (MCC) are one of the biggest challenges of modern times. The evolution of MCC follows a complex stochastic process that is influenced by a variety of risk factors, ranging from pre-existing conditions to modifiable lifestyle behavioral factors (e.g. diet, exercise habits, tobacco use, alcohol use, etc.) to non-modifiable socio-demographic factors (e.g., age, gender, education, marital status, etc.). People with MCC are at an increased risk of new chronic conditions and mortality. This paper proposes a model predictive control functional continuous time Bayesian network, an online recursive method to examine the impact of various lifestyle behavioral changes on the emergence trajectories of MCC and generate strategies to minimize the risk of progression of chronic conditions in individual patients. The proposed method is validated on the Cameron County Hispanic Cohort (CCHC) dataset, which has a total of 385 patients. The dataset examines the emergence of 5 chronic conditions (diabetes, obesity, cognitive impairment, hyperlipidemia, and hypertension) based on four modifiable risk factors representing lifestyle behaviors (diet, exercise habits, tobacco use, alcohol use) and four non-modifiable risk factors, including socio-demographic information (age, gender, education, marital status). The proposed method is tested under different scenarios (e.g., age group, the prior existence of MCC), demonstrating effective intervention strategies for improving the lifestyle behavioral risk factors to offset MCC evolution.
    Deep Ensembles for Graphs with Higher-order Dependencies. (arXiv:2205.13988v1 [cs.LG])
    Graph neural networks (GNNs) continue to achieve state-of-the-art performance on many graph learning tasks, but rely on the assumption that a given graph is a sufficient approximation of the true neighborhood structure. In the presence of higher-order sequential dependencies, we show that the tendency of traditional graph representations to underfit each node's neighborhood causes existing GNNs to generalize poorly. To address this, we propose a novel Deep Graph Ensemble (DGE), which captures neighborhood variance by training an ensemble of GNNs on different neighborhood subspaces of the same node within a higher-order network structure. We show that DGE consistently outperforms existing GNNs on semi-supervised and supervised tasks on four real-world data sets with known higher-order dependencies, even under a similar parameter budget. We demonstrate that learning diverse and accurate base classifiers is central to DGE's success, and discuss the implications of these findings for future work on GNNs.
    Anonymization for Skeleton Action Recognition. (arXiv:2111.15129v2 [cs.CV] UPDATED)
    Skeleton-based action recognition attracts practitioners and researchers due to the lightweight, compact nature of datasets. Compared with RGB-video-based action recognition, skeleton-based action recognition is a safer way to protect the privacy of subjects while having competitive recognition performance. However, due to improvements in skeleton estimation algorithms as well as motion- and depth-sensors, more details of motion characteristics can be preserved in the skeleton dataset, leading to potential privacy leakage. To investigate the potential privacy leakage from skeleton datasets, we first train a classifier to categorize sensitive private information from trajectories of joints. Our preliminary experiments show that the gender classifier achieves 87% accuracy on average and the re-identification task achieves 80% accuracy on average for three baseline models: Shift-GCN, MS-G3D, and 2s-AGCN. We propose an adversarial anonymization algorithm to protect potential privacy leakage from the skeleton dataset. Experimental results show that an anonymized dataset can reduce the risk of privacy leakage while having marginal effects on action recognition performance.
    ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling. (arXiv:2203.04510v2 [cs.LG] UPDATED)
    This paper studies the problem of data collection for policy evaluation in Markov decision processes (MDPs). In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain in an environment formalized as an MDP. We develop theory for optimal data collection within the class of tree-structured MDPs by first deriving an oracle data collection strategy that uses knowledge of the variance of the reward distributions. We then introduce the Reduced Variance Sampling (ReVar) algorithm that approximates the oracle strategy when the reward variances are unknown a priori and bound its sub-optimality compared to the oracle strategy. Finally, we empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
    MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models. (arXiv:2205.13869v1 [cs.LG])
    State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may suffer from suboptimality, as the imputation algorithm is unaware of the causal discovery step. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and the identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization (EM) framework. In the E-step, in cases where computing the posterior distributions of parameters in closed-form is not feasible, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler and specific formulations by virtue of the ANMs and uses a likelihood-based causal discovery algorithm with directed acyclic graph prior as an inductive bias. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
    How Tempering Fixes Data Augmentation in Bayesian Neural Networks. (arXiv:2205.13900v1 [cs.LG])
    While Bayesian neural networks (BNNs) provide a sound and principled alternative to standard neural networks, an artificial sharpening of the posterior usually needs to be applied to reach comparable performance. This is in stark contrast to theory, which dictates that given an adequate prior and a well-specified model, the untempered Bayesian posterior should achieve optimal performance. Despite the community's extensive efforts, the observed gains in performance still remain disputed, with several plausible causes pointed to as their origin. While data augmentation has been empirically recognized as one of the main drivers of this effect, a theoretical account of its role is largely missing. In this work we identify two interlaced factors concurrently influencing the strength of the cold posterior effect, namely the correlated nature of augmentations and the degree of invariance of the employed model to such transformations. By theoretically analyzing simplified settings, we prove that tempering implicitly reduces the misspecification arising from modeling augmentations as i.i.d. data. The temperature mimics the role of the effective sample size, reflecting the gain in information provided by the augmentations. We corroborate our theoretical findings with extensive empirical evaluations, scaling to realistic BNNs. By relying on the framework of group convolutions, we experiment with models of varying inherent degree of invariance, confirming its hypothesized relationship with the optimal temperature.
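    For reference, and using standard cold-posterior notation rather than anything specific to this paper: the tempered posterior at temperature $T$ is $p_T(\theta \mid \mathcal{D}) \propto \exp(-U(\theta)/T)$ with $U(\theta) = -\sum_{i=1}^{n} \log p(x_i \mid \theta) - \log p(\theta)$, so $T = 1$ recovers the Bayes posterior while $T < 1$ ("cold") sharpens it. On the paper's account, $1/T$ acts like an effective sample size, discounting augmented copies of a datum that are correlated rather than i.i.d.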
    Towards Interpretable Natural Language Understanding with Explanations as Latent Variables. (arXiv:2011.05268v3 [cs.CL] UPDATED)
    Recently, generating natural language explanations has shown very promising results, not only offering interpretable explanations but also providing additional information and supervision for prediction. However, existing approaches usually require a large set of human-annotated explanations for training, and collecting a large set of explanations is both time-consuming and expensive. In this paper, we develop a general framework for interpretable natural language understanding that requires only a small set of human-annotated explanations for training. Our framework treats natural language explanations as latent variables that model the underlying reasoning process of a neural model. We develop a variational EM framework for optimization in which an explanation generation module and an explanation-augmented prediction module are alternately optimized and mutually enhance each other. Moreover, we propose an explanation-based self-training method under this framework for semi-supervised learning. It alternates between assigning pseudo-labels to unlabeled data and generating new explanations to iteratively improve each other. Experiments on two natural language understanding tasks demonstrate that our framework can not only make effective predictions in both supervised and semi-supervised settings, but also generate good natural language explanations.
    Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation. (arXiv:2205.14141v1 [cs.CV])
    Masked image modeling (MIM) learns representations with remarkably good fine-tuning performance, overshadowing previously prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by simple post-processing in the form of feature distillation (FD). Feature distillation converts the old representations to new representations that have a few desirable properties, just like the representations produced by MIM. These properties, which we collectively refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, contrastive self-supervised learning methods become as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The fine-tuning performance of CLIP models is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. More importantly, our work provides a way for future research to focus more effort on the generality and scalability of the learnt representations without being preoccupied with optimization friendliness, since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.
    AsyncFedED: Asynchronous Federated Learning with Euclidean Distance based Adaptive Weight Aggregation. (arXiv:2205.13797v1 [cs.LG])
    In an asynchronous federated learning framework, the server updates the global model as soon as it receives an update from a client, instead of waiting for all updates to arrive as in the synchronous setting. This allows heterogeneous devices with varied computing power to train local models without pausing, thereby speeding up training. However, it introduces the stale model problem: a newly arrived update was calculated on a set of stale weights older than the current global model, which may hurt the convergence of the model. In this paper, we present an asynchronous federated learning framework with a proposed adaptive weight aggregation algorithm, referred to as AsyncFedED. To the best of our knowledge, this aggregation method is the first to take into account both the staleness of arrived gradients, measured by the Euclidean distance between the stale model and the current global model, and the number of local epochs that have been performed. Assuming general non-convex loss functions, we prove the convergence of the proposed method theoretically. Numerical results validate the effectiveness of the proposed AsyncFedED in terms of convergence rate and model accuracy compared to existing methods on three considered tasks.
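    A minimal server-side sketch of the idea, with an assumed form for the adaptive weight (the exact schedule is specified in the paper): the mixing coefficient shrinks as the Euclidean staleness grows and is normalized by the number of local epochs.

        import numpy as np

        def async_server_step(w_global, w_client, w_base, local_epochs, eta0=1.0):
            # w_base is the (possibly stale) global model the client started from.
            staleness = np.linalg.norm(w_global - w_base)      # Euclidean staleness
            # Illustrative schedule: shrink the mixing weight with staleness,
            # normalized by the amount of local work performed.
            alpha = min(1.0, eta0 / (1.0 + staleness / max(local_epochs, 1)))
            return (1.0 - alpha) * w_global + alpha * w_client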
    Raising the Bar in Graph-level Anomaly Detection. (arXiv:2205.13845v1 [cs.LG])
    Graph-level anomaly detection has become a critical topic in diverse areas, such as financial fraud detection and detecting anomalous activities in social networks. While most research has focused on anomaly detection for visual data such as images, where high detection accuracies have been obtained, existing deep learning approaches for graphs currently show considerably worse performance. This paper raises the bar on graph-level anomaly detection, i.e., the task of detecting abnormal graphs in a set of graphs. By drawing on ideas from self-supervised learning and transformation learning, we present a new deep learning approach that significantly improves existing deep one-class approaches by fixing some of their known problems, including hypersphere collapse and performance flip. Experiments on nine real-world data sets involving nine techniques reveal that our method achieves an average performance improvement of 11.8% AUC compared to the best existing approach.
    RecipeRec: A Heterogeneous Graph Learning Model for Recipe Recommendation. (arXiv:2205.14005v1 [cs.IR])
    Recipe recommendation systems play an essential role in helping people decide what to eat. Existing recipe recommendation systems typically focus on content-based or collaborative filtering approaches, ignoring higher-order collaborative signals such as relational structure information among users, recipes and food items. In this paper, we formalize the problem of recipe recommendation with graphs to incorporate the collaborative signal into recipe recommendation through graph modeling. In particular, we first present URI-Graph, a new and large-scale user-recipe-ingredient graph. We then propose RecipeRec, a novel heterogeneous graph learning model for recipe recommendation. The proposed model can capture recipe content and collaborative signal through a heterogeneous graph neural network with hierarchical attention and an ingredient set transformer. We also introduce a graph contrastive augmentation strategy to extract informative graph knowledge in a self-supervised manner. Finally, we design a joint objective function of recommendation and contrastive learning to optimize the model. Extensive experiments demonstrate that RecipeRec outperforms state-of-the-art methods for recipe recommendation. Dataset and codes are available at https://github.com/meettyj/RecipeRec.
    Maximum Likelihood Training of Implicit Nonlinear Diffusion Models. (arXiv:2205.13699v1 [cs.LG])
    Although diverse variants of diffusion models exist, expanding the linear diffusion into a nonlinear diffusion process has been investigated by only a few works. The effect of nonlinearity is still poorly understood, but intuitively there should be diffusion patterns that train the generative distribution towards the data distribution more effectively. This paper introduces such a data-adaptive, nonlinear diffusion process for score-based diffusion models. The proposed Implicit Nonlinear Diffusion Model (INDM) learns the nonlinear diffusion process by combining a normalizing flow and a diffusion process. Specifically, INDM implicitly constructs a nonlinear diffusion on the \textit{data space} by leveraging a linear diffusion on the \textit{latent space} through a flow network. This flow network is the key to forming a nonlinear diffusion, as the nonlinearity fully depends on the flow network. This flexible nonlinearity is what improves the learning curve of INDM to nearly MLE training, compared against the non-MLE training of DDPM++, which turns out to be a special case of INDM with the identity flow. Also, training the nonlinear diffusion empirically yields a sampling-friendly latent diffusion whose sample trajectory is closer to an optimal transport than the trajectories of previous works. In experiments, INDM achieves the state-of-the-art FID on CelebA.
    CIGMO: Categorical invariant representations in a deep generative framework. (arXiv:2205.13758v1 [cs.CV])
    General object image data exhibit two common structures: (1) each object of a given shape can be rendered in multiple different views, and (2) shapes of objects can be categorized such that the diversity of shapes is much larger across categories than within a category. Existing deep generative models can typically capture either structure, but not both. In this work, we introduce a novel deep generative model, called CIGMO, that can learn to represent category, shape, and view factors from image data. The model comprises multiple modules of shape representations, each specialized to a particular category and disentangled from the view representation, and can be learned using a group-based weakly supervised learning method. By empirical investigation, we show that our model can effectively discover categories of object shapes despite large view variation and quantitatively surpass various previous methods, including the state-of-the-art invariant clustering algorithm. Further, we show that our category-specialization approach can enhance the learned shape representation to better perform downstream tasks such as one-shot object identification as well as shape-view disentanglement.
    Generative Flows as a General Purpose Solution for Inverse Problems. (arXiv:2110.13285v3 [cs.CV] UPDATED)
    Due to the success of generative flows in modeling data distributions, they have been explored for inverse problems. Given a pre-trained generative flow, previous work proposed to minimize the 2-norm of the latent variables as a regularization term, the intuition being to ensure high-likelihood latent variables that produce the closest restoration. However, high-likelihood latent variables may generate unrealistic samples, as we show in our experiments. We therefore propose a solver that directly produces high-likelihood reconstructions. We hypothesize that our approach could make generative flows a general-purpose solver for inverse problems. Furthermore, we propose 1 x 1 coupling functions to introduce permutations in a generative flow; their advantage is that the inverse need not be calculated during generation. Finally, we evaluate our method for denoising, deblurring, inpainting, and colorization. We observe a compelling improvement of our method over prior works.
    Deep Coding Patterns Design for Compressive Near-Infrared Spectral Classification. (arXiv:2205.14069v1 [cs.LG])
    Compressive spectral imaging (CSI) has emerged as an attractive compression and sensing technique, primarily for sensing spectral regions, such as the near-infrared, where traditional systems are highly costly. Recently, it has been shown that spectral classification can be performed directly in the compressive domain, given the amount of spectral information embedded in the measurements, skipping the reconstruction step. Consequently, classification quality directly depends on the set of coding patterns employed in the sensing step. This work therefore proposes an end-to-end approach to jointly design the coding patterns used in CSI and the network parameters to perform spectral classification directly from the embedded near-infrared compressive measurements. Extensive simulation on the three-dimensional coded aperture snapshot spectral imaging (3D-CASSI) system validates that the proposed design outperforms traditional and random designs by up to 10% in classification accuracy.
    Sharpness-Aware Training for Free. (arXiv:2205.14083v1 [cs.LG])
    Modern deep neural networks (DNNs) have achieved state-of-the-art performance but are typically over-parameterized. The over-parameterization may result in an undesirably large generalization error in the absence of other customized training strategies. Recently, a line of research under the name of Sharpness-Aware Minimization (SAM) has shown that minimizing a sharpness measure, which reflects the geometry of the loss landscape, can significantly reduce the generalization error. However, SAM-like methods incur a two-fold computational overhead over the given base optimizer (e.g., SGD) for approximating the sharpness measure. In this paper, we propose Sharpness-Aware Training for Free, or SAF, which mitigates sharp landscapes at almost zero additional computational cost over the base optimizer. Intuitively, SAF achieves this by avoiding sudden drops in the loss at sharp local minima throughout the trajectory of weight updates. Specifically, we suggest a novel trajectory loss, based on the KL-divergence between the outputs of DNNs with the current weights and past weights, as a replacement for SAM's sharpness measure. This loss captures the rate of change of the training loss along the model's update trajectory. By minimizing it, SAF ensures convergence to a flat minimum with improved generalization capabilities. Extensive empirical results show that SAF minimizes sharpness in the same way that SAM does, yielding better results on the ImageNet dataset with essentially the same computational cost as the base optimizer.
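    A sketch of the trajectory loss as described above, assuming the past logits for the minibatch were cached earlier in training; the temperature and weighting are illustrative choices, not values from the paper.

        import torch
        import torch.nn.functional as F

        def saf_trajectory_loss(logits_now, logits_past, tau=1.0):
            # KL divergence between the predictive distribution under past
            # weights and the one under current weights (no extra forward
            # pass is needed if past logits were cached during training).
            p_past = F.softmax(logits_past.detach() / tau, dim=-1)
            log_p_now = F.log_softmax(logits_now / tau, dim=-1)
            return F.kl_div(log_p_now, p_past, reduction="batchmean")

        # total loss (lam is a hypothetical weighting):
        # F.cross_entropy(logits_now, y) + lam * saf_trajectory_loss(logits_now, cached_logits)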
    Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms. (arXiv:2107.03863v3 [stat.ML] UPDATED)
    Describing the relationship between the variables in a study domain and modelling the data generating mechanism is a fundamental problem in many empirical sciences. Probabilistic graphical models are one common approach to tackle the problem. Learning the graphical structure for such models is computationally challenging and a fervent area of current research, with a plethora of algorithms being developed. To facilitate the benchmarking of different methods, we present a novel Snakemake workflow, called Benchpress, for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms for probabilistic graphical models. Benchpress is interfaced via a simple JSON file, which makes it accessible to all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. Benchpress currently provides an interface to a large number of state-of-the-art algorithms from libraries such as BDgraph, BiDAG, bnlearn, gCastle, GOBNILP, pcalg, r.blip, scikit-learn, TETRAD, and trilearn, as well as a variety of methods for data generating models and performance evaluation. Alongside user-defined models and randomly generated datasets, the workflow also includes a number of standard datasets and graphical models from the literature, which may be included in a benchmarking study. We demonstrate the applicability of this workflow for learning Bayesian networks in five typical data scenarios. The source code and documentation are publicly available from this http URL
    Text-Based Automatic Personality Prediction Using KGrAt-Net; A Knowledge Graph Attention Network Classifier. (arXiv:2205.13780v1 [cs.CL])
    Nowadays, a tremendous amount of human communication takes place on Internet-based communication infrastructures, like social networks, email, forums, organizational communication platforms, etc. Indeed, automatically predicting or assessing individuals' personalities from their written or exchanged text would be advantageous for improving the relationships among them. To this end, this paper proposes KGrAt-Net, a Knowledge Graph Attention Network text classifier. For the first time, it applies a knowledge graph attention network to perform Automatic Personality Prediction (APP) according to the Big Five personality traits. After some preprocessing activities, it first tries to acquire a meaningful representation of the knowledge behind the concepts in the input text by building its equivalent knowledge graph. A knowledge graph is a graph-based data model that formally represents the semantics of the concepts present in the input text and models the knowledge behind them. Then, applying the attention mechanism, it attends to the most relevant parts of the graph to predict the personality traits of the input text. The results demonstrate that KGrAt-Net considerably improves personality prediction accuracy. Furthermore, KGrAt-Net uses knowledge graph embeddings to enrich the classification, which makes it even more accurate in APP.
    Client Selection in Nonconvex Federated Learning: Improved Convergence Analysis for Optimal Unbiased Sampling Strategy. (arXiv:2205.13925v1 [cs.LG])
    Federated learning (FL) is a distributed machine learning paradigm that selects a subset of clients to participate in training to reduce communication burdens. However, partial client participation in FL causes \emph{objective inconsistency}, which can hinder convergence, and this objective inconsistency has not been analyzed in existing studies on sampling methods. To tackle this issue, we propose an improved analysis method that focuses on the convergence behavior of the participating clients' objective in practice. Moreover, based on our convergence analysis, we give a novel unbiased sampling strategy, FedSRC-D, whose sampling probability is proportional to the client's gradient diversity and local variance. FedSRC-D is provably the optimal unbiased sampling strategy in non-convex settings for non-IID FL with respect to the given bounds. Specifically, FedSRC-D improves on the SOTA convergence rate of FedAvg by $\mathop{O}(\frac{G^2}{\epsilon^2}+\frac{1}{\epsilon^{2/3}})$, and on other unbiased sampling methods by $\mathop{O}(\frac{G^2}{\epsilon^2})$. We corroborate our results with experiments on both synthetic and real data sets.
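    The unbiasedness mechanism can be sketched as importance-weighted aggregation: clients are drawn with probability proportional to a per-client score (for FedSRC-D, built from gradient diversity and local variance; the exact formula is in the paper, so `scores` is a stand-in), and sampled updates are reweighted by the inverse probabilities.

        import numpy as np

        def sample_and_aggregate(updates, scores, m, rng):
            # updates: list of n client update vectors; scores: per-client scores.
            p = np.asarray(scores, dtype=float)
            p = p / p.sum()
            idx = rng.choice(len(p), size=m, replace=True, p=p)
            n = len(updates)
            # Inverse-probability weighting keeps E[aggregate] equal to the
            # full-participation average (1/n) * sum_i updates[i].
            return sum(updates[i] / (m * n * p[i]) for i in idx)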
    Guided Exploration of Data Summaries. (arXiv:2205.13956v1 [cs.LG])
    Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed as a one-shot process with the purpose of finding the best summary. A useful summary contains k individually uniform sets that are collectively diverse so as to be representative. Uniformity addresses interpretability and diversity addresses representativity. Finding such a summary is a difficult task when data is highly diverse and large. We examine the applicability of Exploratory Data Analysis (EDA) to data summarization and formalize Eda4Sum, the problem of guided exploration of data summaries, which seeks to sequentially produce connected summaries with the goal of maximizing their cumulative utility. Eda4Sum generalizes one-shot summarization. We propose to solve it with one of two approaches: (i) Top1Sum, which chooses the most useful summary at each step; and (ii) RLSum, which trains a policy with deep reinforcement learning that rewards an agent for finding a diverse and new collection of uniform sets at each step. We compare these approaches with one-shot summarization and top-performing EDA solutions. We run extensive experiments on three large datasets. Our results demonstrate the superiority of our approaches for summarizing very large data, and the need to provide guidance to domain experts.
    Contrastive Siamese Network for Semi-supervised Speech Recognition. (arXiv:2205.14054v1 [cs.LG])
    This paper introduces contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. c-siam is the first network that extracts high-level linguistic information from speech by matching outputs of two identical transformer encoders. It contains augmented and target branches which are trained by: (1) masking inputs and matching outputs with a contrastive loss, (2) incorporating a stop gradient operation on the target branch, (3) using an extra learnable transformation on the augmented branch, (4) introducing new temporal augment functions to prevent the shortcut learning problem. We use the Libri-light 60k unsupervised data and the LibriSpeech 100hrs/960hrs supervised data to compare c-siam and other best-performing systems. Our experiments show that c-siam provides 20% relative word error rate improvement over wav2vec baselines. A c-siam network with 450M parameters achieves competitive results compared to the state-of-the-art networks with 600M parameters.
    Generalizing Brain Decoding Across Subjects with Deep Learning. (arXiv:2205.14102v1 [cs.LG])
    Decoding experimental variables from brain imaging data is gaining popularity, with applications in brain-computer interfaces and the study of neural representations. Decoding is typically subject-specific and does not generalise well over subjects. Here, we investigate ways to achieve cross-subject decoding. We used magnetoencephalography (MEG) data where 15 subjects viewed 118 different images, with 30 examples per image. Training on the entire 1s window following the presentation of each image, we experimented with an adaptation of the WaveNet architecture for classification. We also investigated the use of subject embedding to aid learning of subject variability in the group model. We show that deep learning and subject embedding are crucial to closing the performance gap between subject- and group-level models. Importantly, group models outperform subject models when tested on an unseen subject with little available data. The potential of such group modelling is even higher with bigger datasets. Furthermore, we demonstrate the use of permutation feature importance to gain insight into the spatio-temporal and spectral information encoded in the models, enabling better physiological interpretation. All experimental code is available at https://github.com/ricsinaruto/MEG-group-decode.
    Finite mixture of skewed sub-Gaussian stable distributions. (arXiv:2205.14067v1 [stat.ME])
    We propose a finite mixture of skewed sub-Gaussian stable distributions. The maximum likelihood estimator for the parameters of the proposed finite mixture model is computed through the expectation-maximization algorithm. The proposed model contains the finite mixtures of normal and skew-normal distributions as special cases. Since the tails of the proposed model are heavier than even those of the Student's t distribution, it can serve as a powerful model for robust model-based clustering. Performance of the proposed model is demonstrated by clustering simulated data and two sets of real data.
    Capturing Graphs with Hypo-Elliptic Diffusions. (arXiv:2205.14092v1 [cs.LG])
    Convolutional layers within graph neural networks operate by aggregating information about local neighbourhood structures; one common way to encode such substructures is through random walks. The distribution of these random walks evolves according to a diffusion equation defined using the graph Laplacian. We extend this approach by leveraging classic mathematical results about hypo-elliptic diffusions. This results in a novel tensor-valued graph operator, which we call the hypo-elliptic graph Laplacian. We provide theoretical guarantees and efficient low-rank approximation algorithms. In particular, this gives a structured approach to capture long-range dependencies on graphs that is robust to pooling. Besides the attractive theoretical properties, our experiments show that this method competes with graph transformers on datasets requiring long-range reasoning but scales only linearly in the number of edges as opposed to quadratically in nodes.
    Towards Handling Uncertainty-at-Source in AI -- A Review and Next Steps for Interval Regression. (arXiv:2104.07245v2 [cs.LG] UPDATED)
    Most of statistics and AI draw insights through modelling discord or variance between sources of information (i.e., inter-source uncertainty). Increasingly, however, research is focusing upon uncertainty arising at the level of individual measurements (i.e., within- or intra-source), such as for a given sensor output or human response. Here, adopting intervals rather than numbers as the fundamental data-type provides an efficient, powerful, yet challenging way forward -- offering systematic capture of uncertainty-at-source, increasing informational capacity, and ultimately potential for insight. Following recent progress in the capture of interval-valued data, including from human participants, conducting machine learning directly upon intervals is a crucial next step. This paper focuses on linear regression for interval-valued data as a recent growth area, providing an essential foundation for broader use of intervals in AI. We conduct an in-depth analysis of state-of-the-art methods, elucidating their behaviour, advantages, and pitfalls when applied to datasets with different properties. Specific emphasis is given to the challenge of preserving mathematical coherence -- i.e., ensuring that models maintain fundamental mathematical properties of intervals throughout -- and the paper puts forward extensions to an existing approach to guarantee this. Carefully designed experiments, using both synthetic and real-world data, are conducted -- with findings presented alongside novel visualizations for interval-valued regression outputs, designed to maximise model interpretability. Finally, the paper makes recommendations concerning method suitability for data sets with specific properties and highlights remaining challenges and important next steps for developing AI with the capacity to handle uncertainty-at-source.
    Dual Convexified Convolutional Neural Networks. (arXiv:2205.14056v1 [cs.LG])
    We propose the framework of dual convexified convolutional neural networks (DCCNNs). In this framework, we first introduce a primal learning problem motivated by convexified convolutional neural networks (CCNNs), and then construct the dual convex training program through careful analysis of the Karush-Kuhn-Tucker (KKT) conditions and Fenchel conjugates. Our approach reduces the memory overhead of constructing a large kernel matrix and eliminates the ambiguity of factorizing the matrix. Due to the low-rank structure in CCNNs and the related subdifferential of nuclear norms, there is no closed-form expression to recover the primal solution from the dual solution. To overcome this, we propose a novel weight recovery algorithm, which takes the dual solution and the kernel information as input, and recovers the linear and convolutional weights of a CCNN. Furthermore, our recovery algorithm exploits the low-rank structure and imposes a small number of filters indirectly, which reduces the parameter size. As a result, DCCNNs inherit all the statistical benefits of CCNNs, while enjoying a more formal and efficient workflow.
    Adaptive Random Forests for Energy-Efficient Inference on Microcontrollers. (arXiv:2205.13838v1 [cs.LG])
    Random Forests (RFs) are widely used Machine Learning models in low-power embedded devices, due to their hardware friendly operation and high accuracy on practically relevant tasks. The accuracy of a RF often increases with the number of internal weak learners (decision trees), but at the cost of a proportional increase in inference latency and energy consumption. Such costs can be mitigated considering that, in most applications, inputs are not all equally difficult to classify. Therefore, a large RF is often necessary only for (few) hard inputs, and wasteful for easier ones. In this work, we propose an early-stopping mechanism for RFs, which terminates the inference as soon as a high-enough classification confidence is reached, reducing the number of weak learners executed for easy inputs. The early-stopping confidence threshold can be controlled at runtime, in order to favor either energy saving or accuracy. We apply our method to three different embedded classification tasks, on a single-core RISC-V microcontroller, achieving an energy reduction from 38% to more than 90% with a drop of less than 0.5% in accuracy. We also show that our approach outperforms previous adaptive ML methods for RFs.
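    The early-stopping rule can be sketched as follows with scikit-learn; the confidence aggregation and threshold policy here are illustrative assumptions rather than the paper's exact policy.

        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.ensemble import RandomForestClassifier

        def early_stop_predict(forest, x, threshold=0.9, min_trees=4):
            # Evaluate trees one at a time; stop once the running class
            # probability estimate is confident enough. The threshold can be
            # tuned at runtime to trade accuracy for energy.
            votes = np.zeros(forest.n_classes_)
            t = 0
            for t, tree in enumerate(forest.estimators_, start=1):
                votes += tree.predict_proba(x.reshape(1, -1))[0]
                if t >= min_trees and votes.max() / t >= threshold:
                    break
            return forest.classes_[votes.argmax()], t  # label, trees actually used

        X, y = load_iris(return_X_y=True)
        rf = RandomForestClassifier(n_estimators=64, random_state=0).fit(X, y)
        label, trees_used = early_stop_predict(rf, X[0])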
    DRLComplex: Reconstruction of protein quaternary structures using deep reinforcement learning. (arXiv:2205.13594v1 [cs.LG])
    Predicted inter-chain residue-residue contacts can be used to build the quaternary structure of protein complexes from scratch. However, only a small number of methods have been developed to reconstruct protein quaternary structures using predicted inter-chain contacts. Here, we present an agent-based self-learning method based on deep reinforcement learning (DRLComplex) to build protein complex structures using inter-chain contacts as distance constraints. We rigorously tested DRLComplex on two standard datasets of homodimeric and heterodimeric protein complexes (the CASP-CAPRI homodimer and Std_32 heterodimer datasets) using both true and predicted inter-chain contacts as inputs. Utilizing true contacts as input, DRLComplex achieved high average TM-scores of 0.9895 and 0.9881 and low average interface RMSDs (I_RMSD) of 0.2197 and 0.92 on the two datasets, respectively. When predicted contacts are used, the method achieves TM-scores of 0.73 and 0.76 for homodimers and heterodimers, respectively. Our experiments find that the accuracy of the reconstructed quaternary structures depends on the accuracy of the contact predictions. Compared to other optimization methods for reconstructing quaternary structures from inter-chain contacts, DRLComplex performs similarly to an advanced gradient descent method and better than a Markov Chain Monte Carlo simulation method and a simulated annealing-based method, validating the effectiveness of DRLComplex for quaternary reconstruction of protein complexes.
    BagFlip: A Certified Defense against Data Poisoning. (arXiv:2205.13634v1 [cs.LG])
    Machine learning models are vulnerable to data-poisoning attacks, in which an attacker maliciously modifies the training set to change the prediction of a learned model. In a trigger-less attack, the attacker can modify the training set but not the test inputs, while in a backdoor attack the attacker can also modify test inputs. Existing model-agnostic defense approaches either cannot handle backdoor attacks or do not provide effective certificates (i.e., a proof of a defense). We present BagFlip, a model-agnostic certified approach that can effectively defend against both trigger-less and backdoor attacks. We evaluate BagFlip on image classification and malware detection datasets. BagFlip is equal to or more effective than the state-of-the-art approaches for trigger-less attacks and more effective than the state-of-the-art approaches for backdoor attacks.
    Combining observational datasets from multiple environments to detect hidden confounding. (arXiv:2205.13935v1 [stat.ME])
    A common assumption in causal inference from observational data is the assumption of no hidden confounding. Yet it is, in general, impossible to verify the presence of hidden confounding factors from a single dataset. However, under the assumption of independent causal mechanisms underlying the data generative process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only violated during hidden confounding and examine cases where we break its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies.
    Deep Learning Fetal Ultrasound Video Model Match Human Observers in Biometric Measurements. (arXiv:2205.13835v1 [eess.IV])
    Objective. This work investigates the use of deep convolutional neural networks (CNN) to automatically perform measurements of fetal body parts, including head circumference, biparietal diameter, abdominal circumference and femur length, and to estimate gestational age and fetal weight from fetal ultrasound videos. Approach. We developed a novel multi-task CNN-based spatio-temporal fetal US feature extraction and standard plane detection algorithm (called FUVAI) and evaluated the method on 50 freehand fetal US video scans. We compared FUVAI fetal biometric measurements with measurements made by five experienced sonographers at two time points separated by at least two weeks. Intra- and inter-observer variabilities were estimated. Main results. We found that automated fetal biometric measurements obtained by FUVAI were comparable to the measurements performed by experienced sonographers. The observed differences in measurement values were within the range of inter- and intra-observer variability. Moreover, analysis showed that these differences were not statistically significant when comparing any individual medical expert to our model. Significance. We argue that FUVAI has the potential to assist sonographers who perform fetal biometric measurements in clinical settings by providing them with suggestions regarding the best measuring frames, along with automated measurements. Moreover, FUVAI is able to perform these tasks in just a few seconds, compared to the average of six minutes taken by sonographers. This is significant given the shortage of medical experts capable of interpreting fetal ultrasound images in numerous countries.
    FNet: Mixing Tokens with Fourier Transforms. (arXiv:2105.03824v4 [cs.CL] UPDATED)
    We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
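    The mixing sublayer itself is tiny; per the paper, it is the real part of a 2D discrete Fourier transform applied along the hidden and sequence dimensions, with no learnable parameters:

        import torch

        class FourierMixing(torch.nn.Module):
            # Drop-in replacement for the self-attention sublayer: a DFT
            # along the hidden dimension, then along the sequence dimension,
            # keeping only the real part.
            def forward(self, x):              # x: (batch, seq, hidden)
                return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

        # used inside a standard encoder block, e.g. x = x + FourierMixing()(norm(x)),
        # followed by the usual feed-forward sublayer.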
    Learning to Solve Combinatorial Graph Partitioning Problems via Efficient Exploration. (arXiv:2205.14105v1 [cs.LG])
    From logistics to the natural sciences, combinatorial optimisation on graphs underpins numerous real-world applications. Reinforcement learning (RL) has shown particular promise in this setting, as it can adapt to specific problem structures and does not require pre-solved instances for these often NP-hard problems. However, state-of-the-art (SOTA) approaches typically suffer from severe scalability issues, primarily due to their reliance on expensive graph neural networks (GNNs) at each decision step. We introduce ECORD, a novel RL algorithm that alleviates this expense by restricting the GNN to a single pre-processing step, before entering a fast-acting exploratory phase directed by a recurrent unit. Experimentally, ECORD achieves a new SOTA for RL algorithms on the Maximum Cut problem, whilst also providing orders-of-magnitude improvements in speed and scalability. Compared to the nearest competitor, ECORD reduces the optimality gap by up to 73% on 500-vertex graphs with decreased wall-clock time. Moreover, ECORD retains strong performance when generalising to larger graphs with up to 10000 vertices.
    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. (arXiv:2205.14135v1 [cs.LG])
    Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
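    The core trick, stripped of all GPU specifics, is an online softmax over key/value blocks that never materializes the full attention matrix. The NumPy sketch below computes exact attention this way; the real kernel's contribution is performing this tiling inside on-chip SRAM with IO-optimal access patterns.

        import numpy as np

        def tiled_attention(Q, K, V, block=64):
            # Exact softmax(Q K^T / sqrt(d)) V, one key/value block at a time.
            n, d = Q.shape
            scale = 1.0 / np.sqrt(d)
            out = np.zeros_like(Q)
            m = np.full(n, -np.inf)        # running row-wise max
            l = np.zeros(n)                # running softmax denominator
            for s in range(0, K.shape[0], block):
                S = (Q @ K[s:s + block].T) * scale
                m_new = np.maximum(m, S.max(axis=1))
                c = np.exp(m - m_new)                  # rescale old statistics
                P = np.exp(S - m_new[:, None])
                l = l * c + P.sum(axis=1)
                out = out * c[:, None] + P @ V[s:s + block]
                m = m_new
            return out / l[:, None]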
    Spatio-Temporal Graph Few-Shot Learning with Cross-City Knowledge Transfer. (arXiv:2205.13947v1 [cs.LG])
    Spatio-temporal graph learning is a key method for urban computing tasks, such as traffic flow, taxi demand and air quality forecasting. Due to the high cost of data collection, some developing cities have little available data, which makes it infeasible to train a well-performing model. To address this challenge, cross-city knowledge transfer has shown promise, where a model learned from data-sufficient cities is leveraged to benefit the learning process of data-scarce cities. However, spatio-temporal graphs in different cities show irregular structures and varied features, which limits the feasibility of existing Few-Shot Learning (\emph{FSL}) methods. Therefore, we propose ST-GFSL, a model-agnostic few-shot learning framework for spatio-temporal graphs. Specifically, to enhance feature extraction by transferring cross-city knowledge, ST-GFSL generates non-shared parameters based on node-level meta knowledge. Nodes in the target city transfer knowledge via parameter matching, retrieving parameters from nodes with similar spatio-temporal characteristics. Furthermore, we propose to reconstruct the graph structure during meta-learning. A graph reconstruction loss is defined to guide structure-aware learning, avoiding structure deviation among different datasets. We conduct comprehensive experiments on four traffic speed prediction benchmarks, and the results demonstrate the effectiveness of ST-GFSL compared with state-of-the-art methods.
    Fast Causal Orientation Learning in Directed Acyclic Graphs. (arXiv:2205.13919v1 [cs.LG])
    Causal relationships among a set of variables are commonly represented by a directed acyclic graph (DAG). The orientations of some edges in the causal DAG can be discovered from observational/interventional data, and further edges can be oriented by iteratively applying so-called Meek rules. Inferring edge orientations from previously oriented edges, which we call Causal Orientation Learning (COL), is a common problem in various causal discovery tasks. These tasks often require solving multiple COL problems, so repeatedly applying Meek rules can be time-consuming. Motivated by Meek rules, we introduce Meek functions that can be utilized in solving COL problems. We show that these functions have desirable properties that enable us to speed up the application of Meek rules, and we propose a dynamic programming (DP) based method to apply them. Moreover, based on the proposed DP method, we present a lower bound on the number of edges that can be oriented as a result of an intervention. We also propose a method to check whether some oriented edges belong to a causal DAG. Experimental results show that the proposed methods outperform previous work in several causal discovery tasks in terms of running time.
    Group-invariant max filtering. (arXiv:2205.14039v1 [cs.IT])
    Given a real inner product space $V$ and a group $G$ of linear isometries, we construct a family of $G$-invariant real-valued functions on $V$ that we call max filters. In the case where $V=\mathbb{R}^d$ and $G$ is finite, a suitable max filter bank separates orbits, and is even bilipschitz in the quotient metric. In the case where $V=L^2(\mathbb{R}^d)$ and $G$ is the group of translation operators, a max filter exhibits stability to diffeomorphic distortion like that of the scattering transform introduced by Mallat. We establish that max filters are well suited for various classification tasks, both in theory and in practice.
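    Concretely, the max filter associated with a template $y \in V$ is the $G$-invariant map $x \mapsto \max_{g \in G} \langle gx, y \rangle$ (equivalently, the group may act on $y$, since $G$ consists of isometries), and a max filter bank stacks these values over several templates $y_1, \dots, y_k$. Invariance holds because replacing $x$ with $hx$ for $h \in G$ merely permutes the set being maximized.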
    Benign Overparameterization in Membership Inference with Early Stopping. (arXiv:2205.14055v1 [cs.LG])
    Does a neural network's privacy have to be at odds with its accuracy? In this work, we study the effects the number of training epochs and parameters have on a neural network's vulnerability to membership inference (MI) attacks, which aim to extract potentially private information about the training data. We first demonstrate how the number of training epochs and parameters individually induce a privacy-utility trade-off: more of either improves generalization performance at the expense of lower privacy. However, remarkably, we also show that jointly tuning both can eliminate this privacy-utility trade-off. Specifically, with careful tuning of the number of training epochs, more overparameterization can increase model privacy for fixed generalization error. To better understand these phenomena theoretically, we develop a powerful new leave-one-out analysis tool to study the asymptotic behavior of linear classifiers and apply it to characterize the sample-specific loss threshold MI attack in high-dimensional logistic regression. For practitioners, we introduce a low-overhead procedure to estimate MI risk and tune the number of training epochs to guard against MI attacks.
    Global Convergence of Over-parameterized Deep Equilibrium Models. (arXiv:2205.13814v1 [cs.LG])
    A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection. Instead of infinite computations, it solves an equilibrium point directly with root-finding and computes gradients with implicit differentiation. The training dynamics of over-parameterized DEQs are investigated in this study. By supposing a condition on the initial equilibrium point, we show that the unique equilibrium point always exists during the training process, and the gradient descent is proved to converge to a globally optimal solution at a linear convergence rate for the quadratic loss function. In order to show that the required initial condition is satisfied via mild over-parameterization, we perform a fine-grained analysis on random DEQs. We propose a novel probabilistic framework to overcome the technical difficulty in the non-asymptotic analysis of infinite-depth weight-tied models.
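    A minimal sketch of the model class under study: the forward pass solves a fixed-point equation (here by naive iteration with small, heuristically contractive weights; practical DEQs use a root finder such as Broyden's method), and gradients come from the implicit function theorem rather than backpropagating through the iterations.

        import numpy as np

        rng = np.random.default_rng(0)
        d = 8
        W = 0.4 * rng.standard_normal((d, d)) / np.sqrt(d)   # small norm, likely a contraction
        U = rng.standard_normal((d, d))

        def deq_forward(x, tol=1e-8, max_iter=500):
            # Solve z* = tanh(W z* + U x). For training, one would use the
            # implicit relation dz*/dtheta = (I - J)^{-1} df/dtheta, where
            # J = df/dz evaluated at z*, instead of unrolling the loop.
            z = np.zeros_like(x)
            for _ in range(max_iter):
                z_next = np.tanh(W @ z + U @ x)
                if np.linalg.norm(z_next - z) < tol:
                    return z_next
                z = z_next
            return z

        z_star = deq_forward(rng.standard_normal(d))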
    Emergent organization of receptive fields in networks of excitatory and inhibitory neurons. (arXiv:2205.13614v1 [q-bio.NC])
    Local patterns of excitation and inhibition that can generate neural waves are studied as a computational mechanism underlying the organization of neuronal tunings. Sparse coding algorithms based on networks of excitatory and inhibitory neurons are proposed that exhibit topographic maps as the receptive fields are adapted to input stimuli. Motivated by a leaky integrate-and-fire model of neural waves, we propose an activation model that is more typical of artificial neural networks. Computational experiments with the activation model using both natural images and natural language text are presented. In the case of images, familiar "pinwheel" patterns of oriented edge detectors emerge; in the case of text, the resulting topographic maps exhibit a 2-dimensional representation of granular word semantics. Experiments with a synthetic model of somatosensory input are used to investigate how the network dynamics may affect plasticity of neuronal maps under changes to the inputs.
    On the Convergence of Semi-Relaxed Sinkhorn with Marginal Constraint and OT Distance Gaps. (arXiv:2205.13846v1 [cs.LG])
    This paper considers the Semi-Relaxed Sinkhorn (SR-Sinkhorn) algorithm for the semi-relaxed optimal transport (SROT) problem, which relaxes one marginal constraint of the standard OT problem. To evaluate how the constraint relaxation affects the algorithm's behavior and solution, a theoretical convergence analysis is needed not only for the functional value gap, but also for the marginal constraint gap and the OT distance gap. However, no existing work has addressed all of these analyses simultaneously. To this end, this paper presents a comprehensive convergence analysis for SR-Sinkhorn. After presenting the $\epsilon$-approximation of the functional value gap based on a new proof strategy, we exploit this strategy to give an upper bound on the marginal constraint gap. We also prove its convergence to the $\epsilon$-approximation when the two distributions are in the probability simplex. Furthermore, the convergence analysis of the OT distance gap to the $\epsilon$-approximation is given with the aid of the obtained marginal constraint gap. The latter two theoretical results are the first presented in the literature on the SROT problem.
    Learning Dynamical Systems via Koopman Operator Regression in Reproducing Kernel Hilbert Spaces. (arXiv:2205.14027v1 [cs.LG])
    We study a class of dynamical systems modelled as Markov chains that admit an invariant distribution via the corresponding transfer, or Koopman, operator. While data-driven algorithms to reconstruct such operators are well known, their relationship with statistical learning is largely unexplored. We formalize a framework to learn the Koopman operator from finite data trajectories of the dynamical system. We consider the restriction of this operator to a reproducing kernel Hilbert space and introduce a notion of risk, from which different estimators naturally arise. We link the risk with the estimation of the spectral decomposition of the Koopman operator. These observations motivate a reduced-rank operator regression (RRR) estimator. We derive learning bounds for the proposed estimator, holding both in i.i.d. and non i.i.d. settings, the latter in terms of mixing coefficients. Our results suggest RRR might be beneficial over other widely used estimators as confirmed in numerical experiments both for forecasting and mode decomposition.
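    A plain-linear (non-kernelized, unregularized) analogue of the reduced-rank estimator conveys the shape of the method: fit the least-squares operator mapping snapshots to their successors, then project onto the leading singular subspace of the fitted values, which is the classical reduced-rank regression solution. The RKHS and regularization details of the paper are omitted here.

        import numpy as np

        def reduced_rank_koopman(X, Y, rank):
            # X: (T, d) snapshots; Y: (T, d) successor snapshots (rows of X
            # advanced one step). Returns a rank-constrained matrix K with
            # X @ K ~ Y; eigenvalues of K approximate Koopman eigenvalues.
            K_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
            _, _, Vt = np.linalg.svd(X @ K_ols, full_matrices=False)
            Vr = Vt[:rank].T                 # leading right singular subspace
            return K_ols @ Vr @ Vr.T

        # spectral decomposition (modes and eigenvalues):
        # eigvals, modes = np.linalg.eig(reduced_rank_koopman(X, Y, r))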
    Improving Bidding and Playing Strategies in the Trick-Taking game Wizard using Deep Q-Networks. (arXiv:2205.13834v1 [cs.LG])
    In this work, the trick-taking game Wizard, with separate bidding and playing phases, is modeled by two interleaved partially observable Markov decision processes (POMDPs). Deep Q-Networks (DQN) are used to empower self-improving agents, which are capable of tackling the challenges of a highly non-stationary environment. To compare algorithms with each other, the accuracy between bid and trick count is monitored; it strongly correlates with the actual rewards and provides well-defined upper and lower performance bounds. The trained DQN agents achieve accuracies between 66% and 87% in self-play, leaving behind both a random baseline and a rule-based heuristic. The conducted analysis also reveals a strong information asymmetry with respect to player positions during bidding. To overcome the missing Markov property of imperfect-information games, a long short-term memory (LSTM) network is implemented to integrate historic information into the decision-making process. Additionally, a forward-directed tree search is conducted by sampling a state of the environment, thereby turning the game into a perfect-information setting. To our surprise, neither approach surpasses the performance of the basic DQN agent.
    Probabilistic Systems with Hidden State and Unobservable Transitions. (arXiv:2205.13871v1 [cs.LG])
    We consider probabilistic systems with hidden state and unobservable transitions, an extension of Hidden Markov Models (HMMs) that in particular admits unobservable $\epsilon$-transitions (also called null transitions), allowing state changes of which the observer is unaware. Due to the presence of $\epsilon$-loops, this additional feature complicates the theory and requires carefully setting up the corresponding probability space and random variables. In particular, we present an algorithm for determining the most probable explanation given an observation (a generalization of the Viterbi algorithm for HMMs) and a method for parameter learning that adapts the probabilities of a given model based on an observation (a generalization of the Baum-Welch algorithm). The latter algorithm guarantees that the given observation has a higher (or equal) probability after adjustment of the parameters, and its correctness can be derived directly from the so-called EM algorithm.
    Standalone Neural ODEs with Sensitivity Analysis. (arXiv:2205.13933v1 [cs.LG])
    This paper presents the Standalone Neural ODE (sNODE), a continuous-depth neural ODE model capable of describing a full deep neural network. Training uses a novel nonlinear conjugate gradient (NCG) descent optimization scheme, in which the Sobolev gradient can be incorporated to improve the smoothness of model weights. We also present a general formulation of the neural sensitivity problem and show how it is used in the NCG training. The sensitivity analysis provides a reliable measure of uncertainty propagation throughout a network, and can be used to study model robustness and to generate adversarial attacks. Our evaluations demonstrate that our novel formulations lead to increased robustness and performance compared to ResNet models, and that they open up new opportunities for designing and developing machine learning with improved explainability.
    Sample-Efficient Optimisation with Probabilistic Transformer Surrogates. (arXiv:2205.13902v1 [cs.LG])
    Faced with problems of increasing complexity, recent research in Bayesian Optimisation (BO) has focused on adapting deep probabilistic models as flexible alternatives to Gaussian Processes (GPs). In a similar vein, this paper investigates the feasibility of employing state-of-the-art probabilistic transformers in BO. Upon further investigation, we observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation. First, we notice that these models are trained on uniformly distributed inputs, which impairs predictive accuracy on non-uniform data - a setting arising from any typical BO loop due to exploration-exploitation trade-offs. Second, we realise that training losses (e.g., cross-entropy) only asymptotically guarantee accurate posterior approximations, i.e., after arriving at the global optimum, which generally cannot be ensured. At the stationary points of the loss function, however, we observe a degradation in predictive performance especially in exploratory regions of the input space. To tackle these shortcomings we introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser trading-off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance. In a large panel of experiments, we demonstrate, for the first time, that one transformer pre-trained on data sampled from random GP priors produces competitive results on 16 benchmark black-boxes compared to GP-based BO. Since our model is only pre-trained once and used in all tasks without any retraining and/or fine-tuning, we report an order of magnitude time-reduction, while matching and sometimes outperforming GPs.
    Counterfactual Analysis in Dynamic Models: Copulas and Bounds. (arXiv:2205.13832v1 [cs.LG])
    We provide an explicit model of the causal mechanism in a structural causal model (SCM) with the goal of estimating counterfactual quantities of interest (CQIs). We propose some standard dependence structures, i.e. copulas, as base cases for the causal mechanism. While these base cases can be used to construct more interesting copulas, there are uncountably many copulas in general and so we formulate optimization problems for bounding the CQIs. As our ultimate goal is counterfactual reasoning in dynamic models which may have latent-states, we show by way of example that filtering / smoothing / sampling methods for these models can be integrated with our modeling of the causal mechanism. Specifically, we consider the "cheating-at-the-casino" application of a hidden Markov model and use linear programming (LP) to construct lower and upper bounds on the casino's winnings due to cheating. These bounds are considerably tighter when we constrain the copulas in the LPs to be time-independent. We can characterize the entire space of SCMs obeying counterfactual stability (CS), and we use it to negatively answer the open question of Oberst and Sontag [18] regarding the uniqueness of the Gumbel-max mechanism for modeling CS. Our work has applications in epidemiology and legal reasoning, and more generally in counterfactual off-policy evaluation, a topic of increasing interest in the reinforcement learning community.
    On Consistency in Graph Neural Network Interpretation. (arXiv:2205.13733v1 [cs.LG])
    Uncovering the rationales behind predictions of graph neural networks (GNNs) has received increasing attention over recent years. Instance-level GNN explanation aims to discover critical input elements, like nodes or edges, that the target GNN relies upon for making predictions. These identified sub-structures can provide interpretations of the GNN's behavior. Though various algorithms have been proposed, most of them formalize this task by searching for the minimal subgraph that preserves the original prediction. An inductive bias is deep-rooted in this framework: the same output cannot guarantee that two inputs are processed under the same rationale. Consequently, these methods risk providing spurious explanations and failing to provide consistent explanations. Applying them to explain weakly-performing GNNs would further amplify these issues. To address the issues, we propose to obtain more faithful and consistent explanations of GNNs. After a close examination of GNN predictions from the causality perspective, we attribute spurious explanations to two typical reasons: the confounding effect of latent variables such as distribution shift, and causal factors distinct from the original input. Observing that both confounding effects and diverse causal rationales are encoded in internal representations, we propose a simple yet effective countermeasure: aligning embeddings. This new objective can be incorporated into existing GNN explanation algorithms with no effort. We implement both a simplified version based on absolute distance and a distribution-aware version based on anchors. Experiments on $5$ datasets validate its effectiveness, and theoretical analysis shows that it in effect optimizes a more faithful explanation objective by design, which further justifies the proposed approach.
    Feudal Multi-Agent Reinforcement Learning with Adaptive Network Partition for Traffic Signal Control. (arXiv:2205.13836v1 [cs.MA])
    Multi-agent reinforcement learning (MARL) has been applied to, and shown great potential in, multi-intersection traffic signal control, where multiple agents, one for each intersection, must cooperate to optimize traffic flow. To encourage global cooperation, previous work partitions the traffic network into several regions and learns policies for agents in a feudal structure. However, a static network partition fails to adapt to dynamic traffic flow, which changes frequently over time. To address this, we propose a novel feudal MARL approach with adaptive network partition. Specifically, we first partition the network into several regions according to the traffic flow. To do this, we propose two approaches: one directly uses a graph neural network (GNN) to generate the network partition, and the other uses Monte-Carlo tree search (MCTS) to find the best partition according to criteria computed by a GNN. Then, we design a variant of QMIX using a GNN to handle the varying input dimensions induced by the dynamic network partition. Finally, we use a feudal hierarchy to manage agents in each partition and promote global cooperation. By doing so, agents are able to adapt to the traffic flow as required in practice. We empirically evaluate our method both on a synthetic traffic grid and on real-world traffic networks of three cities, widely used in the literature. Our experimental results confirm that our method achieves better performance, in terms of average travel time and queue length, than several leading methods for traffic signal control.
    Global Normalization for Streaming Speech Recognition in a Modular Framework. (arXiv:2205.13674v1 [cs.LG])
    We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech-recognition models can be greatly reduced (by more than 50\% on the Librispeech dataset). This model is developed in a modular framework which encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and creation of new models.
    Isolating and Leveraging Controllable and Noncontrollable Visual Dynamics in World Models. (arXiv:2205.13817v1 [cs.LG])
    World models learn the consequences of actions in vision-based interactive systems. However, in practical scenarios such as autonomous driving, there commonly exists noncontrollable dynamics independent of the action signals, making it difficult to learn effective world models. To tackle this problem, we present a novel reinforcement learning approach named Iso-Dream, which improves the Dream-to-Control framework in two aspects. First, by optimizing the inverse dynamics, we encourage the world model to learn controllable and noncontrollable sources of spatiotemporal changes on isolated state transition branches. Second, we optimize the behavior of the agent on the decoupled latent imaginations of the world model. Specifically, to estimate state values, we roll out the noncontrollable states into the future and associate them with the current controllable state. In this way, the isolation of dynamics sources can greatly benefit long-horizon decision-making of the agent, such as a self-driving car that can avoid potential risks by anticipating the movement of other vehicles. Experiments show that Iso-Dream is effective in decoupling the mixed dynamics and remarkably outperforms existing approaches in a wide range of visual control and prediction domains.
    Can Foundation Models Help Us Achieve Perfect Secrecy?. (arXiv:2205.13722v1 [cs.LG])
    A key promise of machine learning is the ability to assist users with personal tasks. Because the personal context required to make accurate predictions is often sensitive, we require systems that protect privacy. A gold standard privacy-preserving system will satisfy perfect secrecy, meaning that interactions with the system provably reveal no additional private information to adversaries. This guarantee should hold even as we perform multiple personal tasks over the same underlying data. However, privacy and quality appear to be in tension in existing systems for personal tasks. Neural models typically require large amounts of training data to perform well, while individual users typically hold only limited data, so existing systems propose to learn from the aggregate data of multiple users. This violates perfect secrecy and instead, in the last few years, academics have defended these solutions using statistical notions of privacy, i.e., that the probability of learning private information about a user should be reasonably low. Given the vulnerabilities of these solutions, we explore whether the strong perfect secrecy guarantee can be achieved using recent zero-to-few sample adaptation techniques enabled by foundation models. We propose FOCUS, a framework for personal tasks. Evaluating on popular privacy benchmarks, we find the approach, which satisfies perfect secrecy, competes with strong collaborative learning baselines on 6 of 7 tasks. We empirically analyze the proposal, highlighting the opportunities and limitations across task types, and model inductive biases and sizes.
    Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks. (arXiv:2205.13816v1 [q-bio.NC])
    Visual object recognition has been extensively studied in both neuroscience and computer vision. Recently, the most popular class of artificial systems for this task, deep convolutional neural networks (CNNs), has been shown to provide excellent models for its functional analogue in the brain, the ventral stream in visual cortex. This has prompted questions on what, if any, are the common principles underlying the reformatting of visual information as it flows through a CNN or the ventral stream. Here we consider some prominent statistical patterns that are known to exist in the internal representations of either CNNs or the visual cortex and look for them in the other system. We show that intrinsic dimensionality (ID) of object representations along the rat homologue of the ventral stream presents two distinct expansion-contraction phases, as previously shown for CNNs. Conversely, in CNNs, we show that training results in both distillation and active pruning (mirroring the increase in ID) of low- to middle-level image information in single units, as representations gain the ability to support invariant discrimination, in agreement with previous observations in rat visual cortex. Taken together, our findings suggest that CNNs and visual cortex share a similarly tight relationship between dimensionality expansion/reduction of object representations and reformatting of image information.
    FedFormer: Contextual Federation with Attention in Reinforcement Learning. (arXiv:2205.13697v1 [cs.LG])
    A core issue in federated reinforcement learning is defining how to aggregate insights from multiple agents into one. This is commonly done by taking the average of each participating agent's model weights into one common model (FedAvg). We instead propose FedFormer, a novel federation strategy that utilizes Transformer attention to contextually aggregate embeddings from models originating from different learner agents. In so doing, we attentively weigh the contributions of other agents with respect to the current agent's environment and learned relationships, thus providing more effective and efficient federation. We evaluate our methods on the Meta-World environment and find that our approach yields significant improvements over FedAvg and non-federated Soft Actor-Critic single-agent methods. Our results compared to Soft Actor-Critic show that FedFormer performs better while still abiding by the privacy constraints of federated learning. In addition, we demonstrate nearly linear improvements in effectiveness with increased agent pools in certain tasks. This is contrasted by FedAvg, which fails to make noticeable improvements when scaled.
    Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4. (arXiv:2203.06649v2 [q-bio.NC] UPDATED)
    Modern high-scoring models of vision in the Brain-Score competition do not stem from Vision Transformers. However, in this paper we provide evidence against the unexpected trend that Vision Transformers (ViT) are not perceptually aligned with human visual representations, by showing how a dual-stream Transformer, a CrossViT $\textit{a la}$ Chen et al. (2021), under a joint rotationally-invariant and adversarial optimization procedure yields 2nd place in the aggregate Brain-Score 2022 competition (Schrimpf et al., 2020b) averaged across all visual categories, and at the time of the competition held 1st place for the highest explainable variance of area V4. In addition, our current Transformer-based model also achieves greater explainable variance for areas V4, IT and Behaviour than a biologically-inspired CNN (ResNet50) that integrates a frontal V1-like computation module (Dapello et al., 2020). To assess the contribution of the optimization scheme with respect to the CrossViT architecture, we perform several additional experiments on differently optimized CrossViTs regarding adversarial robustness, common corruption benchmarks, mid-ventral stimuli interpretation and feature inversion. Against our initial expectations, our family of results provides tentative support for an $\textit{"All roads lead to Rome"}$ argument enforced via a joint optimization rule even for non-biologically-motivated models of vision such as Vision Transformers. Code is available at https://github.com/williamberrios/BrainScore-Transformers
    Effective Abstract Reasoning with Dual-Contrast Network. (arXiv:2205.13720v1 [cs.CV])
    As a step towards improving the abstract reasoning capability of machines, we aim to solve Raven's Progressive Matrices (RPM) with neural networks, since solving RPM puzzles is highly correlated with human intelligence. Unlike previous methods that use auxiliary annotations or assume hidden rules to produce appropriate feature representation, we only use the ground truth answer of each question for model learning, aiming for an intelligent agent to have a strong learning capability with a small amount of supervision. Based on the RPM problem formulation, the correct answer filled into the missing entry of the third row/column has to best satisfy the same rules shared between the first two rows/columns. Thus we design a simple yet effective Dual-Contrast Network (DCNet) to exploit the inherent structure of RPM puzzles. Specifically, a rule contrast module is designed to compare the latent rules between the filled row/column and the first two rows/columns; a choice contrast module is designed to increase the relative differences between candidate choices. Experimental results on the RAVEN and PGM datasets show that DCNet outperforms the state-of-the-art methods by a large margin of 5.77%. Further experiments with few training samples and on model generalization also show the effectiveness of DCNet. Code is available at https://github.com/visiontao/dcnet.
    A Simple and Universal Rotation Equivariant Point-cloud Network. (arXiv:2203.01216v3 [cs.LG] UPDATED)
    Equivariance to permutations and rigid motions is an important inductive bias for various 3D learning problems. Recently it has been shown that the equivariant Tensor Field Network architecture is universal -- it can approximate any equivariant function. In this paper we suggest a much simpler architecture, prove that it enjoys the same universality guarantees and evaluate its performance on Modelnet40. The code to reproduce our experiments is available at \url{https://github.com/simpleinvariance/UniversalNetwork}
    RIGID: Robust Linear Regression with Missing Data. (arXiv:2205.13635v1 [cs.LG])
    We present a robust framework to perform linear regression with missing entries in the features. By considering an elliptical data distribution, and specifically a multivariate normal model, we are able to conditionally formulate a distribution for the missing entries and present a robust framework, which minimizes the worst-case error caused by the uncertainty about the missing data. We show that the proposed formulation, which naturally takes into account the dependency between different variables, ultimately reduces to a convex program, for which a customized and scalable solver can be delivered. In addition to a detailed analysis to deliver such a solver, we also analyze the asymptotic behavior of the proposed framework, and present technical discussions to estimate the required input parameters. We complement our analysis with experiments performed on synthetic, semi-synthetic, and real data, and show how the proposed formulation improves the prediction accuracy and robustness, and outperforms the competing techniques.
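    For concreteness, the conditional formulation for the missing entries under a multivariate normal model can be written down in a few lines; the sketch below is a generic textbook computation with placeholder parameters, not the paper's solver.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, obs_idx, x_obs):
    """Mean and covariance of x[missing] | x[observed] for x ~ N(mu, Sigma)."""
    d = len(mu)
    mis_idx = np.setdiff1d(np.arange(d), obs_idx)
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    K = S_mo @ np.linalg.solve(S_oo, np.eye(len(obs_idx)))   # S_mo S_oo^{-1}
    mu_cond = mu[mis_idx] + K @ (x_obs - mu[obs_idx])
    Sigma_cond = S_mm - K @ S_mo.T
    return mu_cond, Sigma_cond

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + np.eye(4)        # a valid covariance (placeholder)
mu = np.zeros(4)
mu_c, Sig_c = conditional_gaussian(mu, Sigma, np.array([0, 2]), np.array([1.0, -0.5]))
```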
    Subverting machines, fluctuating identities: Re-learning human categorization. (arXiv:2205.13740v1 [cs.LG])
    Most machine learning systems that interact with humans construct some notion of a person's "identity," yet the default paradigm in AI research envisions identity with essential attributes that are discrete and static. In stark contrast, strands of thought within critical theory present a conception of identity as malleable and constructed entirely through interaction; a doing rather than a being. In this work, we distill some of these ideas for machine learning practitioners and introduce a theory of identity as autopoiesis, circular processes of formation and function. We argue that the default paradigm of identity used by the field immobilizes existing identity categories and the power differentials that co-occur, due to the absence of iterative feedback to our models. This includes a critique of emergent AI fairness practices that continue to impose the default paradigm. Finally, we apply our theory to sketch approaches to autopoietic identity through multilevel optimization and relational learning. While these ideas raise many open questions, we imagine the possibilities of machines that are capable of expressing human identity as a relationship perpetually in flux.
    Regularized Gradient Descent Ascent for Two-Player Zero-Sum Markov Games. (arXiv:2205.13746v1 [math.OC])
    We study the problem of finding the Nash equilibrium in a two-player zero-sum Markov game. Due to its formulation as a minimax optimization program, a natural approach to solve the problem is to perform gradient descent/ascent with respect to each player in an alternating fashion. However, due to the non-convexity/non-concavity of the underlying objective function, theoretical understandings of this method are limited. In our paper, we consider solving an entropy-regularized variant of the Markov game. The regularization introduces structure into the optimization landscape that makes the solutions more identifiable and allows the problem to be solved more efficiently. Our main contribution is to show that under proper choices of the regularization parameter, the gradient descent ascent algorithm converges to the Nash equilibrium of the original unregularized problem. We explicitly characterize the finite-time performance of the last iterate of our algorithm, which vastly improves over the existing convergence bound of the gradient descent ascent algorithm without regularization. Finally, we complement the analysis with numerical simulations that illustrate the accelerated convergence of the algorithm.
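    A toy illustration of the algorithmic idea on a plain matrix game (not a Markov game): gradient descent-ascent on an entropy-regularised bilinear objective, with both strategies kept on the simplex through a softmax parameterisation of logits. All constants are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))          # payoff matrix; x minimises, y maximises
tau, eta, steps = 0.1, 0.1, 5000
u, v = np.zeros(4), np.zeros(4)          # logits of the two mixed strategies

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(steps):
    x, y = softmax(u), softmax(v)
    # f(x, y) = x^T A y + tau * sum(x log x) - tau * sum(y log y)
    gx = A @ y + tau * (1.0 + np.log(x))      # df/dx
    gy = A.T @ x - tau * (1.0 + np.log(y))    # df/dy
    # chain rule through softmax: J^T g = x * (g - <x, g>)
    u -= eta * x * (gx - x @ gx)              # descent step for the min player
    v += eta * y * (gy - y @ gy)              # ascent step for the max player

print("approximate regularized equilibrium:", softmax(u), softmax(v))
```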
    A Sea of Words: An In-Depth Analysis of Anchors for Text Data. (arXiv:2205.13789v1 [stat.ML])
    Anchors [Ribeiro et al. (2018)] is a post-hoc, rule-based interpretability method. For text data, it proposes to explain a decision by highlighting a small set of words (an anchor) such that the model to explain has similar outputs when they are present in a document. In this paper, we present the first theoretical analysis of Anchors, considering that the search for the best anchor is exhaustive. We leverage this analysis to gain insights on the behavior of Anchors on simple models, including elementary if-then rules and linear classifiers.
    Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. (arXiv:2205.13760v1 [cs.LG])
    The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact that many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers a significant gain in scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.
    Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization. (arXiv:2112.08217v2 [stat.ML] UPDATED)
    Generative networks are often trained to minimize a statistical divergence between the reference distribution and the generative one in an adversarial setting. Some works instead train generative networks to minimize Scoring Rules, functions assessing how well the generative distribution matches each training sample individually. We show how the Scoring Rule formulation easily extends to the so-called prequential (predictive-sequential) score, whose minimization allows performing probabilistic forecasting with generative networks. This objective leads to adversarial-free training, therefore easily avoiding uncertainty underestimation due to mode collapse, which is a common issue in the adversarial setting and undesirable for probabilistic forecasting. We provide consistency guarantees for the minimizer of the prequential score and employ it to perform probabilistic forecasting for two chaotic dynamical models and a benchmark dataset of global weather observations. For this last example, we define scoring rules for spatial data by drawing from the relevant literature, with which we obtain better uncertainty quantification with little hyperparameter tuning compared to adversarial training.
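    As a hedged sketch of adversarial-free scoring-rule training (using the energy score as one concrete scoring rule, with a placeholder generator and a single stand-in observation), the idea is to draw several samples per observation and minimise the score directly:

```python
import torch
import torch.nn as nn

gen = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

def energy_score(samples, y):
    # samples: (m, d) draws from the model; y: (d,) the observation.
    term1 = torch.cdist(samples, y.unsqueeze(0)).mean()
    # Rough version: the sample-sample term includes the zero diagonal.
    term2 = torch.cdist(samples, samples).mean()
    return term1 - 0.5 * term2     # negatively oriented: lower is better

y = torch.randn(2)                 # stand-in for a training observation
for _ in range(100):
    z = torch.randn(16, 8)         # latent noise, m = 16 samples per step
    loss = energy_score(gen(z), y)
    opt.zero_grad(); loss.backward(); opt.step()
```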
    Bootstrapping Informative Graph Augmentation via A Meta Learning Approach. (arXiv:2201.03812v3 [cs.LG] UPDATED)
    Recent works explore learning graph representations in a self-supervised manner. In graph contrastive learning, benchmark methods apply various graph augmentation approaches. However, most of the augmentation methods are non-learnable, which causes the issue of generating unbeneficial augmented graphs. Such augmentation may degenerate the representation ability of graph contrastive learning methods. Therefore, we propose to generate augmented graphs with a learnable graph augmenter, called MEta Graph Augmentation (MEGA). We then clarify that a "good" graph augmentation must have uniformity at the instance-level and informativeness at the feature-level. To this end, we propose a novel approach to learning a graph augmenter that can generate an augmentation with uniformity and informativeness. The objective of the graph augmenter is to promote our feature extraction network to learn a more discriminative feature representation, which motivates us to propose a meta-learning paradigm. Empirically, experiments across multiple benchmark datasets demonstrate that MEGA outperforms the state-of-the-art methods in graph self-supervised learning tasks. Further experimental studies prove the effectiveness of the different terms of MEGA.
    Generating personalized counterfactual interventions for algorithmic recourse by eliciting user preferences. (arXiv:2205.13743v1 [cs.LG])
    Counterfactual interventions are a powerful tool to explain the decisions of a black-box decision process, and to enable algorithmic recourse. They are a sequence of actions that, if performed by a user, can overturn an unfavourable decision made by an automated decision system. However, most of the current methods provide interventions without considering the user's preferences. For example, a user might prefer certain actions over others. In this work, we present the first human-in-the-loop approach to perform algorithmic recourse by eliciting user preferences. We introduce a polynomial procedure to ask choice-set questions which maximize the Expected Utility of Selection (EUS), and use it to iteratively refine our cost estimates in a Bayesian setting. We integrate this preference elicitation strategy into a reinforcement learning agent coupled with Monte Carlo Tree Search for efficient exploration, so as to provide personalized interventions achieving algorithmic recourse. An experimental evaluation on synthetic and real-world datasets shows that a handful of queries suffices to achieve a substantial reduction in the cost of interventions compared with user-independent alternatives.
    (De-)Randomized Smoothing for Decision Stump Ensembles. (arXiv:2205.13909v1 [cs.LG])
    Tree-based models are used in many high-stakes application domains such as finance and medicine, where robustness and interpretability are of utmost importance. Yet, methods for improving and certifying their robustness are severely under-explored, in contrast to those focusing on neural networks. Targeting this important challenge, we propose deterministic smoothing for decision stump ensembles. Whereas most prior work on randomized smoothing focuses on evaluating arbitrary base models approximately under input randomization, the key insight of our work is that decision stump ensembles enable exact yet efficient evaluation via dynamic programming. Importantly, we obtain deterministic robustness certificates, even jointly over numerical and categorical features, a setting ubiquitous in the real world. Further, we derive an MLE-optimal training method for smoothed decision stumps under randomization and propose two boosting approaches to improve their provable robustness. An extensive experimental evaluation shows that our approach yields significantly higher certified accuracies than the state-of-the-art for tree-based models. We release all code and trained models at ANONYMIZED.
    Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions. (arXiv:2205.13803v1 [cs.CV])
    A significant gap remains between today's visual pattern recognition models and human-level visual cognition especially when it comes to few-shot learning and compositional reasoning of novel concepts. We introduce Bongard-HOI, a new visual reasoning benchmark that focuses on compositional learning of human-object interactions (HOIs) from natural images. It is inspired by two desirable characteristics from the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning. We carefully curate the few-shot instances with hard negatives, where positive and negative images only disagree on action labels, making mere recognition of object categories insufficient to complete our benchmarks. We also design multiple test sets to systematically study the generalization of visual learning models, where we vary the overlap of the HOI concepts between the training and test sets of few-shot instances, from partial to no overlaps. Bongard-HOI presents a substantial challenge to today's visual recognition models. The state-of-the-art HOI detection model achieves only 62% accuracy on few-shot binary prediction while even amateur human testers on MTurk have 91% accuracy. With the Bongard-HOI benchmark, we hope to further advance research efforts in visual reasoning, especially in holistic perception-reasoning systems and better representation learning.
    VectorAdam for Rotation Equivariant Geometry Optimization. (arXiv:2205.13599v1 [cs.LG])
    The rise of geometric problems in machine learning has necessitated the development of equivariant methods, which preserve their output under the action of rotation or some other transformation. At the same time, the Adam optimization algorithm has proven remarkably effective across machine learning and even traditional tasks in geometric optimization. In this work, we observe that naively applying Adam to optimize vector-valued data is not rotation equivariant, due to per-coordinate moment updates, and in fact this leads to significant artifacts and biases in practice. We propose to resolve this deficiency with VectorAdam, a simple modification which makes Adam rotation-equivariant by accounting for the vector structure of optimization variables. We demonstrate this approach on problems in machine learning and traditional geometric optimization, showing that equivariant VectorAdam resolves the artifacts and biases of traditional Adam when applied to vector-valued data, with equivalent or even improved rates of convergence.
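    A minimal NumPy sketch of the modification as described: the second-moment statistic is shared across the coordinates of each vector (each row of the array below), making the adaptive scaling invariant to rotations of those vectors. Hyperparameters are the usual Adam defaults; the demo gradient is a placeholder.

```python
import numpy as np

def vector_adam_step(w, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # w, g, m: (N, 3) arrays of vectors; v: (N, 1) per-vector second moment.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * np.sum(g * g, axis=1, keepdims=True)  # squared norm
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # one scale per vector
    return w, m, v

w = np.random.randn(100, 3)
m, v = np.zeros_like(w), np.zeros((100, 1))
g = 2 * w                                         # gradient of sum of squared norms
w, m, v = vector_adam_step(w, g, m, v, t=1)
```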
    Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters. (arXiv:2205.13703v1 [cs.LG])
    Motivated by the success of ensembles for uncertainty estimation in supervised learning, we take a renewed look at how ensembles of $Q$-functions can be leveraged as the primary source of pessimism for offline reinforcement learning (RL). We begin by identifying a critical flaw in a popular algorithmic choice used by many ensemble-based RL algorithms, namely the use of shared pessimistic target values when computing each ensemble member's Bellman error. Through theoretical analyses and construction of examples in toy MDPs, we demonstrate that shared pessimistic targets can paradoxically lead to value estimates that are effectively optimistic. Given this result, we propose MSG, a practical offline RL algorithm that trains an ensemble of $Q$-functions with independently computed targets based on completely separate networks, and optimizes a policy with respect to the lower confidence bound of predicted action values. Our experiments on the popular D4RL and RL Unplugged offline RL benchmarks demonstrate that on challenging domains such as antmazes, MSG with deep ensembles surpasses well-tuned state-of-the-art methods by a wide margin. Additionally, through ablations on benchmark domains, we verify the critical significance of using independently trained $Q$-functions, and study the role of ensemble size. Finally, as using separate networks per ensemble member can become computationally costly with larger neural network architectures, we investigate whether efficient ensemble approximations developed for supervised learning can be similarly effective, and demonstrate that they do not match the performance and robustness of MSG with separate networks, highlighting the need for new efforts into efficient uncertainty estimation directed at RL.
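    Schematically, the contrast between shared and independent targets can be shown with placeholder tensors (no replay buffer or networks; one common pessimistic aggregate stands in for the shared target); the lower-confidence-bound policy objective is the ensemble mean minus a multiple of its standard deviation.

```python
import torch

K, B = 4, 32                              # ensemble size, batch size
r = torch.randn(B)                        # rewards (placeholder)
q_next = torch.randn(K, B)                # each member's target net at (s', a')
gamma, beta = 0.99, 2.0

# Shared pessimistic target (the flawed choice analysed in the paper):
# every member regresses to the same aggregate value.
shared = r + gamma * (q_next.mean(0) - beta * q_next.std(0))   # shape (B,)

# Independent targets (MSG-style): member k regresses only to its own target net.
independent = r.unsqueeze(0) + gamma * q_next                  # shape (K, B)

# Policy objective: lower confidence bound of the ensemble's predictions.
q_pred = torch.randn(K, B)                # members' Q-values at (s, pi(s))
lcb = q_pred.mean(0) - beta * q_pred.std(0)
```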
    Inference and Sampling for Archimax Copulas. (arXiv:2205.14025v1 [stat.ME])
    Understanding multivariate dependencies in both the bulk and the tails of a distribution is an important problem for many applications, such as ensuring algorithms are robust to observations that are infrequent but have devastating effects. Archimax copulas are a family of distributions endowed with a precise representation that allows simultaneous modeling of the bulk and the tails of a distribution. Rather than separating the two as is typically done in practice, incorporating additional information from the bulk may improve inference of the tails, where observations are limited. Building on the stochastic representation of Archimax copulas, we develop a non-parametric inference method and sampling algorithm. Our proposed methods, to the best of our knowledge, are the first that allow for highly flexible and scalable inference and sampling algorithms, enabling the increased use of Archimax copulas in practical settings. We experimentally compare to state-of-the-art density modeling techniques, and the results suggest that the proposed method effectively extrapolates to the tails while scaling to higher dimensional data. Our findings suggest that the proposed algorithms can be used in a variety of applications where understanding the interplay between the bulk and the tails of a distribution is necessary, such as healthcare and safety.
    Solving infinite-horizon POMDPs with memoryless stochastic policies in state-action space. (arXiv:2205.14098v1 [cs.LG])
    Reward optimization in fully observable Markov decision processes is equivalent to a linear program over the polytope of state-action frequencies. Taking a similar perspective in the case of partially observable Markov decision processes with memoryless stochastic policies, the problem was recently formulated as the optimization of a linear objective subject to polynomial constraints. Based on this we present an approach for Reward Optimization in State-Action space (ROSA). We test this approach experimentally in maze navigation tasks. We find that ROSA is computationally efficient and can yield stability improvements over other existing methods.
    Spartan: Differentiable Sparsity via Regularized Transportation. (arXiv:2205.14107v1 [cs.LG])
    We present Spartan, a method for training sparse neural network models with a predetermined level of sparsity. Spartan is based on a combination of two techniques: (1) soft top-k masking of low-magnitude parameters via a regularized optimal transportation problem and (2) dual averaging-based parameter updates with hard sparsification in the forward pass. This scheme realizes an exploration-exploitation tradeoff: early in training, the learner is able to explore various sparsity patterns, and as the soft top-k approximation is gradually sharpened over the course of training, the balance shifts towards parameter optimization with respect to a fixed sparsity mask. Spartan is sufficiently flexible to accommodate a variety of sparsity allocation policies, including both unstructured and block structured sparsity, as well as general cost-sensitive sparsity allocation mediated by linear models of per-parameter costs. On ImageNet-1K classification, Spartan yields 95% sparse ResNet-50 models and 90% block sparse ViT-B/16 models while incurring absolute top-1 accuracy losses of less than 1% compared to fully dense training.
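    A much-simplified stand-in for the masking scheme (the paper uses a regularised optimal-transport soft top-k; here a sigmoid around the k-th largest magnitude plays that role), combined with a hard straight-through forward pass:

```python
import torch

def soft_topk_mask(scores, k, temp):
    # Smooth relaxation: sigmoid centred at the k-th largest score.
    thresh = torch.kthvalue(scores, scores.numel() - k + 1).values
    return torch.sigmoid((scores - thresh) / temp)

def spartan_like_forward(w, k, temp):
    soft = soft_topk_mask(w.abs().flatten(), k, temp).view_as(w)
    hard = (soft > 0.5).float()
    # Straight-through: hard mask in the forward pass, soft gradients backward.
    mask = hard + soft - soft.detach()
    return w * mask

w = torch.randn(64, 64, requires_grad=True)
out = spartan_like_forward(w, k=512, temp=0.05)   # keep roughly 512 weights
```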
    Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation. (arXiv:2107.06011v3 [cs.CV] UPDATED)
    In the context of visual navigation, the capacity to map a novel environment is necessary for an agent to exploit its observation history in the considered place and efficiently reach known goals. This ability can be associated with spatial reasoning, where an agent is able to perceive spatial relationships and regularities, and discover object characteristics. Recent work introduces learnable policies parametrized by deep neural networks and trained with Reinforcement Learning (RL). In classical RL setups, the capacity to map and reason spatially is learned end-to-end, from reward alone. In this setting, we introduce supplementary supervision in the form of auxiliary tasks designed to favor the emergence of spatial perception capabilities in agents trained for a goal-reaching downstream objective. We show that learning to estimate metrics quantifying the spatial relationships between an agent at a given location and a goal to reach has a high positive impact in Multi-Object Navigation settings. Our method significantly improves the performance of different baseline agents, that either build an explicit or implicit representation of the environment, even matching the performance of incomparable oracle agents taking ground-truth maps as input. A learning-based agent from the literature trained with the proposed auxiliary losses was the winning entry to the Multi-Object Navigation Challenge, part of the CVPR 2021 Embodied AI Workshop.
    Learning to Control Linear Systems can be Hard. (arXiv:2205.14035v1 [cs.LG])
    In this paper, we study the statistical difficulty of learning to control linear systems. We focus on two standard benchmarks, the sample complexity of stabilization, and the regret of the online learning of the Linear Quadratic Regulator (LQR). Prior results state that the statistical difficulty for both benchmarks scales polynomially with the system state dimension up to system-theoretic quantities. However, this does not reveal the whole picture. By utilizing minimax lower bounds for both benchmarks, we prove that there exist non-trivial classes of systems for which learning complexity scales dramatically, i.e. exponentially, with the system dimension. This situation arises in the case of underactuated systems, i.e. systems with fewer inputs than states. Such systems are structurally difficult to control and their system theoretic quantities can scale exponentially with the system dimension dominating learning complexity. Under some additional structural assumptions (bounding systems away from uncontrollability), we provide qualitatively matching upper bounds. We prove that learning complexity can be at most exponential with the controllability index of the system, that is the degree of underactuation.
    Probabilistic Transformer: Modelling Ambiguities and Distributions for RNA Folding and Molecule Design. (arXiv:2205.13927v1 [cs.LG])
    Our world is ambiguous and this is reflected in the data we use to train our algorithms. This is especially true when we try to model natural processes where collected data is affected by noisy measurements and differences in measurement techniques. Sometimes, the process itself can be ambiguous, such as in the case of RNA folding, where a single nucleotide sequence can fold into multiple structures. This ambiguity suggests that a predictive model should have similar probabilistic characteristics to match the data it models. Therefore, we propose a hierarchical latent distribution to enhance one of the most successful deep learning models, the Transformer, to accommodate ambiguities and data distributions. We show the benefits of our approach on a synthetic task, with state-of-the-art results in RNA folding, and demonstrate its generative capabilities on property-based molecule design, outperforming existing work.
    Composing Partial Differential Equations with Physics-Aware Neural Networks. (arXiv:2111.11798v2 [cs.LG] UPDATED)
    We introduce a compositional physics-aware FInite volume Neural Network (FINN) for learning spatiotemporal advection-diffusion processes. FINN implements a new way of combining the learning abilities of artificial neural networks with physical and structural knowledge from numerical simulation by modeling the constituents of partial differential equations (PDEs) in a compositional manner. Results on both one- and two-dimensional PDEs (Burgers', diffusion-sorption, diffusion-reaction, Allen--Cahn) demonstrate FINN's superior modeling accuracy and excellent out-of-distribution generalization ability beyond initial and boundary conditions. With only one tenth of the number of parameters on average, FINN outperforms pure machine learning and other state-of-the-art physics-aware models in all cases -- often even by multiple orders of magnitude. Moreover, FINN outperforms a calibrated physical model when approximating sparse real-world data in a diffusion-sorption scenario, confirming its generalization abilities and showing explanatory potential by revealing the unknown retardation factor of the observed process.
    Bias Reduction via Cooperative Bargaining in Synthetic Graph Dataset Generation. (arXiv:2205.13901v1 [cs.LG])
    In general, to draw robust conclusions from a dataset, the entire analyzed population must be represented in said dataset. Having a dataset that does not fulfill this condition normally leads to selection bias. Additionally, graphs have been used to model a wide variety of problems. Although synthetic graphs can be used to augment available real graph datasets to overcome selection bias, the generation of unbiased synthetic datasets is complex with current tools. In this work, we propose a method to find a synthetic graph dataset that has an even representation of graphs with different metrics. The resulting dataset can then be used, among other purposes, for benchmarking graph processing techniques, such as the accuracy of different Graph Neural Network (GNN) models or the speedups obtained by different graph processing acceleration frameworks.
    MIMII DG: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection for Domain Generalization Task. (arXiv:2205.13879v1 [cs.SD])
    We present a machine sound dataset to benchmark domain generalization techniques for anomalous sound detection (ASD). To handle performance degradation caused by domain shifts that are difficult to detect or too frequent to adapt to, domain generalization techniques are preferred. However, currently available datasets have difficulties in evaluating these techniques, such as a limited number of values for the parameters that cause domain shifts (domain shift parameters). In this paper, we present the first ASD dataset for domain generalization techniques, called MIMII DG. The dataset consists of five machine types and three domain shift scenarios for each machine type. We prepared at least two values for the domain shift parameters in the source domain. Also, we introduced domain shifts that can be difficult to notice. Experimental results using two baseline systems indicate that the dataset reproduces the domain shift scenarios and is useful for benchmarking domain generalization techniques.
    Non-Markovian policies occupancy measures. (arXiv:2205.13950v1 [cs.LG])
    A central object of study in Reinforcement Learning (RL) is the Markovian policy, in which an agent's actions are chosen from a memoryless probability distribution, conditioned only on its current state. The family of Markovian policies is broad enough to be interesting, yet simple enough to be amenable to analysis. However, RL often involves more complex policies: ensembles of policies, policies over options, policies updated online, etc. Our main contribution is to prove that the occupancy measure of any non-Markovian policy, i.e., the distribution of transition samples collected with it, can be equivalently generated by a Markovian policy. This result allows theorems about the Markovian policy class to be directly extended to its non-Markovian counterpart, greatly simplifying proofs, in particular those involving replay buffers and datasets. We provide various examples of such applications to the field of Reinforcement Learning.
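    The construction behind the result can be illustrated directly: estimate the state-action occupancy of an arbitrary (possibly non-Markovian) data collector from transition samples, then read off the Markovian policy $\pi(a|s) = d(s,a)/d(s)$ that generates the same occupancy. The sketch below uses made-up counts.

```python
import numpy as np

n_states, n_actions = 5, 3
samples = [(0, 1), (0, 1), (0, 2), (3, 0), (3, 0)]  # (s, a) pairs, placeholder

counts = np.zeros((n_states, n_actions))
for s, a in samples:
    counts[s, a] += 1

d_sa = counts / counts.sum()                   # empirical occupancy measure
d_s = d_sa.sum(axis=1, keepdims=True)
# Markovian policy matching the occupancy; uniform where a state is unvisited.
pi = np.divide(d_sa, d_s, out=np.full_like(d_sa, 1.0 / n_actions),
               where=d_s > 0)
```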
    Counterfactual Fairness with Partially Known Causal Graph. (arXiv:2205.13972v1 [cs.LG])
    Fair machine learning aims to avoid treating individuals or sub-populations unfavourably based on \textit{sensitive attributes}, such as gender and race. Those methods in fair machine learning that are built on causal inference ascertain discrimination and bias through causal effects. Though causality-based fair learning is attracting increasing attention, current methods assume the true causal graph is fully known. This paper proposes a general method to achieve the notion of counterfactual fairness when the true causal graph is unknown. To be able to select features that lead to counterfactual fairness, we derive the conditions and algorithms to identify ancestral relations between variables on a \textit{Partially Directed Acyclic Graph (PDAG)}, specifically, a class of causal DAGs that can be learned from observational data combined with domain knowledge. Interestingly, we find that counterfactual fairness can be achieved as if the true causal graph were fully known, when specific background knowledge is provided: the sensitive attributes do not have ancestors in the causal graph. Results on both simulated and real-world datasets demonstrate the effectiveness of our method.
    Learning with Stochastic Orders. (arXiv:2205.13684v1 [stat.ML])
    Learning high-dimensional distributions is often done with explicit likelihood modeling or implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the convex or Choquet order between probability measures. Towards this end, we introduce the Choquet-Toland distance between probability measures, that can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints, that encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities and show that they suffer from the curse of dimensionality and propose surrogates via input convex maxout networks (ICMNs), that enjoy parametric rates. Finally, we provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. The code is available at https://github.com/yair-schiff/stochastic-orders-ICMN
    Transformers from an Optimization Perspective. (arXiv:2205.13891v1 [cs.LG])
    Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond with the Transformer forward pass? By finding such a function, we can reinterpret Transformers as the unfolding of an interpretable optimization process across iterations. This unfolding perspective has been frequently adopted in the past to elucidate more straightforward deep models such as MLPs and CNNs; however, obtaining a similar equivalence for more complex models with self-attention mechanisms like the Transformer has thus far remained elusive. To this end, we first outline several major obstacles before providing companion techniques to at least partially address them, demonstrating for the first time a close association between energy function minimization and deep layers with self-attention. This interpretation contributes to our intuition and understanding of Transformers, while potentially laying the groundwork for new model designs.
    Off-Beat Multi-Agent Reinforcement Learning. (arXiv:2205.13718v1 [cs.MA])
    We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent, i.e., all actions have pre-set execution durations. During execution durations, the environment changes are influenced by, but not synchronised with, action execution. Such a setting is ubiquitous in many real-world problems. However, most MARL methods assume actions are executed immediately after inference, which is often unrealistic and can lead to catastrophic failure for multi-agent coordination with off-beat actions. In order to fill this gap, we develop an algorithmic framework for MARL with off-beat actions. We then propose a novel episodic memory, LeGEM, for model-free MARL algorithms. LeGEM builds agents' episodic memories by utilizing agents' individual experiences. It boosts multi-agent learning by addressing the challenging temporal credit assignment problem raised by the off-beat actions via our novel reward redistribution scheme, alleviating the issue of non-Markovian reward. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks. Empirical results show that LeGEM significantly boosts multi-agent coordination and achieves leading performance and improved sample efficiency.
    HOUDINI: Escaping from Moderately Constrained Saddles. (arXiv:2205.13753v1 [cs.LG])
    We give the first polynomial time algorithms for escaping from high-dimensional saddle points under a moderate number of constraints. Given gradient access to a smooth function $f \colon \mathbb R^d \to \mathbb R$ we show that (noisy) gradient descent methods can escape from saddle points under a logarithmic number of inequality constraints. This constitutes the first tangible progress (without reliance on NP-oracles or altering the definitions to only account for certain constraints) on the main open question of the breakthrough work of Ge et al. who showed an analogous result for unconstrained and equality-constrained problems. Our results hold for both regular and stochastic gradient descent.
    End-to-End Learning of Hybrid Inverse Dynamics Models for Precise and Compliant Impedance Control. (arXiv:2205.13804v1 [cs.RO])
    It is well-known that inverse dynamics models can improve tracking performance in robot control. These models need to precisely capture the robot dynamics, which consist of well-understood components, e.g., rigid body dynamics, and effects that remain challenging to capture, e.g., stick-slip friction and mechanical flexibilities. Such effects exhibit hysteresis and partial observability, rendering them particularly challenging to model. Hence, hybrid models, which combine a physical prior with data-driven approaches, are especially well-suited in this setting. We present a novel hybrid model formulation that enables us to identify fully physically consistent inertial parameters of a rigid body dynamics model, which is paired with a recurrent neural network architecture, allowing us to capture unmodeled partially observable effects using the network memory. We compare our approach against state-of-the-art inverse dynamics models on a 7-degree-of-freedom manipulator. Using data sets obtained through an optimal experiment design approach, we study the accuracy of offline torque prediction and the generalization capabilities of joint learning methods. In control experiments on the real system, we evaluate the model as a feed-forward term for impedance control and show that the feedback gains can be drastically reduced to achieve a given tracking accuracy.
    Multivariate Probabilistic Forecasting of Intraday Electricity Prices using Normalizing Flows. (arXiv:2205.13826v1 [cs.LG])
    Electricity is traded on various markets with different time horizons and regulations. Short-term trading becomes increasingly important due to higher penetration of renewables. In Germany, the intraday electricity price typically fluctuates around the day-ahead price of the EPEX spot markets in a distinct hourly pattern. This work proposes a probabilistic modeling approach that models the intraday price difference to the day-ahead contracts. The model captures the emerging hourly pattern by considering the four 15 min intervals in each day-ahead price interval as a four-dimensional joint distribution. The resulting nontrivial, multivariate price difference distribution is learned using a normalizing flow, i.e., a deep generative model that combines conditional multivariate density estimation and probabilistic regression. The normalizing flow is compared to a selection of historical data, a Gaussian copula, and a Gaussian regression model. Among the different models, the normalizing flow identifies the trends most accurately and has the narrowest prediction intervals. Notably, the normalizing flow is the only approach that identifies rare price peaks. Finally, this work discusses the influence of different external impact factors and finds that, individually, most of these factors have negligible impact. Only the immediate history of the price difference realization and the combination of all input factors lead to notable improvements in the forecasts.
    Comparison of Deep Learning Segmentation and Multigrader-annotated Mandibular Canals of Multicenter CBCT scans. (arXiv:2205.13874v1 [cs.LG])
    Deep learning approaches have been demonstrated to automatically segment the bilateral mandibular canals from CBCT scans, yet systematic studies of their clinical and technical validation are scarce. To validate the mandibular canal localization accuracy of a deep learning system (DLS), we trained it on 982 CBCT scans and evaluated it using 150 scans from five scanners, drawn from clinical-workflow patients of European and Southeast Asian institutes and annotated by four radiologists. The interobserver variability was compared to the variability between the DLS and the radiologists. In addition, the generalization of the DLS to CBCT scans from scanners not used in the training data was examined to evaluate its out-of-distribution generalization capability. The DLS showed lower variability relative to the radiologists than the interobserver variability among them, and it was able to generalize to three new devices. Against the radiologists' consensus segmentation, used as the gold standard, the DLS achieved a symmetric mean curve distance of 0.39 mm, compared to 0.62 mm, 0.55 mm, 0.47 mm, and 0.42 mm for the individual radiologists. The DLS showed comparable or slightly better performance than the radiologists in segmenting the mandibular canal, together with generalization capability to new scanners.
    Deep Reinforcement Learning for Distributed and Uncoordinated Cognitive Radios Resource Allocation. (arXiv:2205.13944v1 [cs.LG])
    This paper presents a novel deep reinforcement learning-based resource allocation technique for the multi-agent environment presented by a cognitive radio network, where the interactions of the agents during learning may lead to a non-stationary environment. The resource allocation technique presented in this work is distributed, not requiring coordination with other agents. By considering aspects specific to deep reinforcement learning, it is shown that the presented algorithm converges in an arbitrarily long time to equilibrium policies in the non-stationary multi-agent environment that results from the uncoordinated dynamic interaction between radios through the shared wireless environment. Simulation results show that the presented technique achieves faster learning performance than an equivalent table-based Q-learning algorithm and is able to find the optimal policy in 99% of cases for a sufficiently long learning time. In addition, simulations show that our DQL approach requires less than half the number of learning steps to achieve the same performance as an equivalent table-based implementation. Moreover, it is shown that the use of a standard single-agent deep reinforcement learning approach may not achieve convergence when used in an uncoordinated interacting multi-radio scenario.
    Group GAN. (arXiv:2205.13741v1 [cs.LG])
    Generating multivariate time series is a promising approach for sharing sensitive data in many medical, financial, and IoT applications. A common type of multivariate time series originates from a single source, such as the biometric measurements of a medical patient. This leads to complex dynamical patterns between individual time series that are hard to learn for typical generation models such as GANs. There is valuable information in those patterns that machine learning models can use to better classify, predict, or perform other downstream tasks. We propose a novel framework that takes time series' common origin into account and favors inter-channel relationship preservation. The two key points of our method are: 1) the individual time series are generated from a common point in latent space and 2) a central discriminator favors the preservation of inter-channel dynamics. We demonstrate empirically that our method helps preserve channel correlations and that our synthetic data performs very well on downstream tasks with medical and financial data.
    Membership Inference Attack Using Self Influence Functions. (arXiv:2205.13680v1 [cs.LG])
    Membership inference (MI) attacks aim to determine if a specific data sample was used to train a machine learning model. Thus, MI is a major privacy threat to models trained on private sensitive data, such as medical records. In MI attacks one may consider the black-box setting, where the model's parameters and activations are hidden from the adversary, or the white-box case where they are available to the attacker. In this work, we focus on the latter and present a novel MI attack for it that employs influence functions, or more specifically the samples' self-influence scores, to perform the MI prediction. We evaluate our attack on CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, using versatile architectures such as AlexNet, ResNet, and DenseNet. Our attack method achieves new state-of-the-art results for both training with and without data augmentations. Code is available at https://github.com/giladcohen/sif_mi_attack.
    Privacy of Noisy Stochastic Gradient Descent: More Iterations without More Privacy Loss. (arXiv:2205.13710v1 [cs.LG])
    A central issue in machine learning is how to train models on sensitive user data. Industry has widely adopted a simple algorithm: Stochastic Gradient Descent with noise (a.k.a. Stochastic Gradient Langevin Dynamics). However, foundational theoretical questions about this algorithm's privacy loss remain open -- even in the seemingly simple setting of smooth convex losses over a bounded domain. Our main result resolves these questions: for a large range of parameters, we characterize the differential privacy up to a constant factor. This result reveals that all previous analyses for this setting have the wrong qualitative behavior. Specifically, while previous privacy analyses increase ad infinitum in the number of iterations, we show that after a small burn-in period, running SGD longer leaks no further privacy. Our analysis departs completely from previous approaches based on fast mixing, instead using techniques based on optimal transport (namely, Privacy Amplification by Iteration) and the Sampled Gaussian Mechanism (namely, Privacy Amplification by Sampling). Our techniques readily extend to other settings, e.g., strongly convex losses, non-uniform stepsizes, arbitrary batch sizes, and random or cyclic choice of batches.
    DP-PCA: Statistically Optimal and Differentially Private PCA. (arXiv:2205.13709v1 [cs.LG])
    We study the canonical statistical task of computing the principal component from $n$ i.i.d.~data in $d$ dimensions under $(\varepsilon,\delta)$-differential privacy. Although extensively studied in literature, existing solutions fall short on two key aspects: ($i$) even for Gaussian data, existing private algorithms require the number of samples $n$ to scale super-linearly with $d$, i.e., $n=\Omega(d^{3/2})$, to obtain non-trivial results while non-private PCA requires only $n=O(d)$, and ($ii$) existing techniques suffer from a non-vanishing error even when the randomness in each data point is arbitrarily small. We propose DP-PCA, which is a single-pass algorithm that overcomes both limitations. It is based on a private minibatch gradient ascent method that relies on {\em private mean estimation}, which adds minimal noise required to ensure privacy by adapting to the variance of a given minibatch of gradients. For sub-Gaussian data, we provide nearly optimal statistical error rates even for $n=\tilde O(d)$. Furthermore, we provide a lower bound showing that sub-Gaussian style assumption is necessary in obtaining the optimal error rate.
    Hazard Gradient Penalty for Survival Analysis. (arXiv:2205.13717v1 [cs.LG])
    Survival analysis appears in various fields such as medicine, economics, engineering, and business. Recent studies showed that the Ordinary Differential Equation (ODE) modeling framework unifies many existing survival models and is flexible and widely applicable. However, naively applying the ODE framework to survival analysis problems may model a fiercely changing density function, which may worsen the model's performance. Though we can apply L1 or L2 regularizers to the ODE model, their effect on the ODE modeling framework is barely known. In this paper, we propose the hazard gradient penalty (HGP) to enhance the performance of a survival analysis model. Our method imposes constraints on local data points by regularizing the gradient of the hazard function with respect to the data point. Our method applies to any survival analysis model, including the ODE modeling framework, and is easy to implement. We theoretically show that our method is related to minimizing the KL divergence between the density function at a data point and that of the neighborhood points. Experimental results on three public benchmarks show that our approach outperforms other regularization methods.
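    A minimal sketch of the penalty as the abstract describes it, assuming a PyTorch hazard model hazard_fn(x, t):

    ```python
    import torch

    def hazard_gradient_penalty(hazard_fn, x, t):
        # Penalize the squared gradient of the hazard with respect to the
        # covariates x; hazard_fn(x, t) is an assumed model for this sketch.
        x = x.clone().requires_grad_(True)
        h = hazard_fn(x, t)
        grads, = torch.autograd.grad(h.sum(), x, create_graph=True)
        return grads.pow(2).sum(dim=-1).mean()

    # total_loss = negative_log_likelihood + lam * hazard_gradient_penalty(...)
    ```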
    Auto-PINN: Understanding and Optimizing Physics-Informed Neural Architecture. (arXiv:2205.13748v1 [cs.LG])
    Physics-informed neural networks (PINNs) are revolutionizing science and engineering practice by bringing the power of deep learning to bear on scientific computation. In forward modeling problems, PINNs are meshless partial differential equation (PDE) solvers that can handle irregular, high-dimensional physical domains. Naturally, the neural architecture hyperparameters have a large impact on the efficiency and accuracy of the PINN solver. However, this remains an open and challenging problem because of the large search space and the difficulty of identifying a proper search objective for PDEs. Here, we propose Auto-PINN, the first systematic, automated hyperparameter optimization approach for PINNs, which applies Neural Architecture Search (NAS) techniques to PINN design. Auto-PINN avoids manually or exhaustively searching the hyperparameter space associated with PINNs. A comprehensive set of pre-experiments using standard PDE benchmarks allows us to probe the structure-performance relationship in PINNs. We find that the different hyperparameters can be decoupled, and that the training loss function of PINNs is a good search objective. Comparison experiments with baseline methods demonstrate that Auto-PINN produces neural architectures with superior stability and accuracy over alternative baselines.
    Block-coordinate Frank-Wolfe algorithm and convergence analysis for semi-relaxed optimal transport problem. (arXiv:2205.13766v1 [cs.LG])
    The optimal transport (OT) problem has been used widely in machine learning. Computing an OT problem requires solving a linear program with tight mass-conservation constraints, and these constraints prevent its application to large-scale problems. To address this issue, loosening such constraints yields the relaxed-OT method, which admits a faster algorithm and has demonstrated its effectiveness in applications. However, it remains slow. As a superior alternative, we propose a fast block-coordinate Frank-Wolfe (BCFW) algorithm for a convex semi-relaxed OT. Specifically, we prove upper bounds on its worst-case iteration complexity, and the equivalence between the linearization duality gap and the Lagrangian duality gap. Additionally, we develop two fast variants of the proposed BCFW. Numerical experiments demonstrate that our proposed algorithms are effective for color transfer and surpass state-of-the-art algorithms. This report presents a short version of arXiv:2103.05857.
    Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap. (arXiv:2203.13457v2 [cs.LG] UPDATED)
    Recently, contrastive learning has risen to be a promising approach for large-scale self-supervised learning. However, theoretical understanding of how it works is still unclear. In this paper, we propose a new guarantee on the downstream performance without resorting to the conditional independence assumption that is widely adopted in previous work but hardly holds in practice. Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations, thus simply aligning the positive samples (augmented views of the same sample) could make contrastive learning cluster intra-class samples together. Based on this augmentation overlap perspective, theoretically, we obtain asymptotically closed bounds for downstream performance under weaker assumptions, and empirically, we propose an unsupervised model selection metric ARC that aligns well with downstream accuracy. Our theory suggests an alternative understanding of contrastive learning: the role of aligning positive samples is more like a surrogate task than an ultimate goal, and the overlapped augmented views (i.e., the chaos) create a ladder for contrastive learning to gradually learn class-separated representations. The code for computing ARC is available at https://github.com/zhangq327/ARC.
    Transformer for Partial Differential Equations' Operator Learning. (arXiv:2205.13671v1 [cs.LG])
    Data-driven learning of partial differential equations' solution operators has recently emerged as a promising paradigm for approximating the underlying solutions. The solution operators are usually parameterized by deep learning models that are built upon problem-specific inductive biases. An example is a convolutional or graph neural network that exploits the local grid structure where functions' values are sampled. The attention mechanism, on the other hand, provides a flexible way to implicitly exploit the patterns within inputs, and furthermore, the relationship between arbitrary query locations and inputs. In this work, we present an attention-based framework for data-driven operator learning, which we term Operator Transformer (OFormer). Our framework is built upon self-attention, cross-attention, and a set of point-wise multilayer perceptrons (MLPs), and thus it makes few assumptions on the sampling pattern of the input function or query locations. We show that the proposed framework is competitive on standard benchmark problems and can flexibly be adapted to randomly sampled inputs.
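    A toy sketch of the cross-attention readout idea, with illustrative layer sizes rather than the paper's exact architecture: query coordinates attend to encoded input-function samples, so neither needs to lie on a regular grid.

    ```python
    import torch
    import torch.nn as nn

    class CrossAttnReadout(nn.Module):
        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            self.embed_q = nn.Linear(2, d_model)    # 2-D query coordinates
            self.embed_kv = nn.Linear(3, d_model)   # sample coords + function value
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                      nn.Linear(d_model, 1))

        def forward(self, query_xy, sample_xyv):
            q = self.embed_q(query_xy)       # (batch, n_queries, d_model)
            kv = self.embed_kv(sample_xyv)   # (batch, n_samples, d_model)
            out, _ = self.attn(q, kv, kv)
            return self.head(out)            # predicted solution at the queries
    ```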
    Mixed Federated Learning: Joint Decentralized and Centralized Learning. (arXiv:2205.13655v1 [cs.LG])
    Federated learning (FL) enables learning from decentralized privacy-sensitive data, with computations on raw data confined to take place at edge clients. This paper introduces mixed FL, which incorporates an additional loss term calculated at the coordinating server (while maintaining FL's private data restrictions). There are numerous benefits. For example, additional datacenter data can be leveraged to jointly learn from centralized (datacenter) and decentralized (federated) training data and better match an expected inference data distribution. Mixed FL also enables offloading some intensive computations (e.g., embedding regularization) to the server, greatly reducing communication and client computation load. For these and other mixed FL use cases, we present three algorithms: PARALLEL TRAINING, 1-WAY GRADIENT TRANSFER, and 2-WAY GRADIENT TRANSFER. We state convergence bounds for each, and give intuition on which are suited to particular mixed FL problems. Finally we perform extensive experiments on three tasks, demonstrating that mixed FL can blend training data to achieve an oracle's accuracy on an inference distribution, and can reduce communication and computation overhead by over 90%. Our experiments confirm theoretical predictions of how algorithms perform under different mixed FL problem settings.
    Fight Poison with Poison: Detecting Backdoor Poison Samples via Decoupling Benign Correlations. (arXiv:2205.13616v1 [cs.LG])
    In this work, we study poison sample detection for defending against backdoor poisoning attacks on deep neural networks (DNNs). A principled idea underlying prior art on this problem is to utilize the backdoored models' distinguishable behaviors on poison and clean populations to distinguish between these two populations and remove the identified poison. Many prior methods build their detectors upon a latent separability assumption, which states that backdoored models trained on the poisoned dataset will learn separable latent representations for backdoor and clean samples. Although such separation behaviors empirically exist for many existing attacks, there is no control over the separability, and the extent of separation can vary greatly across different poison strategies, datasets, and training configurations of backdoored models. Worse still, recent adaptive poison strategies can greatly reduce these "distinguishable behaviors" and consequently render most prior methods less effective (or make them fail completely). We point out that these limitations stem directly from the passive reliance on distinguishable behaviors that are not controlled by defenders. To mitigate such limitations, we propose the idea of active defense: rather than passively assuming backdoored models will have certain distinguishable behaviors on poison and clean samples, we propose to actively enforce the trained models to behave differently on these two populations. Specifically, we introduce confusion training as a concrete instance of active defense.
    Asymptotic Convergence Rate and Statistical Inference for Stochastic Sequential Quadratic Programming. (arXiv:2205.13687v1 [math.OC])
    We apply a stochastic sequential quadratic programming (StoSQP) algorithm to solve constrained nonlinear optimization problems, where the objective is stochastic and the constraints are deterministic. We study a fully stochastic setup, where only a single sample is available in each iteration for estimating the gradient and Hessian of the objective. We allow StoSQP to select a random stepsize $\bar{\alpha}_t$ adaptively, such that $\beta_t\leq \bar{\alpha}_t \leq \beta_t+\chi_t$, where $\beta_t$, $\chi_t=o(\beta_t)$ are prespecified deterministic sequences. We also allow StoSQP to solve Newton system inexactly via randomized iterative solvers, e.g., with the sketch-and-project method; and we do not require the approximation error of inexact Newton direction to vanish. For this general StoSQP framework, we establish the asymptotic convergence rate for its last iterate, with the worst-case iteration complexity as a byproduct; and we perform statistical inference. In particular, with proper decaying $\beta_t,\chi_t$, we show that: (i) the StoSQP scheme can take at most $O(1/\epsilon^4)$ iterations to achieve $\epsilon$-stationarity; (ii) asymptotically and almost surely, $\|(x_t -x^\star, \lambda_t - \lambda^\star)\| = O(\sqrt{\beta_t\log(1/\beta_t)})+O(\chi_t/\beta_t)$, where $(x_t,\lambda_t)$ is the primal-dual StoSQP iterate; (iii) the sequence $1/\sqrt{\beta_t}\cdot (x_t -x^\star, \lambda_t - \lambda^\star)$ converges to a mean zero Gaussian distribution with a nontrivial covariance matrix. Moreover, we establish the Berry-Esseen bound for $(x_t, \lambda_t)$ to measure quantitatively the convergence of its distribution function. We also provide a practical estimator for the covariance matrix, from which the confidence intervals of $(x^\star, \lambda^\star)$ can be constructed using iterates $\{(x_t,\lambda_t)\}_t$. Our theorems are validated using nonlinear problems in CUTEst test set.  ( 2 min )
    Safety Aware Changepoint Detection for Piecewise i.i.d. Bandits. (arXiv:2205.13689v1 [cs.LG])
    In this paper, we consider the setting of piecewise i.i.d. bandits under a safety constraint. In this piecewise i.i.d. setting, there exists a finite number of changepoints where the mean of some or all arms change simultaneously. We introduce the safety constraint studied in \citet{wu2016conservative} to this setting such that at any round the cumulative reward is above a constant factor of the default action reward. We propose two actively adaptive algorithms for this setting that satisfy the safety constraint, detect changepoints, and restart without the knowledge of the number of changepoints or their locations. We provide regret bounds for our algorithms and show that the bounds are comparable to their counterparts from the safe bandit and piecewise i.i.d. bandit literature. We also provide the first matching lower bounds for this setting. Empirically, we show that our safety-aware algorithms perform similarly to the state-of-the-art actively adaptive algorithms that do not satisfy the safety constraint.  ( 2 min )
    Contextual Adapters for Personalized Speech Recognition in Neural Transducers. (arXiv:2205.13660v1 [cs.CL])
    Personal rare word recognition in end-to-end Automatic Speech Recognition (E2E ASR) models is a challenge due to the lack of training data. A standard way to address this issue is with shallow fusion methods at inference time. However, due to their dependence on external language models and the deterministic approach to weight boosting, their performance is limited. In this paper, we propose training neural contextual adapters for personalization in neural transducer based ASR models. Our approach can not only bias towards user-defined words, but also has the flexibility to work with pretrained ASR models. Using an in-house dataset, we demonstrate that contextual adapters can be applied to any general purpose pretrained ASR model to improve personalization. Our method outperforms shallow fusion, while retaining functionality of the pretrained models by not altering any of the model weights. We further show that the adapter style training is superior to full-fine-tuning of the ASR models on datasets with user-defined content.  ( 2 min )
    SeedGNN: Graph Neural Networks for Supervised Seeded Graph Matching. (arXiv:2205.13679v1 [cs.LG])
    Recently, there has been significant interest in designing Graph Neural Networks (GNNs) for seeded graph matching, which aims to match two (unlabeled) graphs using only topological information and a small set of seeds. However, most previous GNN architectures for seeded graph matching employ a semi-supervised approach, which learns from only the seed set in a single pair of graphs, and therefore does not attempt to learn from many training examples/graphs to best match future unseen graphs. In contrast, this paper is the first to propose a supervised approach for seeded graph matching, which had so far only been used for seedless graph matching. Our proposed SeedGNN architecture employs a number of novel design choices that are inspired by theoretical studies of seeded graph matching. First, SeedGNN can easily learn the capability of counting and using witnesses of different hops, in a way that generalizes to graphs of different sizes. Second, SeedGNN can use easily-matched pairs as new seeds to percolate and match other nodes. We evaluate SeedGNN on both synthetic and real graphs, and demonstrate significant performance improvement over both non-learning and learning algorithms in the existing literature. Further, our experiments confirm that the knowledge learned by SeedGNN from training graphs can be generalized to test graphs with different sizes and categories.  ( 2 min )
    Consistent and fast inference in compartmental models of epidemics using Poisson Approximate Likelihoods. (arXiv:2205.13602v1 [stat.ME])
    Addressing the challenge of scaling-up epidemiological inference to complex and heterogeneous models, we introduce Poisson Approximate Likelihood (PAL) methods. In contrast to the popular ODE approach to compartmental modelling, in which a large population limit is used to motivate a deterministic model, PALs are derived from approximate filtering equations for finite-population, stochastic compartmental models, and the large population limit drives the consistency of maximum PAL estimators. Our theoretical results appear to be the first likelihood-based parameter estimation consistency results applicable across a broad class of partially observed stochastic compartmental models. Compared to simulation-based methods such as Approximate Bayesian Computation and Sequential Monte Carlo, PALs are simple to implement, involving only elementary arithmetic operations and no tuning parameters; and fast to evaluate, requiring no simulation from the model and having computational cost independent of population size. Through examples, we demonstrate how PALs can be: embedded within Delayed Acceptance Particle Markov Chain Monte Carlo to facilitate Bayesian inference; used to fit an age-structured model of influenza, taking advantage of automatic differentiation in Stan; and applied to calibrate a spatial meta-population model of measles.  ( 2 min )
    Approximate Q-learning and SARSA(0) under the $\epsilon$-greedy Policy: a Differential Inclusion Analysis. (arXiv:2205.13617v1 [cs.LG])
    Q-learning and SARSA(0) with linear function approximation, under $\epsilon$-greedy exploration, are leading methods to estimate the optimal policy in Reinforcement Learning (RL). It has been empirically known that the discontinuous nature of the greedy policies causes these algorithms to exhibit complex phenomena such as i.) instability, ii.) policy oscillation and chattering, iii.) multiple attractors, and iv.) worst policy convergence. However, the literature lacks a formal recipe to explain these behaviors and this has been a long-standing open problem (Sutton, 1999). Our work addresses this by building the necessary mathematical framework using stochastic recursive inclusions and Differential Inclusions (DIs). From this novel viewpoint, our main result states that these approximate algorithms asymptotically converge to suitable invariant sets of DIs instead of differential equations, as is common elsewhere in RL. Furthermore, the nature of these deterministic DIs completely governs the limiting behaviors of these algorithms.  ( 2 min )
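    For reference, one step of the approximate algorithm under study, with phi as an assumed feature map; the argmax inside the eps-greedy policy is the source of the discontinuity the paper analyzes.

    ```python
    import numpy as np

    def q_learning_step(theta, phi, s, a, r, s_next, actions, alpha, gamma, eps, rng):
        # Linear Q-learning update: Q(s, a) = phi(s, a) @ theta.
        q_next = np.array([phi(s_next, b) @ theta for b in actions])
        td_error = r + gamma * q_next.max() - phi(s, a) @ theta
        theta = theta + alpha * td_error * phi(s, a)
        # eps-greedy action selection: the discontinuous greedy argmax.
        if rng.random() < eps:
            a_next = actions[rng.integers(len(actions))]
        else:
            a_next = actions[int(q_next.argmax())]
        return theta, a_next
    ```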
    Incorporating the Barzilai-Borwein Adaptive Step Size into Subgradient Methods for Deep Network Training. (arXiv:2205.13711v1 [cs.LG])
    In this paper, we incorporate the Barzilai-Borwein step size into gradient descent methods used to train deep networks. This allows us to adapt the learning rate using a two-point approximation to the secant equation which quasi-Newton methods are based upon. Moreover, the adaptive learning rate method presented here is quite general in nature and can be applied to widely used gradient descent approaches such as Adagrad and RMSprop. We evaluate our method using standard example network architectures on widely available datasets and compare against alternatives elsewhere in the literature. In our experiments, our adaptive learning rate shows a smoother and faster convergence than that exhibited by the alternatives, with better or comparable performance.  ( 2 min )
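    A minimal sketch of plain gradient descent with the BB1 step size $\alpha_k = (s^\top s)/(s^\top y)$, where $s = w_k - w_{k-1}$ and $y = g_k - g_{k-1}$ form the two-point secant pair; the safeguard fallback below is an assumption of this sketch.

    ```python
    import numpy as np

    def bb_gradient_descent(grad, w, steps=100, alpha0=1e-3):
        # First step uses a fixed learning rate to obtain the initial pair.
        g_prev, w_prev = grad(w), w.copy()
        w = w - alpha0 * g_prev
        for _ in range(steps):
            g = grad(w)
            s, y = w - w_prev, g - g_prev
            denom = s @ y
            alpha = (s @ s) / denom if denom > 1e-12 else alpha0  # safeguard
            w_prev, g_prev = w.copy(), g
            w = w - alpha * g
        return w
    ```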
    A Unified Analysis of Federated Learning with Arbitrary Client Participation. (arXiv:2205.13648v1 [cs.LG])
    Federated learning (FL) faces challenges of intermittent client availability and computation/communication efficiency. As a result, only a small subset of clients can participate in FL at a given time. It is important to understand how partial client participation affects convergence, but most existing works have either considered idealized participation patterns or obtained results with non-zero optimality error for generic patterns. In this paper, we provide a unified convergence analysis for FL with arbitrary client participation. We first introduce a generalized version of federated averaging (FedAvg) that amplifies parameter updates at an interval of multiple FL rounds. Then, we present a novel analysis that captures the effect of client participation in a single term. By analyzing this term, we obtain convergence upper bounds for a wide range of participation patterns, including both non-stochastic and stochastic cases, which match either the lower bound of stochastic gradient descent (SGD) or the state-of-the-art results in specific settings. We also discuss various insights, recommendations, and experimental results.  ( 2 min )
    Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures. (arXiv:2205.13647v1 [cs.LG])
    This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.  ( 2 min )
    fakeWeather: Adversarial Attacks for Deep Neural Networks Emulating Weather Conditions on the Camera Lens of Autonomous Systems. (arXiv:2205.13807v1 [cs.LG])
    Recently, Deep Neural Networks (DNNs) have achieved remarkable performance in many applications, while several studies have exposed their vulnerability to malicious attacks. In this paper, we emulate the effects of natural weather conditions to introduce plausible perturbations that mislead DNNs. By observing the effects of such atmospheric perturbations on camera lenses, we model the patterns to create different masks that fake the effects of rain, snow, and hail. Even though the perturbations introduced by our attacks are visible, their presence remains unnoticed due to their association with natural events, which can be especially catastrophic for fully-autonomous and unmanned vehicles. We test our proposed fakeWeather attacks on multiple Convolutional Neural Network and Capsule Network models, and report noticeable accuracy drops in the presence of such adversarial perturbations. Our work introduces a new security threat for DNNs, which is especially severe for safety-critical applications and autonomous systems.  ( 2 min )
    A Hybrid Neural Autoencoder for Sensory Neuroprostheses and Its Applications in Bionic Vision. (arXiv:2205.13623v1 [cs.LG])
    Sensory neuroprostheses are emerging as a promising technology to restore lost sensory function or augment human capacities. However, sensations elicited by current devices often appear artificial and distorted. Although current models can often predict the neural or perceptual response to an electrical stimulus, an optimal stimulation strategy solves the inverse problem: what is the required stimulus to produce a desired response? Here we frame this as an end-to-end optimization problem, where a deep neural network encoder is trained to invert a known, fixed forward model that approximates the underlying biological system. As a proof of concept, we demonstrate the effectiveness of our hybrid neural autoencoder (HNA) on the use case of visual neuroprostheses. We found that HNA is able to produce high-fidelity stimuli from the MNIST and COCO datasets that outperform conventional encoding strategies and surrogate techniques across all tested conditions. Overall this is an important step towards the long-standing challenge of restoring high-quality vision to people living with incurable blindness and may prove a promising solution for a variety of neuroprosthetic technologies.  ( 2 min )
    Error Bound of Empirical $\ell_2$ Risk Minimization for Noisy Standard and Generalized Phase Retrieval Problems. (arXiv:2205.13827v1 [stat.ML])
    A noisy generalized phase retrieval (NGPR) problem refers to a problem of estimating $x_0 \in \mathbb{C}^d$ from noisy quadratic samples $\big\{x_0^*A_kx_0+\eta_k\big\}_{k=1}^n$, where $A_k$ is a Hermitian matrix and $\eta_k$ is a noise scalar. When $A_k=\alpha_k\alpha_k^*$ for some $\alpha_k\in\mathbb{C}^d$, it reduces to a standard noisy phase retrieval (NPR) problem. The main aim of this paper is to study the estimation performance of empirical $\ell_2$ risk minimization in both problems when $A_k$ in NGPR, or $\alpha_k$ in NPR, is drawn from a sub-Gaussian distribution. Under different kinds of noise patterns, we establish error bounds that can imply approximate reconstruction, and these results are new in the literature. In NGPR, we show the bounds are of $O\big(\frac{||\eta||}{\sqrt{n}}\big)$ and $O\big(||\eta||_\infty \sqrt{\frac{d}{n}}\big)$ for general noise, and of $O\big(\sqrt{\frac{d\log n}{n}}\big)$ and $O\big(\sqrt{\frac{d(\log n)^2}{n}}\big)$ for random noise with sub-Gaussian and sub-exponential tail respectively, where $\| \eta \|$ and $\| \eta \|_{\infty}$ are the 2-norm and sup-norm of the noise vector $\eta=(\eta_k)$. Under heavy-tailed noise, by truncating response outliers we propose a robust estimator that possesses an error bound with a slower convergence rate. On the other hand, we show that in NPR the bound is of $O\big(\sqrt{\frac{d\log n}{n}}\big)$ and $O\big(\sqrt{\frac{d(\log n)^2}{n}}\big)$ for sub-Gaussian and sub-exponential noise respectively, which is essentially tighter than the existing bound $O\big(\frac{||\eta||_2}{\sqrt{n}}\big)$. Although NGPR involving a measurement matrix $A_k$ is more computationally demanding than NPR involving a measurement vector $\alpha_k$, our results reveal that NGPR exhibits stronger robustness than NPR under biased and deterministic noise. Experimental results are presented to confirm and demonstrate our theoretical findings.
    TraClets: Harnessing the power of computer vision for trajectory classification. (arXiv:2205.13880v1 [cs.CV])
    Due to the advent of new mobile devices and tracking sensors in recent years, huge amounts of data are being produced every day. Therefore, novel methodologies need to emerge that dive through this vast sea of information and generate insights and meaningful information. To this end, researchers have developed several trajectory classification algorithms over the years that are able to annotate tracking data. Similarly, in this research, a novel methodology is presented that exploits image representations of trajectories, called TraClets, in order to classify trajectories in an intuitive, human-like way through computer vision techniques. Several real-world datasets are used to evaluate the proposed approach and compare its classification performance to other state-of-the-art trajectory classification algorithms. Experimental results demonstrate that TraClets achieves a classification performance that is comparable to, or in most cases better than, the state-of-the-art, acting as a universal, high-accuracy approach for trajectory classification.
    Explaining Preferences with Shapley Values. (arXiv:2205.13662v1 [stat.ML])
    While preference modelling is becoming one of the pillars of machine learning, the problem of preference explanation remains challenging and underexplored. In this paper, we propose \textsc{Pref-SHAP}, a Shapley value-based model explanation framework for pairwise comparison data. We derive the appropriate value functions for preference models and further extend the framework to model and explain \emph{context specific} information, such as the surface type in a tennis game. To demonstrate the utility of \textsc{Pref-SHAP}, we apply our method to a variety of synthetic and real-world datasets and show that richer and more insightful explanations can be obtained over the baseline.  ( 2 min )
    FedAvg with Fine Tuning: Local Updates Lead to Representation Learning. (arXiv:2205.13692v1 [cs.LG])
    The Federated Averaging (FedAvg) algorithm, which consists of alternating between a few local stochastic gradient updates at client nodes followed by a model averaging update at the server, is perhaps the most commonly used method in Federated Learning. Notwithstanding its simplicity, several empirical studies have illustrated that the output model of FedAvg, after a few fine-tuning steps, generalizes well to new unseen tasks. This surprising performance of such a simple method, however, is not fully understood from a theoretical point of view. In this paper, we formally investigate this phenomenon in the multi-task linear representation setting. We show that the reason behind the generalizability of FedAvg's output is its power in learning the common data representation among the clients' tasks, by leveraging the diversity among client data distributions via local updates. We formally establish the iteration complexity required by the clients to prove this result in the setting where the underlying shared representation is a linear map. To the best of our knowledge, this is the first such result for any setting. We also provide empirical evidence demonstrating FedAvg's representation learning ability in federated image classification with heterogeneous data.
    Efficient Approximation of Gromov-Wasserstein Distance using Importance Sparsification. (arXiv:2205.13573v1 [cs.LG])
    As a valid metric defined between metric-measure spaces, the Gromov-Wasserstein (GW) distance has shown potential for matching problems on structured data like point clouds and graphs. However, its application in practice is limited due to its high computational complexity. To overcome this challenge, we propose a novel importance sparsification method, called Spar-GW, to approximate the GW distance efficiently. In particular, instead of considering a dense coupling matrix, our method leverages a simple but effective sampling strategy to construct a sparse coupling matrix and update it with few computations. We demonstrate that the proposed Spar-GW method is applicable to the GW distance with arbitrary ground cost, and it reduces the complexity from $\mathcal{O}(n^4)$ to $\mathcal{O}(n^{2+\delta})$ for an arbitrarily small $\delta>0$. In addition, this method can be extended to approximate the variants of GW distance, including the entropic GW distance, the fused GW distance, and the unbalanced GW distance. Experiments show the superiority of our Spar-GW to state-of-the-art methods in both synthetic and real-world tasks.  ( 2 min )
    Learning in Feedback-driven Recurrent Spiking Neural Networks using full-FORCE Training. (arXiv:2205.13585v1 [cs.AI])
    Feedback-driven recurrent spiking neural networks (RSNNs) are powerful computational models that can mimic dynamical systems. However, the presence of a feedback loop from the readout to the recurrent layer de-stabilizes the learning mechanism and prevents it from converging. Here, we propose a supervised training procedure for RSNNs, where a second network is introduced only during training to provide hints about the target dynamics. The proposed training procedure consists of generating targets for both recurrent and readout layers (i.e., for a full RSNN system). It uses the recursive least squares-based First-Order and Reduced Control Error (FORCE) algorithm to fit the activity of each layer to its target. The proposed full-FORCE training procedure reduces the amount of modifications needed to keep the error between the output and target close to zero. These modifications control the feedback loop, which causes the training to converge. We demonstrate the improved performance and noise robustness of the proposed full-FORCE training procedure in modeling 8 dynamical systems using RSNNs with leaky integrate-and-fire (LIF) neurons and rate coding. For energy-efficient hardware implementation, an alternative time-to-first-spike (TTFS) coding is implemented for the full-FORCE training procedure. Compared to rate coding, full-FORCE with TTFS coding generates fewer spikes and facilitates faster convergence to the target dynamics.  ( 2 min )
    Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment. (arXiv:2205.13566v1 [cs.LG])
    Multi-armed bandit (MAB) is a classic model for understanding the exploration-exploitation trade-off. The traditional MAB model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok and YouTube Shorts, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage the users. To understand the exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A where "A" stands for abandonment and the abandonment probability depends on the current recommended item and the user's past experience (called state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic) when the user likes the previous recommended item and less exploration (being pessimistic) when the user does not like the previous item. We prove that both ULCB and KL-ULCB achieve logarithmic regret, $O(\log K)$, where $K$ is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results confirm our theoretical analysis and show that the proposed algorithms have significantly lower regrets than the traditional UCB and KL-UCB, and Q-learning-based algorithms.  ( 2 min )
    Evolution of beliefs in social networks. (arXiv:2205.13587v1 [cs.LG])
    The evolution of a society's beliefs is a product of interactions between people in the society (horizontal transmission) and transmission over generations (vertical transmission). Researchers have studied both horizontal and vertical transmission separately. Extending prior work, we propose a new theoretical framework which allows application of tools from Markov chain theory to the analysis of belief evolution via horizontal and vertical transmission. We analyze three cases: a static network, a randomly changing network, and a homophily-based dynamic network. Whereas the former two assume the network structure is independent of beliefs, the latter assumes that people tend to communicate with those who have similar beliefs. We prove under general conditions that both static and randomly changing networks converge to a single set of beliefs among all individuals, along with the rate of convergence. We prove that homophily-based network structures do not in general converge to a single set of beliefs shared by all, and prove lower bounds on the number of different limiting beliefs as a function of the initial beliefs. We conclude by discussing implications for prior theories and directions for future work.  ( 2 min )
    Learning Dialogue Representations from Consecutive Utterances. (arXiv:2205.13568v1 [cs.CL])
    Learning high-quality dialogue representations is essential for solving a variety of dialogue-oriented tasks, especially considering that dialogue systems often suffer from data scarcity. In this paper, we introduce Dialogue Sentence Embedding (DSE), a self-supervised contrastive learning method that learns effective dialogue representations suitable for a wide range of dialogue tasks. DSE learns from dialogues by taking consecutive utterances of the same dialogue as positive pairs for contrastive learning. Despite its simplicity, DSE achieves significantly better representation capability than other dialogue representation and universal sentence representation models. We evaluate DSE on five downstream dialogue tasks that examine dialogue representation at different semantic granularities. Experiments in few-shot and zero-shot settings show that DSE outperforms baselines by a large margin. For example, it achieves a 13% average performance improvement over the strongest unsupervised baseline in 1-shot intent classification on 6 datasets. We also provide analyses on the benefits and limitations of our model.  ( 2 min )
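    A minimal sketch of the positive-pair construction with an in-batch InfoNCE-style loss; the encoder and temperature value are assumptions of this sketch, not necessarily the paper's exact training setup.

    ```python
    import torch
    import torch.nn.functional as F

    def dse_contrastive_loss(encoder, utt_a, utt_b, temperature=0.05):
        # (utt_a[i], utt_b[i]) are consecutive utterances of the same dialogue
        # (the positive pair); all other in-batch pairings act as negatives.
        za = F.normalize(encoder(utt_a), dim=-1)   # (batch, dim)
        zb = F.normalize(encoder(utt_b), dim=-1)
        logits = za @ zb.T / temperature           # (batch, batch) similarities
        labels = torch.arange(za.size(0), device=za.device)
        return F.cross_entropy(logits, labels)
    ```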
    Training and Inference on Any-Order Autoregressive Models the Right Way. (arXiv:2205.13554v1 [cs.LG])
    Conditional inference on arbitrary subsets of variables is a core problem in probabilistic inference with important applications such as masked language modeling and image inpainting. In recent years, the family of Any-Order Autoregressive Models (AO-ARMs) -- which includes popular models such as XLNet -- has shown breakthrough performance in arbitrary conditional tasks across a sweeping range of domains. But, in spite of their success, in this paper we identify significant improvements to be made to previous formulations of AO-ARMs. First, we show that AO-ARMs suffer from redundancy in their probabilistic model, i.e., they define the same distribution in multiple different ways. We alleviate this redundancy by training on a smaller set of univariate conditionals that still maintains support for efficient arbitrary conditional inference. Second, we upweight the training loss for univariate conditionals that are evaluated more frequently during inference. Our method leads to improved performance with no compromises on tractability, giving state-of-the-art likelihoods in arbitrary conditional modeling on text (Text8), image (CIFAR10, ImageNet32), and continuous tabular data domains.  ( 2 min )
    Learning black- and gray-box chemotactic PDEs/closures from agent based Monte Carlo simulation data. (arXiv:2205.13545v1 [q-bio.QM])
    We propose a machine learning framework for the data-driven discovery of macroscopic chemotactic Partial Differential Equations (PDEs) -- and the closures that lead to them -- from high-fidelity, individual-based stochastic simulations of E. coli bacterial motility. The fine-scale, detailed, hybrid (continuum - Monte Carlo) simulation model embodies the underlying biophysics, and its parameters are informed from experimental observations of individual cells. We exploit Automatic Relevance Determination (ARD) within a Gaussian Process framework for the identification of a parsimonious set of collective observables that parametrize the law of the effective PDEs. Using these observables, in a second step we learn effective, coarse-grained "Keller-Segel class" chemotactic PDEs using machine learning regressors: (a) (shallow) feedforward neural networks and (b) Gaussian Processes. The learned laws can be black-box (when no prior knowledge about the PDE law structure is assumed) or gray-box, when parts of the equation (e.g. the pure diffusion part) are known and "hardwired" in the regression process. We also discuss data-driven corrections (both additive and functional) of analytically known, approximate closures.  ( 2 min )
    Dynamic Network Reconfiguration for Entropy Maximization using Deep Reinforcement Learning. (arXiv:2205.13578v1 [cs.LG])
    A key problem in network theory is how to reconfigure a graph in order to optimize a quantifiable objective. Given the ubiquity of networked systems, such work has broad practical applications in a variety of situations, ranging from drug and material design to telecommunications. The large decision space of possible reconfigurations, however, makes this problem computationally intensive. In this paper, we cast the problem of network rewiring for optimizing a specified structural property as a Markov Decision Process (MDP), in which a decision-maker is given a budget of modifications that are performed sequentially. We then propose a general approach based on the Deep Q-Network (DQN) algorithm and graph neural networks (GNNs) that can efficiently learn strategies for rewiring networks. We then discuss a cybersecurity case study, i.e., an application to the computer network reconfiguration problem for intrusion protection. In a typical scenario, an attacker might have a (partial) map of the system they plan to penetrate; if the network is effectively "scrambled", they would not be able to navigate it since their prior knowledge would become obsolete. This can be viewed as an entropy maximization problem, in which the goal is to increase the surprise of the network. Indeed, entropy acts as a proxy measurement of the difficulty of navigating the network topology. We demonstrate the general ability of the proposed method to obtain better entropy gains than random rewiring on synthetic and real-world graphs while being computationally inexpensive, as well as being able to generalize to larger graphs than those seen during training. Simulations of attack scenarios confirm the effectiveness of the learned rewiring strategies.  ( 2 min )
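    As one concrete instance of such a proxy, here is a sketch computing the Shannon entropy of a graph's degree distribution with networkx; the reward actually used by the RL agent in the paper may be a different entropy measure.

    ```python
    import networkx as nx
    from collections import Counter
    from math import log

    def degree_entropy(G: nx.Graph) -> float:
        # Shannon entropy of the degree distribution: one simple proxy for
        # how hard a topology is to navigate.
        counts = Counter(d for _, d in G.degree())
        n = G.number_of_nodes()
        return -sum((c / n) * log(c / n) for c in counts.values())
    ```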
    Differentially Private Decoding in Large Language Models. (arXiv:2205.13621v1 [cs.CL])
    Recent large-scale natural language processing (NLP) systems use a pre-trained Large Language Model (LLM) on massive and diverse corpora as a headstart. In practice, the pre-trained model is adapted to a wide array of tasks via fine-tuning on task-specific datasets. LLMs, while effective, have been shown to memorize instances of training data thereby potentially revealing private information processed during pre-training. The potential leakage might further propagate to the downstream tasks for which LLMs are fine-tuned. On the other hand, privacy-preserving algorithms usually involve retraining from scratch, which is prohibitively expensive for LLMs. In this work, we propose a simple, easy to interpret, and computationally lightweight perturbation mechanism to be applied to an already trained model at the decoding stage. Our perturbation mechanism is model-agnostic and can be used in conjunction with any LLM. We provide theoretical analysis showing that the proposed mechanism is differentially private, and experimental results showing a privacy-utility trade-off.  ( 2 min )
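    One plausible instantiation of such a decoding-time mechanism, not necessarily the paper's exact one, is to mix the model's next-token distribution with the uniform distribution before sampling.

    ```python
    import torch

    def perturbed_next_token(logits, lam):
        # lam in [0, 1] trades utility against privacy; lam = 1 recovers
        # the unperturbed model, smaller lam flattens the distribution.
        p = torch.softmax(logits, dim=-1)
        uniform = torch.full_like(p, 1.0 / p.size(-1))
        return torch.multinomial(lam * p + (1.0 - lam) * uniform, num_samples=1)
    ```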
    Tensor Program Optimization with Probabilistic Programs. (arXiv:2205.13603v1 [cs.LG])
    Automatic optimization for tensor programs becomes increasingly important as we deploy deep learning in various environments, and efficient optimization relies on a rich search space and an effective search. Most existing efforts adopt a search space that domain experts cannot efficiently grow. This paper introduces MetaSchedule, a domain-specific probabilistic programming language abstraction to construct a rich search space of tensor programs. Our abstraction allows domain experts to analyze the program and easily propose stochastic choices in a modular way to compose program transformations accordingly. We also build an end-to-end learning-driven framework to find an optimized program for a given search space. Experimental results show that MetaSchedule can cover the search space used in the state-of-the-art tensor program optimization frameworks in a modular way. Additionally, it empowers domain experts to conveniently grow the search space and modularly enhance the system, which brings a 48% speedup on end-to-end deep learning workloads.  ( 2 min )
    Denial-of-Service Attack on Object Detection Model Using Universal Adversarial Perturbation. (arXiv:2205.13618v1 [cs.CV])
    Adversarial attacks against deep learning-based object detectors have been studied extensively in the past few years. The proposed attacks aimed solely at compromising the models' integrity (i.e., trustworthiness of the model's prediction), while adversarial attacks targeting the models' availability, a critical aspect in safety-critical domains such as autonomous driving, have not been explored by the machine learning research community. In this paper, we propose NMS-Sponge, a novel approach that negatively affects the decision latency of YOLO, a state-of-the-art object detector, and compromises the model's availability by applying a universal adversarial perturbation (UAP). In our experiments, we demonstrate that the proposed UAP is able to increase the processing time of individual frames by adding "phantom" objects while preserving the detection of the original objects.  ( 2 min )
    Fairness in Recommendation: A Survey. (arXiv:2205.13619v1 [cs.IR])
    As one of the most pervasive applications of machine learning, recommender systems play an important role in assisting human decision making. The satisfaction of users and the interests of platforms are closely related to the quality of the generated recommendation results. However, as highly data-driven systems, recommender systems can be affected by data or algorithmic bias and thus generate unfair results, which can undermine users' trust in these systems. As a result, it is crucial to address potential unfairness problems in recommendation settings. Recently, there has been growing attention on fairness considerations in recommender systems, with more and more literature on approaches to promote fairness in recommendation. However, the studies are rather fragmented and lack a systematic organization, making it difficult for new researchers to enter the domain. This motivates us to provide a systematic survey of existing works on fairness in recommendation. This survey focuses on the foundations of the fairness-in-recommendation literature. It first presents a brief introduction to fairness in basic machine learning tasks such as classification and ranking, in order to provide a general overview of fairness research as well as to introduce the more complex situations and challenges that need to be considered when studying fairness in recommender systems. After that, the survey introduces fairness in recommendation with a focus on the taxonomies of current fairness definitions, the typical techniques for improving fairness, and the datasets for fairness studies in recommendation. The survey also discusses the challenges and opportunities in fairness research, with the hope of promoting the fair recommendation research area and beyond.  ( 2 min )
    Circumventing Backdoor Defenses That Are Based on Latent Separability. (arXiv:2205.13613v1 [cs.LG])
    Deep learning models are vulnerable to backdoor poisoning attacks. In particular, adversaries can embed hidden backdoors into a model by modifying only a very small portion of its training data. On the other hand, it has also been commonly observed that backdoor poisoning attacks tend to leave a tangible signature in the latent space of the backdoored model, i.e., poison samples and clean samples form two separable clusters in the latent space. These observations give rise to the popularity of the latent separability assumption, which states that backdoored DNN models will learn separable latent representations for poison and clean populations. A number of popular defenses (e.g. Spectral Signature, Activation Clustering, SCAn, etc.) are built exactly upon this assumption. However, in this paper, we show that the latent separation can be significantly suppressed by designing adaptive backdoor poisoning attacks with more sophisticated poison strategies, which consequently render state-of-the-art defenses based on this assumption less effective (and often make them fail completely). More interestingly, we find that our adaptive attacks can even evade some other typical backdoor defenses that do not explicitly build on this separability assumption. Our results show that adaptive backdoor poisoning attacks that can breach the latent separability assumption should be seriously considered when evaluating existing and future defenses.  ( 2 min )
    Self-supervised Pretraining and Transfer Learning Enable Flu and COVID-19 Predictions in Small Mobile Sensing Datasets. (arXiv:2205.13607v1 [cs.LG])
    Detailed mobile sensing data from phones, watches, and fitness trackers offer an unparalleled opportunity to quantify and act upon previously unmeasurable behavioral changes in order to improve individual health and accelerate responses to emerging diseases. Unlike in natural language processing and computer vision, deep representation learning has yet to broadly impact this domain, in which the vast majority of research and clinical applications still rely on manually defined features and boosted tree models or even forgo predictive modeling altogether due to insufficient accuracy. This is due to unique challenges in the behavioral health domain, including very small datasets (~10^1 participants) that frequently contain missing data, long time series with critical long-range dependencies (length > 10^4), and extreme class imbalances (>10^3:1).  ( 2 min )
    Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes. (arXiv:2205.13589v1 [cs.LG])
    We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves a $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To our best knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.  ( 2 min )
    Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations. (arXiv:2205.13571v1 [cs.LG])
    Neural networks have achieved tremendous success in a large variety of applications. However, their memory footprint and computational demand can render them impractical in application settings with limited hardware or energy resources. In this work, we propose a novel algorithm to find efficient low-rank subnetworks. Remarkably, these subnetworks are determined and adapted already during the training phase and the overall time and memory resources required by both training and evaluating them is significantly reduced. The main idea is to restrict the weight matrices to a low-rank manifold and to update the low-rank factors rather than the full matrix during training. To derive training updates that are restricted to the prescribed manifold, we employ techniques from dynamic model order reduction for matrix differential equations. Moreover, our method automatically and dynamically adapts the ranks during training to achieve a desired approximation accuracy. The efficiency of the proposed method is demonstrated through a variety of numerical experiments on fully-connected and convolutional networks.  ( 2 min )
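    A minimal sketch of the factorized parameterization, which stores and trains only the low-rank factors; the paper's manifold-projected training updates and dynamic rank adaptation are omitted here.

    ```python
    import torch
    import torch.nn as nn

    class LowRankLinear(nn.Module):
        # Linear layer with the weight constrained to rank r via W = U V^T.
        def __init__(self, d_in, d_out, rank):
            super().__init__()
            self.U = nn.Parameter(torch.randn(d_out, rank) / rank ** 0.5)
            self.V = nn.Parameter(torch.randn(d_in, rank) / rank ** 0.5)

        def forward(self, x):
            return (x @ self.V) @ self.U.T   # never materializes the full W
    ```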
    Pruning has a disparate impact on model accuracy. (arXiv:2205.13574v1 [cs.LG])
    Network pruning is a widely-used compression technique that is able to significantly scale down overparameterized models with minimal loss of accuracy. This paper shows that pruning may create or exacerbate disparate impacts. The paper sheds light on the factors that cause such disparities, suggesting that differences in gradient norms and distance to the decision boundary across groups are responsible for this critical issue. It analyzes these factors in detail, providing both theoretical and empirical support, and proposes a simple, yet effective, solution that mitigates the disparate impacts caused by pruning.  ( 2 min )
    Understanding new tasks through the lens of training data via exponential tilting. (arXiv:2205.13577v1 [cs.LG])
    Deploying machine learning models to new tasks is a major challenge despite the large size of modern training datasets. However, it is conceivable that the training data can be reweighted to be more representative of the new (target) task. We consider the problem of reweighting the training samples to gain insights into the distribution of the target task. Specifically, we formulate a distribution shift model based on the exponential tilt assumption and learn training-data importance weights by minimizing the KL divergence between the labeled training and unlabeled target datasets. The learned weights can then be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection. We demonstrate the efficacy of our method on the Waterbirds and Breeds benchmarks.  ( 2 min )
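    Under the exponential tilt assumption, the importance weights take the form $w(x) \propto \exp(\theta^\top T(x))$; a minimal sketch follows, with feats standing in for an assumed feature map $T$.

    ```python
    import torch

    def tilt_weights(theta, feats):
        # Exponentially tilted importance weights, normalized over the
        # training set; feats = T(x) is an assumed feature map.
        w = torch.exp(feats @ theta)
        return w / w.mean()

    # theta itself would be fit by minimizing a KL-matching objective between
    # the reweighted labeled source features and the unlabeled target features.
    ```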
    Unequal Covariance Awareness for Fisher Discriminant Analysis and Its Variants in Classification. (arXiv:2205.13565v1 [cs.LG])
    Fisher Discriminant Analysis (FDA) is one of the essential tools for feature extraction and classification, and it has motivated the development of many improved techniques that adapt it to different problems or data types. However, none of these approaches make use of the fact that the assumption of equal covariance matrices in FDA is usually not satisfied in practical situations. Therefore, we propose a novel classification rule for FDA that accounts for this fact, mitigating the effect of unequal covariance matrices. Furthermore, since we only modify the classification rule, the same modification can be applied to many FDA variants, improving these algorithms further. Theoretical analysis reveals that the new classification rule allows the implicit use of the class covariance matrices while increasing the number of parameters to be estimated by only a small amount compared to going from FDA to Quadratic Discriminant Analysis. We illustrate our idea via experiments, which show the superior performance of the modified algorithms based on our new classification rule compared to the original ones.  ( 2 min )
    Predictor-corrector algorithms for stochastic optimization under gradual distribution shift. (arXiv:2205.13575v1 [cs.LG])
    Time-varying stochastic optimization problems frequently arise in machine learning practice (e.g. gradual domain shift, object tracking, strategic classification). Although most problems are solved in discrete time, the underlying process is often continuous in nature. We exploit this underlying continuity by developing predictor-corrector algorithms for time-varying stochastic optimization. We provide error bounds for the iterates, both in the presence of exact and noisy access to queries from the relevant derivatives of the loss function. Furthermore, we show (theoretically, and empirically in several examples) that our method outperforms non-predictor-corrector methods that do not exploit the underlying continuous process.  ( 2 min )
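    A generic sketch of one predictor-corrector round, assuming access to the gradient of the current loss; the paper's exact update differs.

    ```python
    import numpy as np

    def predictor_corrector_round(w_prev, w_curr, grad_now, eta=0.1, k=5):
        # Predictor: linearly extrapolate the solution path in time.
        w = w_curr + (w_curr - w_prev)
        # Corrector: refine with a few gradient steps on the current loss.
        for _ in range(k):
            w = w - eta * grad_now(w)
        return w
    ```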

  • Open

    Update on previous post, this actually scared me
    submitted by /u/Varitiuss29 [link] [comments]
    Translate a pdf using gpt3
    submitted by /u/Varitiuss29 [link] [comments]
    Baidu AI Researchers Introduce SE-MoE That Proposes Elastic MoE Training With 2D Prefetch And Fusion Communication Over Hierarchical Storage
    Machine learning and deep learning have gained popularity in domains like computer vision (CV) and natural language processing (NLP), which require analyzing large amounts of data such as images and text; as a result, substantial computational resources are needed for data processing. To address this concern, sparsely activated neural networks based on Mixture-of-Experts (MoE) have been used to train larger models with little or no additional computation while achieving improved training results. To overcome the challenges faced by MoE training, this paper proposes an innovative unified framework for MoE training and inference. The paper's significant contribution is SE-MoE, a novel distributed system capable of scaling MoE models to trillions of parameters and fully exploiting cluster resources, including high-bandwidth memory, CPU memory, and SSDs, to achieve effective training scheduling. Dynamic graph scheduling utilizes an innovative inference approach based on ring memory to overlap computation and communication as much as feasible, resulting in more efficient inference performance without extra machines for larger-scale MoE models. Additionally, SE-MoE uses various methods, such as load balancing, to improve performance without any additional resources. Continue reading | Check out the paper and GitHub submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Software’s Job
    Hi group. Is it possible to use an image of a form to fill in the same form in a web application? That is, the software has to recognize the form's controls and then, through an API or something, fill in the corresponding controls in the web app. Just looking for feedback on whether this is complicated. Thanks, and sorry for my English. submitted by /u/ScallionFunny3560 [link] [comments]  ( 1 min )
    Recursively summarize text of any length with GPT-3
    submitted by /u/DavidKShapiro [link] [comments]
    Doctor Who - 4K Neural Art Exploration
    submitted by /u/MLInsights [link] [comments]
    Self Taught AI engineers
    Is it possible to learn AI through self-study alone? What are the best online courses and resources for mastering AI? submitted by /u/Titan_D [link] [comments]  ( 1 min )
    Hey there. I, eleventacion/very odd, will be posting this AI-generated (text-to-speech) track on my YouTube channel very soon. It's not the best you will hear, but it will be pretty pleasing. Love you all.
    submitted by /u/eleventacion [link] [comments]  ( 1 min )
    Google bans deepfake training in Colab
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 1 min )
    Giant Squid - 4K Neural Art Visualization
    submitted by /u/MLInsights [link] [comments]
    Knightian uncertainty
    Knightian uncertainty is basically uncertainty that we can't quantify. For me, that seems to be one of the topics where artificial intelligence doesn't make much sense conceptually, but clearly a strong majority in the AI field disagrees that AI has limits in that way. How, in the context of AI, do you conceptually make sense of unquantifiable uncertainty, or of related topics that involve unquantifiability and non-representability, such as spaciousness (which is more elusive or loose as a concept rather than representing anything per se)? If you have meditated before, you might also have discovered that the experience of spaciousness can have a profound impact on the way we act in the world, so I don't think it can be ignored as far as behaviour goes either. It seems that if there is to be such a thing as general and conscious AI, these are foundational issues to make sense of in order to get there. After all, the experience of space and uncertainty are core aspects of our being in the world. submitted by /u/bejaq [link] [comments]  ( 3 min )
  • Open

    Benefits of Power BI for Small Businesses
    There is no way one can deny the importance of data in the business scene. Companies of all types and sizes are using mammoth amounts of data on a daily basis. They need to collect data from various sources and then compile and analyze it using specialized applications. This is why BI and data visualization… Read More »Benefits of Power BI for Small Businesses The post Benefits of Power BI for Small Businesses appeared first on Data Science Central.  ( 4 min )
    Best Practices for Implementing Breadcrumb SEO Strategy for Mobiles
    The design and position of breadcrumb navigation on a webpage has long been an established practice. However, as the world shifts to a mobile-first web environment, many website designers are getting it wrong or forgetting to include it in their navigation. Doing this can be a blunder because it… Read More »Best Practices for Implementing Breadcrumb SEO Strategy for Mobiles The post Best Practices for Implementing Breadcrumb SEO Strategy for Mobiles appeared first on Data Science Central.  ( 5 min )
  • Open

    [P] Extract formula from ONNX file possible?
    At the end of building a NN, there is a formula: a mathematical expression that can be evaluated for given inputs, depending on the trained weights. Is it possible to "extract" or "read" that formula from an ONNX file? When I look at it in a text editor, it's not human-readable, but it should be straightforward to evaluate for given inputs; that's how inference works: just evaluate the formula. How can I actually *get* that formula? For a not-so-deep NN, that mathematical expression should be pretty simple. Thanks! submitted by /u/CantFixMoronic [link] [comments]  ( 1 min )
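    One way to get at this (a sketch assuming the official onnx Python package and a local file model.onnx): the ONNX file is a protobuf holding a computation graph rather than a closed-form formula, but you can read the graph node by node and, for a small network, transcribe it into an expression by hand:

        import onnx

        model = onnx.load("model.onnx")                  # parse the protobuf
        print(onnx.helper.printable_graph(model.graph))  # human-readable op listing

        # Each node is one step of the "formula": an op with named inputs/outputs.
        for node in model.graph.node:
            print(node.op_type, list(node.input), "->", list(node.output))

    For a small MLP this prints a chain like MatMul -> Add -> Relu -> MatMul -> Add, which, together with the stored weight initializers, is exactly the formula being asked about.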
    [D] Gwern’s Retrospective on the 2 Year Anniversary of GPT3 Release
    submitted by /u/programmerChilli [link] [comments]
    [D] Any state of the art paper close to Google's paper "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers"?
    I am working on this paper and was wondering if there's any other paper that has used the same approaches vis-a-vis image-text retrieval. This paper uses a combination of Transformers with encoders for image retrieval. I can't find the code implementation of this paper anywhere and was wondering maybe a similar paper will help in that regard. submitted by /u/icelebratefestivus [link] [comments]  ( 1 min )
    [N] American Express Default Prediction ML Competition
    https://www.kaggle.com/competitions/amex-default-prediction/overview submitted by /u/EducationalCicada [link] [comments]
  • Open

    How do you limit high-frequency agent actions when dealing with continuous control?
    I am tuning an SAC agent for a robotics control task. The action space of the agent is a single-dimensional decision in [-1, 1]. I see that very often the agent takes advantage of the fact that the action can be varied at a very high frequency, basically filling up the plot. I've already implemented an incremental version of the agent, where it actually controls a derivative of the control action and the actual action is part of the observation space, which helps a lot with the realism of the robotics problem. Now the problem has been moved one time-derivative lower, and the high-frequency content of the action is the rate of change of the control input. Is there a way to do some reward shaping or some other method to prevent this? I've also tried just straight up adding a penalty term on the absolute value of the action, but it comes with degraded performance. submitted by /u/Speterius [link] [comments]  ( 1 min )
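    One common remedy (a sketch under my own assumptions, using the classic gym API; the lam coefficient is a hypothetical knob to tune): shape the reward with a squared action-rate penalty, so the policy is charged explicitly for chatter rather than for the action's magnitude:

        import numpy as np
        import gym

        class ActionRatePenalty(gym.Wrapper):
            """Subtract lam * ||a_t - a_{t-1}||^2 from the reward each step."""
            def __init__(self, env, lam=0.1):
                super().__init__(env)
                self.lam = lam
                self.prev_action = None

            def reset(self, **kwargs):
                self.prev_action = None
                return self.env.reset(**kwargs)

            def step(self, action):
                obs, reward, done, info = self.env.step(action)
                if self.prev_action is not None:
                    reward -= self.lam * float(np.sum((action - self.prev_action) ** 2))
                self.prev_action = np.array(action, copy=True)
                return obs, reward, done, info

        env = ActionRatePenalty(gym.make("Pendulum-v1"), lam=0.1)

    Penalizing the rate (difference) rather than the absolute value targets the high-frequency content directly, which may hurt performance less than the absolute-value penalty already tried; lam still trades smoothness against tracking.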
    Environments that require long-term memory and reasoning
    Could you recommend me some Reinforcement Learning environments that require long-term memory and reasoning? By long-term memory, I mean environments in which: - Evolution of the environment depends on states or actions experienced in the past, possibly if this dependency is long (i.e. several timesteps) - When re-visiting areas of the state space, these might be different depending on what the agent previously did in those areas Example: a robot does cleaning jobs in a house. When switching between tasks (e.g. cleaning the kitchen, cleaning the bathroom, doing the laundry) it needs to remember where it left the tools used in previous tasks to re-use them in new tasks submitted by /u/fedetask [link] [comments]  ( 1 min )
    Is anyone interested in this project?
    submitted by /u/7NoteDancing [link] [comments]
    Multi-agent RL for different action spaces
    I was wondering which multi-agent RL algorithm would fit the following setting: a robot arm with a gripper, where the arm shall be controlled by one policy and the gripper by another. The action spaces of the two policies are different, but the observation space and the reward function are up to a design choice. For instance, both policies could receive all observations or just local observations. Similarly, both policies could receive the same global reward or individual rewards. Is there a paper comparing these approaches? Thanks for any feedback. submitted by /u/Informal_Temporary91 [link] [comments]  ( 1 min )
    [2205.10316] Seeking entropy: complex behavior from intrinsic motivation to occupy action-state path space
    submitted by /u/chimp73 [link] [comments]  ( 1 min )
    Probabilities in payoff matrix
    Hi guys, I'm trying to understand how I am supposed to define probabilities to calculate (M&A, 1) and the other entries, but I really don't get how. They say to "fix the frequencies pk for the outcome xk, such that the DM is indifferent between xk and the BEST outcome", but I don't get it. Hope you can help me, thanks! submitted by /u/Giorgio_v1 [link] [comments]  ( 1 min )
  • Open

    AI-generated donuts
    If you're going to open a late-night donut shop, you're going to need a unique set of over-the-top donuts to set the proper festive atmosphere. But how to keep the ideas coming? I decided to see what donut ideas I could get using OpenAI's  ( 3 min )
    Bonus: Donuts that will possibly end the planet
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    Detecting Wildfires with Image Analysis
    Just read an article with a firefighter in AK, USA, who said "you don't know about them until there's smoke". It gave me an idea that is either pointless or somewhat meaningful, and I didn't know if it was feasible, so I came here! The basic idea is to use a neural network to detect wildfires at a state-wide or nation-wide scale. The firefighter also said that these natural, lightning-caused fires can be active for 2-5 days before they know about them, and that the delayed response is because they usually happen in remote areas. With many wildfires being caused by humans (maybe not in AK, but in other parts of the country), along with naturally occurring wildfires, would the response be just as slow in less remote states such as California, Oregon, or Washington? If it were much quicker in more populated states, then I don't think a detection system beyond humans confirming or reporting fires would be necessary. But if Alaska ever wanted a better way of detecting fires, then why not use satellites to take pictures every hour or so, send the data back to some neural network or AI that could determine whether there's smoke in the image, and then have a human check whether it's a fire? I understand that clouds would look a lot like smoke, but I'm thinking there's gotta be a way to train an image recognition deep neural network to differentiate the two. If it could do that, the response to wildfires might change from days to hours. It would probably cost a shit ton of money to send satellites out into space, along with the means of transferring that much data from the satellites to a neural network on the ground, but if it saves thousands of acres of wildlife, it could be worth it in future years when it's cheaper to do space stuff. This is purely an idea from someone who knows nothing about all of this. Just an idea. Any discussion is welcome! Thanks for reading :) submitted by /u/foxypablo [link] [comments]  ( 2 min )

  • Open

    [D] Weird trend in machine learning: papers tackling easy problems are well cited
    I don't know if anyone else has encountered this, but I have seen a lot of ML papers with extremely idealized assumptions (sometimes hardly relevant to machine learning), and then a group of researchers, sometimes even from very well-known universities, comes together to "solve" this problem. Despite this, these papers end up quite well-cited compared to the mean, which makes me really confused. I am not sure if the researchers are just unaware of the weakness of their assumptions, as sometimes their work intersects with other fields, which may have much deeper insights. Sometimes it almost reads like a couple of researchers are trying to discover a new field, and in the process wrote a paper together. I'm not mentioning the ML field in question, but I suspect this is a "global" problem. Has anyone else seen a similar trend? FYI, I just read the post https://www.reddit.com/r/MachineLearning/comments/uyratt/d_i_dont_really_trust_papers_out_of_top_labs/ after I wrote this, and what's happening there (solving CIFAR10) is similar to what I'm describing, although the ML field I was thinking of is a bit more mathy. In any case, I'm sure this paper will be extraordinarily well cited. submitted by /u/fromnighttilldawn [link] [comments]  ( 1 min )
    [P] Confidence Intervals for binary classification
    Hello r/machinelearning. The situation: I am currently trying to estimate the accuracy of a machine vision system. The goal is to automate surface inspection in an industrial environment, e.g. detecting scratches and dents on a flat surface. I ran quite a few trials in order to estimate the system's accuracy, but a point estimator isn't worth much without a confidence interval. The literature does not recommend approximating the binomial distribution of the Bernoulli trials I conducted with a normal distribution when the probability of success is near 1 (or 0), because the central limit theorem does not apply well there; instead the Agresti-Coull CI is recommended. The problem: at confidence level 1-alpha = 95%, the upper boundary of my confidence interval exceeds 100%, which strikes me as illogical. The question: can you give me a piece of advice on how to estimate the probability of (in)correct classification? Is the Agresti-Coull interval a good method for constructing a confidence interval with n > 100 trials, and if so, is it possible to get an upper boundary > 100%, or does this result hint at a miscalculation? Appendix: CI = p' +- 2*sqrt(p'(1-p')/n') with p' = k'/n', k' = correct classifications + 2, n' = number of trials + 4 (values for confidence level 95%). https://www.jstor.org/stable/pdf/2685469.pdf?casa_token=0togmzmzFqMAAAAA:KSp8k04k859iUcDfVvJEzMh4y-2a7aheeBpOm5vMB-SNj2z7m8LeOs8C5gmJcZY1tmEhpK3OlA9Sqfav6mvHLhbBDaAhpYK2phXBD3uWGo0ZpdMFjqCo submitted by /u/chkthat [link] [comments]  ( 1 min )
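    For what it's worth, a small sketch of the Agresti-Coull computation (my own code, not from the post): because the half-width is added to an adjusted proportion, the raw interval can indeed poke above 1 when accuracy is near 100%; this is known boundary behavior of the method rather than a miscalculation, and standard practice is simply to clip the interval to [0, 1]:

        from math import sqrt

        def agresti_coull(k, n, z=1.96):
            """Agresti-Coull interval for k successes in n trials, clipped to [0, 1]."""
            n_adj = n + z**2                  # ~ n + 4 for z = 1.96
            p_adj = (k + z**2 / 2) / n_adj    # ~ (k + 2) / (n + 4)
            half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
            return max(0.0, p_adj - half), min(1.0, p_adj + half)

        # 149/150 correct: the raw upper bound exceeds 1 and gets clipped.
        print(agresti_coull(149, 150))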
    [R][P] Gradio Web Demo for HairCLIP: Design Your Hair by Text and Reference Image
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Fairness of comparison of superiority and efficiency between different neural network architectures
    During the last 10 years, deep learning has made impressive progress in various domains, but here I would like to be concrete and focus on computer vision, and in particular ImageNet as a popular benchmark. Computer vision models have evolved a great deal, from vanilla CNNs consisting of only convolutional layers + activations + pooling to more advanced designs with skip connections, depthwise separable convolutions, squeeze-excitation blocks, and, recently, vision transformers and derivative models. There are a lot of papers proposing some architecture change and claiming that at a given number of parameters and FLOPs they achieve the best result, i.e., lie highest on the Pareto front. (Figure from MobileNet V3.) However, the final performance depends not only on how efficient and optimal the design of th…  ( 2 min )
    [R] OnePose can estimate 6D poses of arbitrary household objects without instance/category-specific training or CAD models
    submitted by /u/SpatialComputing [link] [comments]  ( 2 min )
    [P] I reviewed 50+ open-source MLOps tools. Here’s the result
    I spent the past weeks researching the most popular open-source MLOps tools and I would like to share the results with you. I created a website (https://mymlops.com/) listing the tools, explaining when to use each of them and pitfalls to watch out for. You can fill in your stack based on our template. Why did I do it? I feel the current MLOps landscape has amazing tools. But so many of them! Right now, picking a solution feels like a puzzle. It's confusing and incredibly hard to put the pieces together to fit your needs. This is just a small contribution. If you want me to be more helpful, I have a small ask: what are the things you currently struggle with but haven't found guidance on anywhere? Here are some of my thoughts: examples of MLOps stacks (tools that work well together and are popular combinations); comparing tools with code snippets (see tools in action); a stack "cookiecutter" (code templates of tools working together). I would be grateful if you could point me in the right direction. No opinion is too small! 🙂 submitted by /u/Academic_Arrak [link] [comments]  ( 2 min )
    [R] How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
    submitted by /u/koolaidman123 [link] [comments]
    [P] Homemade algorithm for building faces in < 1 minute with 80 images.
    Was inspired by the use of Wasserstein distances to come up with a new method for generating e.g. faces without using neural networks. Raw output + noise reduction. Specifications: training data of around ~80 128x128 (B/W) images; build time: fast; RAM usage: extremely high; parallelizable: yes (step 1 of the algorithm) and no (steps 2 and 3); scalability: poor. Algorithm used: 1) Build a dictionary with a pool of vectors/fragments from the training set: start by sliding over each image with a 9x9 grid (other sizes possible). For each sub fra…  ( 2 min )
    [R] Happy to share our latest Research paper: Federated Learning for Healthcare Domain - Pipeline, Applications and Challenges
    Hello everyone, we are pleased to share our new research paper with the ML community: Federated Learning for Healthcare Domain - Pipeline, Applications and Challenges. Our paper was accepted at ACM Transactions on Computing for Healthcare 2022 and published in the ACM Digital Library. Federated learning (FL) is a novel paradigm that allows deploying large-scale machine learning models trained in different data centres without transferring data. The sensitive and distributed nature of EHR (Electronic Health Records) in real-world scenarios creates a need for an effective mechanism to learn from data residing in health-related institutions and hospitals while accounting for data privacy. This motivates us to examine the potential and value of federated learning in the healthcare domain. …  ( 2 min )
    [R] Guidance: a cheat code for diffusion models (Blog post)
    submitted by /u/hardmaru [link] [comments]
    [D] which small data problems pique your interest
    Motivated by a recent post (can't seem to find it now; maybe from another subreddit) about how DL architecture research is gatekept by intensive computational requirements, I'd like to ask: what are your favorite small-data problems? Why do you find them interesting? submitted by /u/SpookyTardigrade [link] [comments]  ( 1 min )
    [D][R] How to create/tag the dataset for the sentence similarity task?
    Hello everyone, I have a large corpus of domain-specific documents. I have gone through the Quora question pairs dataset and the BIOSSES dataset for reference. I am trying to tag the dataset for different language-understanding tasks, including sentence similarity, but I am having difficulty creating the STS-style dataset for such a task. If I have 2-3 experts in the field, what is the best way to create the dataset? My thoughts: Approach 1: I convert all documents into paragraphs and encode them using USE, ELMo, or domain-specific BERT embeddings; extract the top-10 similar sentences for each sentence (using cosine or Levenshtein distance); a UI shows the expert a sentence and its similar sentences; the expert chooses the best similar sentence for the given sentence, and in the backend both sentences are tagged with label 1 (indicating that they are similar); later, expert two reviews this dataset (tagged by expert one) and rates, on a scale of 0-5, how similar the pairs are. Approach 2: replace random keywords in sentences with synonyms using the domain pre-trained BERT model; show both sentences (the replaced sentence and the original) to the expert, and let the expert rate the pair on a scale of 0-5. I'd love to hear suggestions from members of this subreddit. Any kind of input would be appreciated. submitted by /u/aadityaura [link] [comments]  ( 1 min )
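    A sketch of the candidate-mining step in Approach 1 (my own code; it uses sentence-transformers as one concrete stand-in for the USE/ELMo/BERT encoders mentioned, and the model name is only illustrative):

        import numpy as np
        from sentence_transformers import SentenceTransformer

        sentences = ["The patient shows elevated glucose levels.",
                     "Blood sugar readings are high.",
                     "The MRI revealed no abnormality."]   # replace with your corpus

        model = SentenceTransformer("all-MiniLM-L6-v2")    # any domain-tuned encoder works
        emb = model.encode(sentences, normalize_embeddings=True)

        sims = emb @ emb.T                  # cosine similarity of normalized embeddings
        np.fill_diagonal(sims, -1)          # exclude self-matches
        k = min(10, len(sentences) - 1)
        topk = np.argsort(sims, axis=1)[:, -k:]   # nearest candidates per sentence
        # topk[i] is what the annotation UI would show the expert for sentence i.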
    [D] Best Tech Stack for Machine Learning Web Applications
    According to your experience deploying ML Solutions, what would be the best tech stack for deploying a web application which integrates various ML Algorithms in the background? Currently, I'm looking at FA.R.M (FastAPI, React JS, MongoDB) - but not sure what your take would be on this. Also, since most of our algorithms run on Notebooks - what would be the best practice for moving their outputs to production? Any kind of input would be appreciated. submitted by /u/XhoniShollaj [link] [comments]  ( 1 min )
  • Open

    Cosmic Creation | MASTERPIECE BATTLE - RAW VS SMOOTH
    submitted by /u/LordPewPew777 [link] [comments]
    Borealis AI Research Introduces fAux: A New Approach To Test Individual Fairness via Gradient Alignment
    Machine learning models are trained on massive datasets with hundreds of thousands, if not billions, of parameters. However, how these models translate the input parameters into results is unknown; the decision-making behavior of the model is difficult to comprehend. Furthermore, models are frequently skewed towards specific parameters due to faulty assumptions made during the machine learning process, which are difficult to detect. Researchers from Borealis AI introduced fAux, a new approach to testing fairness. They state that one approach to assessing fairness at the global level is to look at it from afar: by aggregating results across a complete population, the goal is to statistically quantify disparate treatment. The distribution of positive and negative outcomes is then tested using fairness criteria. These are simple to build and can be computed without access to the original model, because one only needs the model's predictions; historical data can even be tested. Continue reading | Research: 'fAux: Testing Individual Fairness via Gradient Alignment' submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    3D modeling directly from a picture... insane speed and quality
    submitted by /u/the_anonymizer [link] [comments]
    AI helps track the health of coral reefs by learning the "song of the reef"
    submitted by /u/qptbook [link] [comments]
    Moving Local AI/ML Experiments to the Cloud with Terraform Plugin - Tutorial
    Training a machine learning model locally is a quick & easy way to set up a new project on a local machine. This is sufficient for simple experiments (with reduced data subsets or small models) without paying to rent heavy cloud compute resources, which can also be intimidating even with a decent background in DevOps. Once you have set up locally and iterated over your data & code enough, you may reach a point where more powerful compute resources are needed to train a larger model and/or use bigger datasets, with a methodology explained in the following guide: Moving Local Experiments to the Cloud with Terraform Provider Iterative (TPI) submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 1 min )
    Meta's new muscle AI could power next-gen avatars
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    Experiential Learning from Sequential Data - Anton Kolonin - OpenCog AGI Discussion
    submitted by /u/akolonin [link] [comments]
  • Open

    Developing a Python Program Using Inspection Tools
    Python is an interpreted language: an interpreter runs our program rather than the code being compiled to run natively. In Python, a REPL (read-eval-print loop) can run commands line by line. Together with some inspection tools provided by Python, this helps in developing code. In the following, you will see how […] The post Developing a Python Program Using Inspection Tools appeared first on Machine Learning Mastery.  ( 16 min )
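    As a taste of the kind of inspection the post refers to (a minimal sketch of my own, using only the standard library's inspect module):

        import inspect

        def greet(name: str, excited: bool = False) -> str:
            """Return a greeting."""
            return f"Hello, {name}{'!' if excited else '.'}"

        print(inspect.signature(greet))  # (name: str, excited: bool = False) -> str
        print(inspect.getdoc(greet))     # Return a greeting.
        print(inspect.getsource(greet))  # the function's own source code

    Run as a script, these calls let you examine the code you are developing without any extra tooling.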
  • Open

    "Flexible Diffusion Modeling of Long Videos", Harvey et al 2022 (Minecraft, CARLA self-driving car, DMLab video modeling: stable 1h-long video samples)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    How to memorize the ASCII table
    Before discussing how you could memorize a table of ASCII characters and numeric values, I should say a little about why you might do so. One reason is simply for the challenge. It's more doable than it may sound. It's also useful information, though it's debatable whether it's worth memorizing. YMMV. There was a time […] How to memorize the ASCII table first appeared on John D. Cook.  ( 4 min )

  • Open

    [R] Reconnaissance Blind Chess - Join the NeurIPS Competition!
    Create a bot for the NeurIPS 2022 competition in Reconnaissance Blind Chess! Reconnaissance Blind Chess is a chess variant designed for new research in artificial intelligence. RBC includes imperfect information, long-term strategy, explicit observations, and almost no common knowledge. These features appear in real-world scenarios and challenge even state-of-the-art algorithms, including those used to create super-human bots in chess, Go, and poker, for example. Each player of RBC controls traditional chess pieces but cannot directly see the locations of her opponent's pieces; rather, she learns partial information each turn by privately sensing a 3x3 area of the board. RBC's foundation in traditional chess makes it familiar and entertaining to human players, too! There is no cost to enter this tournament. Winners will receive a small monetary prize, and authors of the best AIs will be invited to talk about their bots at NeurIPS, the world's largest AI conference. Learn more, play a game of RBC yourself, and join our research community at https://rbc.jhuapl.edu ! Organized by: Johns Hopkins University Applied Physics Laboratory with Ashley J. Llorens (Microsoft Research), Todd W. Neller (Gettysburg College), Raman Arora (Johns Hopkins University), Bo Li (University of Illinois), and Mykel J. Kochenderfer (Stanford University). submitted by /u/rwgardner [link] [comments]  ( 1 min )
    [D] How to train a model to identify ranked classes?
    Hi, I am trying to train a model to estimate the severity of an image in classes like "normal", "mild", "moderate", "severe". One approach would be multiclass classification, but that seems suboptimal since it doesn't encode the knowledge that the classes are not unordered but ranked (i.e., normal < mild < moderate < severe). Another approach is to encode these classes as a number (i.e., normal=0, mild=1, moderate=2, severe=3) and perform regression. This seems sensible, but I have never seen it done. Is there any literature on this topic? Is there another approach I am missing? submitted by /u/rsandler [link] [comments]  ( 1 min )
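    A sketch of the regression encoding the post describes (my own code; the features, labels, and choice of Ridge are purely illustrative): map the ranked classes to integers, fit a regressor, then round and clip predictions back into classes:

        import numpy as np
        from sklearn.linear_model import Ridge

        classes = ["normal", "mild", "moderate", "severe"]
        labels = ["normal", "severe", "mild", "moderate"]
        y = np.array([classes.index(c) for c in labels])
        X = np.random.rand(len(y), 16)       # stand-in for extracted image features

        reg = Ridge().fit(X, y)
        pred = np.clip(np.rint(reg.predict(X)), 0, len(classes) - 1).astype(int)
        print([classes[p] for p in pred])    # predictions mapped back to ranked labels

    This way, mispredicting "severe" as "moderate" costs less than mispredicting it as "normal", which is exactly the ordering information plain multiclass classification throws away.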
    [P] TensorFlow Similarity 0.16 is out
    Happy Friday! Just a quick note that TensorFlow Similarity 0.16 is out. Besides adding the XMB loss, this release is mostly focused on refactoring and optimizing the core components to ensure everything works smoothly and accurately. Details are in the changelog as usual, and a simple pip install -U tensorflow_similarity should just work. We spent a lot of time behind the scenes making sure SOTA paper results can be reproduced, and fixed a lot of bugs (including in augmentations) that should give you an accuracy boost compared to earlier releases. Next, we're going to keep working toward providing strong foundations and extensive benchmarking capabilities so you can rely on the library for your research. The last missing piece before 1.0 is how we do storage, so it scales past 10M points and works with many ANN backends. If you are interested in helping, let us know. Have a great weekend! submitted by /u/ebursztein [link] [comments]  ( 1 min )
    [R] Flexible Diffusion Modeling of Long Videos
    Paper. Abstract: We present a framework for video modeling based on denoising diffusion probabilistic models that produces long-duration video completions in a variety of realistic environments. We introduce a generative model that can at test-time sample any arbitrary subset of video frames conditioned on any other subset and present an architecture adapted for this purpose. Doing so allows us to efficiently compare and optimize a variety of schedules for the order in which frames in a long video are sampled and use selective sparse and long-range conditioning on previously sampled frames. We demonstrate improved video modeling over prior work on a number of datasets and sample temporally coherent videos over 25 minutes in length. We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA self-driving car simulator. Blog post (includes generated videos). Twitter thread from some of the authors. submitted by /u/Wiskkey [link] [comments]  ( 1 min )
    On the Paradox of Learning to Reason from Data - Language models only learn a facsimile of reasoning based off of inherent statistical features
    submitted by /u/stressed-nb [link] [comments]  ( 1 min )
    [P] BrainAgent: Open Source for SOTA Performance on DMLab-30 of Multi-Task RL !
    Hello. I'd like to introduce an awesome project, "Brain Agent." github: https://github.com/kakaobrain/brain_agent Brain Agent is a codebase for large-scale RL. The key contribution is the SOTA result and open-sourced code and checkpoints for the DMLab-30 environment. DMLab-30 for multi-task RL: DMLab-30 is an environment for multi-task RL, consisting of 30 different tasks, developed by DeepMind. The tasks are hard to solve and important for multi-task RL research, but there was no reproducible codebase for SOTA performance on it. As a result, SOTA performance has always been reported in papers by DeepMind, and RL researchers outside DeepMind could not conduct cutting-edge research on the environment. In this project, Kakao Brain succeeded in achieving state-of-the-art performance on DMLab-30. In addition, we released the code for evaluation and pretrained checkpoints, and hope our project helps RL researchers focus on how to solve the difficult multi-task RL tasks in DMLab-30. Reported performance of released checkpoints. Enjoy and ⭐️ submitted by /u/leedoyup [link] [comments]  ( 1 min )
    [D] Can AI Replace Our Graphic Designer?
    A nice video by MKBHD that gives some really nice insights on the effect that DALL-E 2 or ML will have in the future. Considering that the model can make multiple examples in a very short space of time, things are getting very interesting and scary, I will say. https://youtu.be/MwAAH9tBoMg submitted by /u/takuonline [link] [comments]  ( 1 min )
    Feasibility of aggregating text messages to train a system for natural language/chat bot [D]
    I wonder if this is possible. I know getting messages from random people could create a lot of noise and varied inputs, but does it seem feasible to clean/prepare texts in such a way as to use them for this purpose? submitted by /u/WordJord [link] [comments]  ( 1 min )
    [D] good datasets for evaluating ranking models
    What are some good datasets for evaluating ranking models these days? I have seen Criteo and MovieLens-1M in Ed Chi's team's papers. Criteo doesn't seem to be available anymore. Any other things people have been using? Please advise. Thanks. submitted by /u/scan33scan33 [link] [comments]
    [R] Training ReLU networks to high uniform accuracy is intractable
    PDF on ResearchGate / arXiv Abstract: Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results. submitted by /u/julbern [link] [comments]  ( 1 min )
    [P] Solving Sudoku in real-time using a Convolutional Neural Network and OpenCV
    Article and a source code: https://dmitryelj.medium.com/solving-sudoku-in-real-time-using-a-convolutional-neural-network-and-opencv-e47a92478dce submitted by /u/DmitriiElj [link] [comments]
    [P] Making the annotation part of Hasty free
    We all know that you need large quantities of high-quality data to succeed in AI. But producing that data is expensive. Until now. Whatever you are paying for labeling is too much with our new release. Data labeling takes anywhere from 35 to 80% of project budgets. We drastically reduce the cost by giving you free access to all our labeling automation features without imposing usage or user limits, and it comes with no strings attached*. Read more about what we offer and how it compares with the competition here: https://hasty.ai/content-hub/articles/making-labeling-tooling-free?utm_source=2884ka1 *Disclaimer: Our free plan has an upper limit of 30GB of storage. submitted by /u/treebeard_hasty_ai [link] [comments]  ( 1 min )
    [D] I don't really trust papers out of "Top Labs" anymore
    I mean, I trust that the numbers they got are accurate and that they really did the work and got the results. I believe those. It's just that, take the recent "An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems" paper. It's 18 pages of talking through this pretty convoluted evolutionary and multitask learning algorithm; it's pretty interesting and solves a bunch of problems. But two notes. One, the big number they cite as the success metric is 99.43 on CIFAR-10, against a SotA of 99.40, so woop-de-fucking-doo in the grand scheme of things. Two, there's a chart towards the end of the paper that details how many TPU core-hours were used for just the training regimens that produced the final results. The sum total is 17,810 core-hours. Let's …  ( 6 min )
    [D] Are there any existing algorithms that apply kNN in a bootstrapping manner?
    Just out of curiosity, I was wondering what would happen if we combined kNN and bootstrapping in a certain manner. Specifically, say we have N points that we want to classify and a labeled set of points used for kNN. After applying kNN to these N points individually, we iterate over a process defined as:

        repeat until class changes are arbitrarily minimal:
            for each point among the N observed points:
                look at its L nearest neighbors within the N observed points
                alter the class of the current point by majority vote of the L neighbors

    In other words, applying kNN iteratively on unobserved points until "convergence" (i.e., until the assignment change of points is minimal). Going ahead with the idea, I implemented a quick snippet of code to test this on some toy datasets, and for datasets where vanilla kNN (without altering the representation of the data) obtains an accuracy of 65%, this iterative method improves it a bit, up to around 70%. I did a quick search for a similar existing algorithm and couldn't find anything, but this seems like a simple enough idea that others must have considered it in the past. EDIT: Not too relevant, but it might help to add that this was inspired a bit by PU Learning. submitted by /u/TrepidEd0601 [link] [comments]  ( 2 min )
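    A quick sketch of the described loop (my own code on a toy two-moons dataset; the idea closely resembles iterated self-training / label propagation with hard labels):

        import numpy as np
        from sklearn.datasets import make_moons
        from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

        X_lab, y_lab = make_moons(200, noise=0.3, random_state=0)  # labeled set
        X_new, y_new = make_moons(200, noise=0.3, random_state=1)  # points to classify

        # Step 0: vanilla kNN from the labeled set.
        pred = KNeighborsClassifier(n_neighbors=5).fit(X_lab, y_lab).predict(X_new)

        # Iterate: re-vote each point's class using its L=5 neighbors among the new points.
        nn = NearestNeighbors(n_neighbors=6).fit(X_new)  # 6 = the point itself + 5 neighbors
        _, idx = nn.kneighbors(X_new)
        for _ in range(20):
            new_pred = np.array([np.bincount(pred[i[1:]]).argmax() for i in idx])
            changed = (new_pred != pred).mean()
            pred = new_pred
            if changed < 0.01:               # convergence: few labels changed
                break
        print("accuracy:", (pred == y_new).mean())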
    Accurate detection of sepsis at ED triage using machine learning with clinical natural language processing (Preprint open for comments)
    submitted by /u/creilly94010 [link] [comments]
    [Discussion] How does the SimCLR loss function not penalize image belonging to the same class?
    I have a question about SimCLR that I have not been able to understand. In the numerator of the SimCLR loss function, $z_i$ is the original image and $z_j$ is the augmented version of $z_i$; we want the distance between those to be small. Similarly, in the denominator, $z_k$ for k = 1:2N, k =/= i, indexes all other images in the batch; those are going to be pushed away from $z_i$. This is fine, but what guarantee do we have that k won't belong to an image of the same class as image i? The way this is structured, we will also end up pushing away images of the same class. Thanks submitted by /u/Ayakalam [link] [comments]  ( 2 min )
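    For reference, the SimCLR (NT-Xent) loss for a positive pair $(i, j)$ in a batch of $N$ images (so $2N$ augmented views) is

    $$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1,\, k \neq i}^{2N} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},$$

    and indeed the denominator treats every other view in the batch as a negative, with no guarantee that a negative is not of the same class; the loss relies on the batch being a diverse sample. (Using labels to define positives, where available, is the fix proposed by supervised contrastive learning.)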
  • Open

    Researchers From Imperial College London introduce TsT-GAN: A Novel Framework For Training Time-Series Generative Models
    Nowadays, data is considered the fuel of the data analytics field. Real-world applications require time-series data for analysis and future prediction, but they usually lack sufficient data for analysis; hence, various data augmentation techniques need to be adopted. Researchers from Imperial College London introduce a framework called TsT-GAN, based on generative adversarial networks (GANs), that is used to augment time-series data. It aims to fulfil objectives such as capturing the stepwise conditional distributions of real sequences and modelling the joint distribution of entire sequences. The paper's significant contribution is a model with a generator that can produce entire sequences from the joint distribution while respecting the conditional structure. The training framework can be applied to any time-series dataset and is evaluated quantitatively with a train-on-synthetic, test-on-real approach, and qualitatively using t-SNE. Continue Reading | Check out the paper and related codes. submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    In a regular ANN, do the weights and biases have to be randomized at the start for backpropagation to work?
    Hi! I'm currently working on a library for creating and training neural networks; however, my backpropagation function seems to be broken. I have gone back and forth through the algorithms, and they seem to run the correct mathematical operations, yet it doesn't work. My test involves two patterns as input, and the expected output is 1, 0 for the first pattern and 0, 1 for the other. The test network has 1 hidden layer: 4 input neurons, 2 hidden neurons, and 2 output neurons. When I checked my debug logs, I saw that the derivative of the hidden layer was 0. I realized this was, at least in my implementation, due to the two outputs cancelling each other out. Is this supposed to happen? Am I just supposed to randomize all values at the start, or is my implementation wrong? submitted by /u/WellWhatDoIPutHere [link] [comments]  ( 1 min )
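    For what it's worth, a tiny numpy sketch (my own) of the symmetry problem this sounds like: if the weights start identical, every hidden unit computes the same activation and receives the same gradient, so the units can never differentiate; randomly initializing the weights breaks the symmetry (biases alone can start at zero):

        import numpy as np

        def hidden_grads(W1, w2, x, y):
            # Forward/backward through a 2-2-1 tanh network with squared-error loss.
            h = np.tanh(W1 @ x)                # hidden activations (2 units)
            out = w2 @ h                       # scalar output
            d_out = out - y                    # dLoss/d_out
            d_pre = (w2 * d_out) * (1 - h**2)  # backprop through tanh
            return np.outer(d_pre, x)          # gradient w.r.t. W1, one row per hidden unit

        x, y = np.array([1.0, 2.0]), 1.0
        w2 = np.array([0.5, 0.5])
        print(hidden_grads(np.full((2, 2), 0.5), w2, x, y))         # identical init: rows match
        print(hidden_grads(np.random.randn(2, 2) * 0.1, w2, x, y))  # random init: rows differ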
    A Long Short-Term Memory for AI Applications in Spike-based Neuromorphic Hardware
    submitted by /u/nnnaikl [link] [comments]  ( 1 min )
  • Open

    Traditional Reinforcement Learning versus POMDP.
    What exactly is the relationship between partial observability of states and the reinforcement learning problem? Sutton and Barto address partial observability only briefly, for about 2 pages in the back chapters, and their description is that there is some latent space of unobserved states. But their description makes it sound like this is some kind of "extension" to RL, rather than something that affects the core mechanics of an RL agent. It seems to me that POMDPs act on the RL problem in a different way than traditional RL agents do, even down to how they construct their Q network and how they go about producing their policy network. In one sentence: a traditional RL agent explores "dumb" and a POMDP agent explores "smart". I will give two examples below. #POMDPs reason about un-visited s…  ( 3 min )
    Offered a position as an RL Engineer - Seeking Advice
    Hi all, I was recently offered a full-time role as an RL engineer (I am using the term engineer because I do not have a PhD, so I wouldn't in good faith qualify myself as a "researcher"; however, I will be doing R&D for the company on SOTA RL). I come from a standard MLE background. I have just over 3 years of experience, and my master's in machine learning is almost complete. I do have experience in RL from academia: DQN from scratch, QMIX from scratch, etc. RL is a strong interest of mine and I would love to pursue it. I do have a competing MLE offer from another company, and I am wondering if it would be too risky to accept a full-time RLE role. I believe RL has massive potential in the future; however, I don't want to shoot myself in the foot by specializing for a few years on the wrong topic. Any advice would be appreciated. Also, fwiw, I am being considered for a full-time NLP engineering role as well; should that come through, I will also have that to consider. I am a big fan of transformer tech. Any advice is appreciated. v/r, submitted by /u/OpenSource-AI [link] [comments]  ( 2 min )
    BrainAgent: Open Source for SOTA Performance on DMLab-30 of Multi-Task RL !
    Hello. I'd like to introduce an awesome project, "Brain Agent." github: https://github.com/kakaobrain/brain_agent Brain Agent is a codebase for large-scale RL. The key contribution is the SOTA result and open-sourced code and checkpoints for the DMLab-30 environment. DMLab-30 is an environment for multi-task RL, consisting of 30 different tasks, developed by DeepMind. The tasks are hard to solve and important for multi-task RL research, but there was no reproducible codebase for SOTA performance on it. As a result, SOTA performance has always been reported in papers by DeepMind, and RL researchers outside DeepMind could not conduct cutting-edge research on the environment. In this project, Kakao Brain succeeded in achieving state-of-the-art performance on DMLab-30. In addition, we released the code for evaluation and pretrained checkpoints, and hope our project helps RL researchers focus on how to solve the difficult multi-task RL tasks in DMLab-30. Reported performance of released checkpoint. Enjoy and ⭐️ submitted by /u/leedoyup [link] [comments]  ( 1 min )
    In the spirit of #throwbackthursday, this is some early concept art for the training cloud where all the magic happens in our game powered by Reinforcement Learning, Animo Island!
    submitted by /u/AnimoIsland [link] [comments]  ( 1 min )
    RL Simulators and Frameworks
    I just started working on a personal project in which I want to solve an RL task, and I have to create a whole new environment for it. I thought of using MuJoCo as the simulator, but it seems to be pretty difficult to install (it was recently acquired by DeepMind) and difficult to interface with RL frameworks/libraries. At the moment, what simulator and RL library (algorithms) would you recommend? If it is easy to create a Docker container with GPU support, that would be even better. submitted by /u/jeferal [link] [comments]  ( 1 min )
    Multi-agent path planning
    So, as part of my work, I am trying to tackle the multi-agent path planning problem. I have already tried a few optimization techniques like PSO (which did not give good results) and genetic algorithms like NEAT (which gave decent results, but with room for improvement), so I wanted to know: has anyone worked on this problem before, what did they use, and what kind of results did they get? PS: I am currently testing machine learning techniques for this, like imitation learning, and maybe after that I might test RL, so if anyone has tried those for this problem, I would love to know what they ended up getting. submitted by /u/temp_phd [link] [comments]  ( 1 min )
    Every day, I was tossing and turning, thinking about how I made such a rubbish graduation project. The DRL is really hard😭
    submitted by /u/ecstayalive [link] [comments]  ( 1 min )
  • Open

    It’s All About Data: The Training Methods of Deep Learning
    In deep learning, there are different training methods. Which one we use in an AI project depends on the data provided by our customer: how…  ( 2 min )
  • Open

    University of Negev Researchers Develop DeepDPM: A Deep Clustering Algorithm With An Unknown Number Of Clusters
    Clustering is an essential unsupervised learning job in which class labels are not accessible, unlike in the supervised situation of classification. Furthermore, the number of classes, denoted K, and their relative sizes are unknown in the totally unsupervised setting on which this research focuses. Clustering tasks have not been overlooked by deep learning (DL): large and high-dimensional datasets are typically clustered better and more efficiently by DL approaches than by traditional clustering methods. However, while nonparametric approaches have advantages over parametric methods (methods that require a known K) in classical clustering, there are just a few nonparametric deep clustering methods, and, unfortunately, the latter are neither scalable nor effective enough. It is advantageous to be able to infer the latent K: parametric approaches may perform poorly if K is not accurately estimated, and in both balanced and unbalanced datasets, using the wrong K can have a major negative impact on parametric approaches. Continue Reading | Check out the paper and codes submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    The past two years went down in a blink because of some pandemic? Check out this 2021 recap of the most exciting advancements in the AI field to see what you may have missed out on!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    AI and data science resourse
    Hello world! I want to share with you my Telegram channel dedicated to everything concerning AI and data science (from math and statistics to programming and machine learning). I don't promise you brand-new information; I just collect knowledge from different sources and share it in a nice, structured way in order to make your learning process easier. I would love to see you there! Also, I am open to criticism, so feel free to tell me your opinion! submitted by /u/NordicDude49 [link] [comments]  ( 1 min )
    Iterative to Launch Open Source Tool, First to Train ML Models on Any Cloud Using Terraform Solution
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    Beginner; looking for AI learning buddy
    Hey there! Just started getting into machine learning. Currently learning linear regression. Would love to find a buddy to talk AI with and learn it together with (bonus if you know some stuff about hardware as well). I personally adapt the common philosophy of "if you can't explain it then you don't understand it" so if you believe we have the same goal and mindset, hit me up! submitted by /u/Signal_Warrior [link] [comments]  ( 1 min )
    Olivier Levasseur’s Treasure
    Would it be possible to program an AI that can decipher any hidden code? Or is it not? submitted by /u/I__like_bagels [link] [comments]
  • Open

    TNN7: A Custom Macro Suite for Implementing Highly Optimized Designs of Neuromorphic TNNs. (arXiv:2205.07410v2 [cs.AR] UPDATED)
    Temporal Neural Networks (TNNs), inspired by the mammalian neocortex, exhibit energy-efficient online sensory processing capabilities. Recent works have proposed a microarchitecture framework for implementing TNNs and demonstrated competitive performance on vision and time-series applications. Building on these previous works, this work proposes TNN7, a suite of nine highly optimized custom macros developed using a predictive 7nm Process Design Kit (PDK), to enhance the efficiency, modularity and flexibility of the TNN design framework. TNN prototypes for two applications are used for the evaluation of TNN7. An unsupervised time-series clustering TNN delivering competitive performance can be implemented within 40 uW power and 0.05 mm^2 area, while a 4-layer TNN that achieves an MNIST error rate of 1% consumes only 18 mW and 24.63 mm^2. On average, the proposed macros reduce power, delay, area, and energy-delay product by 14%, 16%, 28%, and 45%, respectively. Furthermore, employing TNN7 significantly reduces the synthesis runtime of TNN designs (by more than 3x), allowing for highly-scaled TNN implementations to be realized.
    Adaptive Fairness-Aware Online Meta-Learning for Changing Environments. (arXiv:2205.11264v2 [cs.LG] UPDATED)
    The fairness-aware online learning framework has arisen as a powerful tool for the continual lifelong learning setting. The goal for the learner is to sequentially learn new tasks, which arrive one after another over time, while ensuring statistical parity of each new task across different protected sub-populations (e.g. race and gender). A major drawback of existing methods is that they make heavy use of the i.i.d. assumption for the data and hence provide static regret analysis for the framework. However, low static regret does not imply good performance in changing environments where tasks are sampled from heterogeneous distributions. To address the fairness-aware online learning problem in changing environments, in this paper we first construct a novel regret metric, FairSAR, by adding long-term fairness constraints onto a strongly adapted loss regret. Furthermore, to determine a good model parameter at each round, we propose a novel adaptive fairness-aware online meta-learning algorithm, namely FairSAOML, which is able to adapt to changing environments in both bias control and model precision. The problem is formulated as a bi-level convex-concave optimization with respect to the model's primal and dual parameters, which are associated with the model's accuracy and fairness, respectively. The theoretical analysis provides sub-linear upper bounds for both the loss regret and the violation of cumulative fairness constraints. Our experimental evaluation on different real-world datasets with changing environments suggests that the proposed FairSAOML significantly outperforms alternatives based on the best prior online learning approaches.
    CNNs are Myopic. (arXiv:2205.10760v2 [cs.CV] UPDATED)
    We claim that Convolutional Neural Networks (CNNs) learn to classify images using only small seemingly unrecognizable tiles. We show experimentally that CNNs trained only using such tiles can match or even surpass the performance of CNNs trained on full images. Conversely, CNNs trained on full images show similar predictions on small tiles. We also propose the first a priori theoretical model for convolutional data sets that seems to explain this behavior. This gives additional support to the long standing suspicion that CNNs do not need to understand the global structure of images to achieve state-of-the-art accuracies. Surprisingly it also suggests that over-fitting is not needed either.
    Representation learning with function call graph transformations for malware open set recognition. (arXiv:2205.06918v2 [cs.CR] UPDATED)
    Open set recognition (OSR) has been a challenge in many machine learning (ML) applications, such as security. As new/unknown malware families occur regularly, it is difficult to exhaust samples that cover all the classes for the training process in ML systems. An advanced malware classification system should classify the known classes correctly while remaining sensitive to the unknown class. In this paper, we introduce a self-supervised pre-training approach for the OSR problem in malware classification. We propose two transformations for the function call graph (FCG) based malware representations to facilitate the pretext task. Also, we present a statistical thresholding approach to find the optimal threshold for the unknown class. Moreover, the experimental results indicate that our proposed pre-training process can improve the performance of different downstream loss functions on the OSR problem.
    Nonparametric likelihood-free inference with Jensen-Shannon divergence for simulator-based models with categorical output. (arXiv:2205.10890v2 [stat.ME] UPDATED)
    Likelihood-free inference for simulator-based statistical models has recently attracted a surge of interest, both in the machine learning and statistics communities. The primary focus of these research fields has been to approximate the posterior distribution of model parameters, either by various types of Monte Carlo sampling algorithms or by deep neural network based surrogate models. Frequentist inference for simulator-based models has been given much less attention to date, despite being particularly amenable to applications with big data, where implicit asymptotic approximation of the likelihood is expected to be accurate and can leverage computationally efficient strategies. Here we derive a set of theoretical results to enable estimation, hypothesis testing and construction of confidence intervals for model parameters using asymptotic properties of the Jensen-Shannon divergence. Such asymptotic approximation offers a rapid alternative to more computation-intensive approaches and can be attractive for diverse applications of simulator-based models.
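    For reference (the standard definition, not specific to this paper), the Jensen-Shannon divergence between distributions $P$ and $Q$ is

    $$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q),$$

    a symmetrized and bounded variant of the Kullback-Leibler divergence.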
    Domain Adversarial Graph Convolutional Network Based on RSSI and Crowdsensing for Indoor Localization. (arXiv:2204.05184v2 [cs.NI] UPDATED)
    In recent years, due to wider WiFi coverage and the popularization of mobile communication devices, indoor positioning technology using WiFi fingerprints has developed rapidly. Currently, most supervised methods need to collect a large amount of data to construct fingerprint datasets, which is labor-intensive and time-consuming. In addition, many studies have focused on the ideal laboratory environment and lack consideration of practical application environments, especially the scenario of multiple large multi-floor buildings. To solve these problems, we propose a novel WiDAGCN model which can be trained with a few labeled site-survey data and unlabeled crowdsensed WiFi fingerprints. To comprehensively represent the topology of the data, we construct heterogeneous graphs according to the received signal strength indicators (RSSIs) between the waypoints and WiFi access points (APs). Moreover, previous WiFi indoor localization studies rarely involved complete graph feature representation, so we use a graph convolutional network (GCN) to extract graph-level embeddings. There remain some difficult problems, for example, a large amount of unlabeled data that cannot be used by a supervised model, and the existence of multiple data domains leading to inconsistent data distributions. Therefore, a semi-supervised domain adversarial training scheme is used to make full use of the unlabeled data and align the data distributions of different domains. A public indoor localization dataset containing different buildings is used to evaluate the performance of the model. The experimental results show that our system can achieve competitive localization accuracy in large buildings such as shopping malls.
    A Multi-Stage Duplex Fusion ConvNet for Aerial Scene Classification. (arXiv:2203.16325v2 [cs.CV] UPDATED)
    Existing deep learning based methods have effectively boosted the performance of aerial scene classification. However, due to the large number of parameters and the computational cost, it is rather difficult to apply these methods to multiple real-time remote sensing applications such as on-board data perception on drones and satellites. In this paper, we address this task by developing a lightweight ConvNet named the multi-stage duplex fusion network (MSDF-Net). The key idea is to use as few parameters as possible while obtaining scene representation capability that is as strong as possible. To this end, a residual-dense duplex fusion strategy is developed to enhance feature propagation while re-using parameters as much as possible, and is realized by our duplex fusion block (DFblock). Specifically, our MSDF-Net consists of multi-stage structures with DFblocks. Moreover, a duplex semantic aggregation (DSA) module is developed to mine remote sensing scene information from the extracted convolutional features; it also contains two parallel branches for semantic description. Extensive experiments are conducted on three widely-used aerial scene classification benchmarks and show that our MSDF-Net achieves competitive performance against the recent state-of-the-art while reducing parameter counts by up to 80%. In particular, an accuracy of 92.96% is achieved on AID with only 0.49M parameters.
    Near-Optimal Sparse Allreduce for Distributed Deep Learning. (arXiv:2201.07598v2 [cs.DC] UPDATED)
    Communication overhead is one of the major obstacles to train large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of designing a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than 6$k$ communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that O$k$-Top$k$ achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, O$k$-Top$k$ is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
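    As a rough illustration of the threshold-based selection idea, the following Python sketch (ours, not the paper's implementation; the quantile-based threshold estimate is a stand-in for the paper's estimator) keeps only the gradient entries whose magnitude exceeds an estimated threshold, avoiding a full sort:

        import torch

        def threshold_topk(grad: torch.Tensor, threshold: float):
            # Keep entries whose magnitude exceeds the estimated threshold,
            # approximating exact top-k selection without sorting everything.
            mask = grad.abs() >= threshold
            idx = mask.nonzero(as_tuple=True)[0]
            return idx, grad[idx]

        g = torch.randn(10_000)
        # Illustrative threshold from an empirical quantile: keep ~1% of entries.
        thr = g.abs().quantile(0.99).item()
        idx, vals = threshold_topk(g, thr)
        print(idx.numel(), "entries kept out of", g.numel())

    Only the selected (index, value) pairs would then enter the sparse allreduce, which is where the communication savings come from.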
    ChemicalX: A Deep Learning Library for Drug Pair Scoring. (arXiv:2202.05240v3 [cs.LG] UPDATED)
    In this paper, we introduce ChemicalX, a PyTorch-based deep learning library designed to provide a range of state-of-the-art models for the drug pair scoring task. The primary objective of the library is to make deep drug pair scoring models accessible to machine learning researchers and practitioners in a streamlined framework. The design of ChemicalX reuses existing high-level model training utilities, geometric deep learning, and deep chemistry layers from the PyTorch ecosystem. Our system provides neural network layers, custom pair scoring architectures, data loaders, and batch iterators for end users. We showcase these features with example code snippets and case studies to highlight the characteristics of ChemicalX. A range of experiments on real-world drug-drug interaction, polypharmacy side effect, and combination synergy prediction tasks demonstrate that the models available in ChemicalX are effective at solving the pair scoring task. Finally, we show that ChemicalX can be used to train and score machine learning models on large drug pair datasets with hundreds of thousands of compounds on commodity hardware.
    Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations. (arXiv:2204.04773v2 [stat.ML] UPDATED)
    Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm consists of the inner product of an unknown parameter with the context vector of that arm. Classical bandit settings rely heavily on the assumption that the contexts are fully observed, while the richer model of imperfectly observed contextual bandits remains understudied. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analysis showcasing the above efficiency of Greedy policies is also provided.
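    A minimal simulation of such a Greedy policy, under simplifying assumptions of our own (ridge-regression parameter estimates, and the noisy observation used directly as the context estimate, whereas the paper treats context estimation more carefully), might look like:

        import numpy as np

        rng = np.random.default_rng(0)
        d, K, T = 5, 3, 2000
        theta = rng.normal(size=d)                          # unknown parameter
        A, b = np.eye(d), np.zeros(d)                       # ridge statistics

        for t in range(T):
            x_true = rng.normal(size=(K, d))                # true contexts
            x_obs = x_true + 0.1 * rng.normal(size=(K, d))  # imperfect observations
            theta_hat = np.linalg.solve(A, b)               # current estimate
            a = int(np.argmax(x_obs @ theta_hat))           # greedy: act as if exact
            r = x_true[a] @ theta + 0.1 * rng.normal()      # observed reward
            A += np.outer(x_obs[a], x_obs[a])
            b += r * x_obs[a]

        print("parameter estimation error:", np.linalg.norm(theta_hat - theta))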
    Lorentzian Fully Hyperbolic Generative Adversarial Network. (arXiv:2201.12825v2 [cs.LG] UPDATED)
    With recent advances in deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. While a variety of hyperbolic neural network structures have been proposed, they mainly focus on discriminative tasks, and generative models in the hyperbolic space have scarcely been studied. In this work, we propose a hyperbolic generative adversarial network (GAN) within the Lorentz model for generating hyperbolic data. In addition to existing hyperbolic operations, we design novel hyperbolic layers to guarantee stable training. We first use synthetic data to show that our network is able to learn simple distributions in the hyperbolic space. Moreover, by virtue of an autoencoder, we construct a neural network model, named HAEGAN, for generating more complex data in the hyperbolic space. HAEGAN contains three parts: first, a hyperbolic autoencoder; second, a hyperbolic GAN for generating the latent embedding of the autoencoder; third, a generator that inherits the decoder from the autoencoder and the generator from the GAN. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.
    Incremental Inference on Higher-Order Probabilistic Graphical Models Applied to Constraint Satisfaction Problems. (arXiv:2202.12916v2 [cs.LG] UPDATED)
    Probabilistic graphical models (PGMs) are tools for reasoning about complex probabilistic relationships. However, suboptimal PGM structures are primarily used in practice. This dissertation presents three contributions to the PGM literature. The first is a comparison between factor graphs and cluster graphs on graph colouring problems such as Sudokus, indicating a significant advantage for preferring cluster graphs. The second is an application of cluster graphs to a practical problem in cartography: land cover classification boosting. The third is a PGM formulation of constraint satisfaction problems and an algorithm called purge-and-merge to solve problems too complex for traditional PGMs.
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v2 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.
    FedBalancer: Data and Pace Control for Efficient Federated Learning on Heterogeneous Clients. (arXiv:2201.01601v2 [cs.LG] UPDATED)
    Federated Learning (FL) trains a machine learning model on distributed clients without exposing individual data. Unlike centralized training that is usually based on carefully-organized data, FL deals with on-device data that are often unfiltered and imbalanced. As a result, the conventional FL training protocol that treats all data equally leads to a waste of local computational resources and slows down the global learning process. To address this, we propose FedBalancer, a systematic FL framework that actively selects clients' training samples. Our sample selection strategy prioritizes more "informative" data while respecting the privacy and computational capabilities of clients. To better utilize the sample selection to speed up global training, we further introduce an adaptive deadline control scheme that predicts the optimal deadline for each round with varying client training data. Compared with existing FL algorithms with deadline configuration methods, our evaluation on five datasets from three different domains shows that FedBalancer improves the time-to-accuracy performance by 1.20~4.48x while improving the model accuracy by 1.1~5.0%. We also show that FedBalancer is readily applicable to other FL approaches by demonstrating that FedBalancer improves the convergence speed and accuracy when operating jointly with three different FL algorithms.
    Domain-informed neural networks for interaction localization within astroparticle experiments. (arXiv:2112.07995v2 [hep-ex] UPDATED)
    This work proposes a domain-informed neural network architecture for experimental particle physics, using particle interaction localization with the time-projection chamber (TPC) technology for dark matter research as an example application. A key feature of the signals generated within the TPC is that they allow localization of particle interactions through a process called reconstruction. While multilayer perceptrons (MLPs) have emerged as a leading contender for reconstruction in TPCs, such a black-box approach does not reflect prior knowledge of the underlying scientific processes. This paper looks anew at neural network-based interaction localization and encodes prior detector knowledge, in terms of both signal characteristics and detector geometry, into the feature encoding and the output layers of a multilayer neural network. The resulting Domain-informed Neural Network (DiNN) limits the receptive fields of the neurons in the initial feature encoding layers in order to account for the spatially localized nature of the signals produced within the TPC. This aspect of the DiNN, which has similarities with the emerging area of graph neural networks in that the neurons in the initial layers only connect to a handful of neurons in their succeeding layer, significantly reduces the number of parameters in the network in comparison to an MLP. In addition, in order to account for the detector geometry, the output layers of the network are modified using two geometric transformations to ensure the DiNN produces localizations within the interior of the detector. The end result is a neural network architecture that has 60% fewer parameters than an MLP, but that still achieves similar localization performance and provides a path to future architectural developments with improved performance because of their ability to encode additional domain knowledge into the architecture.
    Nonlinear Transform Source-Channel Coding for Semantic Communications. (arXiv:2112.10961v2 [cs.IT] UPDATED)
    In this paper, we propose a class of high-efficiency deep joint source-channel coding methods that can closely adapt to the source distribution under a nonlinear transform; we collect them under the name nonlinear transform source-channel coding (NTSCC). In the considered model, the transmitter first learns a nonlinear analysis transform to map the source data into latent space, then transmits the latent representation to the receiver via deep joint source-channel coding. Our model incorporates the nonlinear transform as a strong prior to effectively extract the source semantic features and provide side information for source-channel coding. Unlike existing conventional deep joint source-channel coding methods, the proposed NTSCC essentially learns both the source latent representation and an entropy model as the prior on the latent representation. Accordingly, novel adaptive rate transmission and hyperprior-aided codec refinement mechanisms are developed to upgrade deep joint source-channel coding. The whole system design is formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion performance under established perceptual quality metrics. Across test image sources with various resolutions, we find that the proposed NTSCC transmission method generally outperforms both the analog transmission using the standard deep joint source-channel coding and the classical separation-based digital transmission. Notably, the proposed NTSCC method can potentially support future semantic communications due to its content-aware ability and perceptual optimization goal.
    The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks. (arXiv:2108.11489v2 [stat.ML] UPDATED)
    The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk.
    Classification of COVID-19 on chest X-Ray images using Deep Learning model with Histogram Equalization and Lungs Segmentation. (arXiv:2112.02478v2 [eess.IV] UPDATED)
    Background and Objective: Artificial intelligence (AI) methods coupled with biomedical analysis have a critical role during pandemics, as they help relieve the overwhelming pressure on healthcare systems and physicians. As the ongoing COVID-19 crisis worsens in countries with dense populations and inadequate testing kits, such as Brazil and India, radiological imaging can act as an important diagnostic tool to accurately classify COVID-19 patients and prescribe the necessary treatment in due time. With this motivation, we present our study based on a deep learning architecture for detecting COVID-19 infected lungs using chest X-rays. Dataset: We collected a total of 2470 images for three different class labels, namely, healthy lungs, ordinary pneumonia, and COVID-19 infected pneumonia, of which 470 X-ray images belong to the COVID-19 category. Methods: We first pre-process all the images using histogram equalization techniques and segment them using a U-Net architecture. A VGG-16 network is then used for feature extraction from the pre-processed images, and the features are resampled with the SMOTE oversampling technique to achieve a balanced dataset. Finally, the class-balanced features are classified using a support vector machine (SVM) classifier with 10-fold cross-validation and the accuracy is evaluated. Result and Conclusion: Our novel approach, combining well-known pre-processing techniques, feature extraction methods, and a dataset balancing method, leads to an outstanding recognition rate of 98% for COVID-19 images over a dataset of 2470 X-ray images. Our model is therefore fit to be utilized in healthcare facilities for screening purposes.
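    For illustration only, the sketch below reproduces the skeleton of such a pipeline on stand-in data: histogram equalization followed by an SVM with 10-fold cross-validation (the U-Net segmentation, VGG-16 feature extraction, and SMOTE balancing of the actual study are omitted):

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        def hist_equalize(img):
            # Standard histogram equalization for an 8-bit grayscale image.
            hist = np.bincount(img.ravel(), minlength=256)
            cdf = hist.cumsum()
            cdf_min = cdf[cdf > 0].min()
            lut = (cdf - cdf_min) * 255 / (cdf.max() - cdf_min)
            return lut.clip(0, 255).astype(np.uint8)[img]

        # Random stand-in "images" and labels; real data would go here.
        X_imgs = np.random.randint(0, 256, size=(200, 32, 32), dtype=np.uint8)
        y = np.random.randint(0, 3, size=200)
        X = np.stack([hist_equalize(im).ravel() for im in X_imgs]) / 255.0
        print(cross_val_score(SVC(), X, y, cv=10).mean())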
    Evaluating Generalization in Classical and Quantum Generative Models. (arXiv:2201.08770v2 [cs.LG] UPDATED)
    Defining and accurately measuring generalization in generative models remains an ongoing challenge and a topic of active research within the machine learning community. This is in contrast to discriminative models, where there is a clear definition of generalization, i.e., the model's classification accuracy when faced with unseen data. In this work, we construct a simple and unambiguous approach to evaluate the generalization capabilities of generative models. Using the sample-based generalization metrics proposed here, any generative model, from state-of-the-art classical generative models such as GANs to quantum models such as Quantum Circuit Born Machines, can be evaluated on the same footing within a concrete, well-defined framework. In contrast to other sample-based metrics for probing generalization, we leverage constrained optimization problems (e.g., cardinality constrained problems) and use these discrete datasets to define specific metrics capable of unambiguously measuring the quality of the samples and the model's generalization capabilities for generating data beyond the training set but still within the valid solution space. Additionally, our metrics can diagnose trainability issues such as mode collapse and overfitting, as we illustrate when comparing GANs to quantum-inspired models built out of tensor networks. Our simulation results show that our quantum-inspired models have up to a $68 \times$ enhancement in generating unseen unique and valid samples compared to GANs, and a ratio of 61:2 for generating samples with better quality than those observed in the training set. We foresee these metrics as valuable tools for rigorously defining practical quantum advantage in the domain of generative modeling.
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v3 [cs.LG] UPDATED)
    Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization.
    Improved Fine-Tuning by Better Leveraging Pre-Training Data. (arXiv:2111.12292v2 [cs.CV] UPDATED)
    As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets. However, recent studies have empirically shown that training from scratch achieves final performance that is no worse than this pre-training strategy once the number of training samples is increased in some vision tasks. In this work, we revisit this phenomenon from the perspective of generalization analysis, using the excess risk bound which is popular in learning theory. The result reveals that the excess risk bound may have a weak dependency on the pre-trained model. This observation inspires us to also leverage the pre-training data during fine-tuning, since this data is typically still available at that stage. The generalization result of using pre-training data shows that the excess risk bound on a target task can be improved when the appropriate pre-training data is included in fine-tuning. With this theoretical motivation, we propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task. Extensive experimental results for image classification tasks on 8 benchmark data sets verify the effectiveness of the proposed data selection based fine-tuning pipeline.
    DeepTrack: Lightweight Deep Learning for Vehicle Path Prediction in Highways. (arXiv:2108.00505v2 [cs.LG] UPDATED)
    Vehicle trajectory prediction is essential for enabling safety-critical intelligent transportation systems (ITS) applications used in management and operations. While there have been some promising advances in the field, there is a need for modern deep learning algorithms that allow real-time trajectory prediction on embedded IoT devices. This article presents DeepTrack, a novel deep learning algorithm customized for real-time vehicle trajectory prediction and monitoring applications in arterial management, freeway management, traffic incident management, and work zone management for high-speed incoming traffic. In contrast to previous methods, the vehicle dynamics are encoded using Temporal Convolutional Networks (TCNs) to provide more robust time prediction with less computation. DeepTrack also uses depthwise convolution, which reduces the complexity of models compared to existing approaches in terms of model size and operations. Overall, our experimental results demonstrate that DeepTrack achieves comparable accuracy to state-of-the-art trajectory prediction models but with smaller model sizes and lower computational complexity, making it more suitable for real-world deployment.
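    The depthwise separable convolution DeepTrack relies on can be sketched in a few lines of PyTorch; the class below is our own illustrative implementation, not the authors' code:

        import torch
        import torch.nn as nn

        class DepthwiseSeparableConv1d(nn.Module):
            # A depthwise convolution (one filter per channel, groups=channels)
            # followed by a 1x1 pointwise convolution that mixes channels.
            # This cuts parameters and operations versus a standard Conv1d.
            def __init__(self, in_ch, out_ch, kernel_size):
                super().__init__()
                self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                           padding=kernel_size // 2, groups=in_ch)
                self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

            def forward(self, x):
                return self.pointwise(self.depthwise(x))

        x = torch.randn(8, 16, 50)   # (batch, channels, time steps)
        print(DepthwiseSeparableConv1d(16, 32, 3)(x).shape)  # [8, 32, 50]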
    Trustworthy AI: From Principles to Practices. (arXiv:2110.01167v2 [cs.AI] UPDATED)
    The rapid development of Artificial Intelligence (AI) technology has enabled the deployment of various systems based on it. However, many current AI systems are found to be vulnerable to imperceptible attacks, biased against underrepresented groups, and lacking in user privacy protection. These shortcomings degrade user experience and erode people's trust in all AI systems. In this review, we provide AI practitioners with a comprehensive guide for building trustworthy AI systems. We first introduce the theoretical framework of important aspects of AI trustworthiness, including robustness, generalization, explainability, transparency, reproducibility, fairness, privacy preservation, and accountability. To unify currently available but fragmented approaches toward trustworthy AI, we organize them in a systematic approach that considers the entire lifecycle of AI systems, ranging from data acquisition to model development, to system development and deployment, and finally to continuous monitoring and governance. In this framework, we offer concrete action items for practitioners and societal stakeholders (e.g., researchers, engineers, and regulators) to improve AI trustworthiness. Finally, we identify key opportunities and challenges for the future development of trustworthy AI systems, where we identify the need for a paradigm shift toward comprehensively trustworthy AI systems.
    FedHM: Efficient Federated Learning for Heterogeneous Models via Low-rank Factorization. (arXiv:2111.14655v2 [cs.LG] UPDATED)
    One underlying assumption of recent federated learning (FL) paradigms is that all local models usually share the same network architecture and size, which becomes impractical for devices with different hardware resources. A scalable federated learning framework should address the heterogeneity that clients have different computing capacities and communication capabilities. To this end, this paper proposes FedHM, a novel heterogeneous federated model compression framework, distributing the heterogeneous low-rank models to clients and then aggregating them into a full-rank model. Our solution enables the training of heterogeneous models with varying computational complexities and aggregates them into a single global model. Furthermore, FedHM significantly reduces the communication cost by using low-rank models. Extensive experimental results demonstrate that FedHM is superior in the performance and robustness of models of different sizes, compared with state-of-the-art heterogeneous FL methods under various FL settings. Additionally, the convergence guarantee of FL for heterogeneous devices is first theoretically analyzed.
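    The core low-rank idea can be illustrated by factorizing a single dense layer with a truncated SVD; the helper below is a hedged sketch of ours and does not reproduce FedHM's actual distribution and aggregation scheme:

        import torch
        import torch.nn as nn

        def low_rank_factorize(layer: nn.Linear, rank: int) -> nn.Sequential:
            # Replace a dense layer with two thinner layers from a truncated SVD
            # of its weight, reducing parameters and communication cost.
            W = layer.weight.data                          # shape (out, in)
            U, S, Vh = torch.linalg.svd(W, full_matrices=False)
            first = nn.Linear(W.shape[1], rank, bias=False)
            second = nn.Linear(rank, W.shape[0], bias=layer.bias is not None)
            first.weight.data = (torch.diag(S[:rank]) @ Vh[:rank]).contiguous()
            second.weight.data = U[:, :rank].contiguous()
            if layer.bias is not None:
                second.bias.data = layer.bias.data.clone()
            return nn.Sequential(first, second)

        dense = nn.Linear(512, 512)
        compact = low_rank_factorize(dense, rank=32)  # ~8x fewer weight parameters
        x = torch.randn(4, 512)
        # Truncation error (large for a random matrix; trained weights are
        # often approximately low-rank, making the error far smaller).
        print((dense(x) - compact(x)).abs().max())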
    Learning Perceptual Locomotion on Uneven Terrains using Sparse Visual Observations. (arXiv:2109.14026v2 [cs.RO] UPDATED)
    To proactively navigate and traverse various terrains, active use of visual perception becomes indispensable. We aim to investigate the feasibility and performance of using sparse visual observations to achieve perceptual locomotion over a range of common terrains (steps, ramps, gaps, and stairs) in human-centered environments. We formulate a selection of sparse visual inputs suitable for locomotion over the terrains of interest, and propose a learning framework to integrate exteroceptive and proprioceptive states. We specifically design the state observations and a training curriculum to learn feedback control policies effectively over a range of different terrains. We extensively validate and benchmark the learned policy in various tasks: omnidirectional walking on flat ground, and forward locomotion over various obstacles, showing high success rate of traversability. Furthermore, we study exteroceptive ablations and evaluate policy generalization by adding various levels of noise and testing on new unseen terrains. We demonstrate the capabilities of autonomous perceptual locomotion that can be achieved by only using sparse visual observations from direct depth measurements, which are easily available from a Lidar or RGB-D sensor, showing robust ascent and descent over high stairs of 20 cm height, i.e., 50% leg length, and robustness against noise and unseen terrains.
    RKHS-SHAP: Shapley Values for Kernel Methods. (arXiv:2110.09167v2 [stat.ML] UPDATED)
    Feature attribution for kernel methods is often heuristic and not individualised for each prediction. To address this, we turn to the concept of Shapley values~(SV), a coalition game theoretical framework that has previously been applied to different machine learning model interpretation tasks, such as linear models, tree ensembles and deep networks. By analysing SVs from a functional perspective, we propose \textsc{RKHS-SHAP}, an attribution method for kernel machines that can efficiently compute both \emph{Interventional} and \emph{Observational Shapley values} using kernel mean embeddings of distributions. We show theoretically that our method is robust with respect to local perturbations - a key yet often overlooked desideratum for consistent model interpretation. Further, we propose \emph{Shapley regulariser}, applicable to a general empirical risk minimisation framework, allowing learning while controlling the level of specific feature's contributions to the model. We demonstrate that the Shapley regulariser enables learning which is robust to covariate shift of a given feature and fair learning which controls the SVs of sensitive features.
    Towards the Generalization of Contrastive Self-Supervised Learning. (arXiv:2111.00743v3 [cs.LG] UPDATED)
    Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for training. Contrastive learning is one popular method for self-supervised learning and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a kind of $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound of the downstream classification error based on the measure. We show that the generalization ability of contrastive self-supervised learning depends on three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors can be optimized by contrastive algorithms, while the third one is determined a priori by the pre-defined data augmentation. With the above theoretical findings, we further study two canonical contrastive losses, InfoNCE and cross-correlation loss, and prove that both of them are indeed able to satisfy the first two factors. Moreover, we empirically verify the third factor by conducting various experiments on real-world datasets, and show that our theoretical inferences on the relationship between the data augmentation and the generalization of contrastive self-supervised learning agree with the empirical observations.
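    For reference, the InfoNCE loss studied here can be written compactly; the snippet below is a standard minimal implementation (ours), in which each in-batch pair (z1[i], z2[i]) is a positive and all other combinations serve as negatives:

        import torch
        import torch.nn.functional as F

        def info_nce(z1, z2, tau=0.1):
            # Minimizing this pulls positive pairs together (alignment) and
            # pushes apart other samples, encouraging divergent class centers.
            z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
            logits = z1 @ z2.t() / tau           # (B, B) similarity matrix
            labels = torch.arange(z1.size(0))    # positives on the diagonal
            return F.cross_entropy(logits, labels)

        z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
        print(info_nce(z1, z2))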
    Amplitude Mean of Functional Data on $\mathbb{S}^2$. (arXiv:2107.13721v4 [stat.ML] UPDATED)
    Manifold-valued functional data analysis (FDA) has recently become an active area of research motivated by the rising availability of trajectories or longitudinal data observed on non-linear manifolds. The challenges of analyzing such data come from many aspects, including infinite dimensionality and nonlinearity, as well as time-domain or phase variability. In this paper, we study the amplitude part of manifold-valued functions on $\mathbb{S}^2$, which is invariant to random time warping or re-parameterization. Utilizing the nice geometry of $\mathbb{S}^2$, we develop a set of efficient and accurate tools for temporal alignment of functions, geodesic computation, and sample mean calculation. At their core, these tools rely on gradient descent algorithms with carefully derived gradients. We show the advantages of these newly developed tools over their competitors with extensive simulations and real data, and demonstrate the importance of considering the amplitude part of functions instead of mixing it with phase variability in manifold-valued FDA.
    TIP: Task-Informed Motion Prediction for Intelligent Vehicles. (arXiv:2110.08750v2 [cs.RO] UPDATED)
    When predicting trajectories of road agents, motion predictors usually approximate the future distribution by a limited number of samples. This constraint requires the predictors to generate samples that best support the task given task specifications. However, existing predictors are often optimized and evaluated via task-agnostic measures without accounting for the use of predictions in downstream tasks, and thus could result in sub-optimal task performance. In this paper, we propose a task-informed motion prediction model that better supports the tasks through its predictions, by jointly reasoning about prediction accuracy and the utility of the downstream tasks, which is commonly used to evaluate the task performance. The task utility function does not require the full task information, but rather a specification of the utility of the task, resulting in predictors that serve a wide range of downstream tasks. We demonstrate our approach on two use cases of common decision making tasks and their utility functions, in the context of autonomous driving and parallel autonomy. Experiment results show that our predictor produces accurate predictions that improve the task performance by a large margin in both tasks when compared to task-agnostic baselines on the Waymo Open Motion dataset.
    Coherent Probabilistic Aggregate Queries on Long-horizon Forecasts. (arXiv:2111.03394v2 [cs.LG] UPDATED)
    Long-range forecasts are the starting point of many decision support systems that need to draw inference from high-level aggregate patterns on forecasted values. State-of-the-art time-series forecasting methods are either subject to concept drift on long-horizon forecasts, or fail to accurately predict coherent and accurate high-level aggregates. In this work, we present a novel probabilistic forecasting method that produces forecasts that are coherent in terms of base level and predicted aggregate statistics. We achieve the coherency between predicted base-level and aggregate statistics using a novel inference method based on KL-divergence that can be solved efficiently in closed form. We show that our method improves forecast performance across both base level and unseen aggregates post inference on real datasets spanning three diverse domains. (\href{https://github.com/pratham16cse/AggForecaster}{Project URL})
    Machine Learning Construction: implications to cybersecurity. (arXiv:1906.10019v3 [cs.LG] UPDATED)
    Statistical learning is the process of estimating an unknown probabilistic input-output relationship of a system using a limited number of observations; a statistical learning machine (SLM) is the algorithm, function, model, or rule that learns such a process; and machine learning (ML) is the conventional name of this field. ML and its applications are ubiquitous in the modern world. Cyberphysical systems such as automatic target recognition (ATR) in military applications, computer aided diagnosis (CAD) in medical imaging, DNA microarrays in genomics, optical character recognition (OCR), speech recognition (SR), spam email filtering, and stock market prediction are a few examples and applications of ML: diverse fields, but one theory. In particular, ML has gained a lot of attention in the field of cyberphysical security, especially in the last decade. It is of great importance to this field to design detection algorithms that have the capability of learning from security data to be able to hunt threats, achieve better monitoring, master the complexity of the threat intelligence feeds, and achieve timely remediation of security incidents. The field of ML can be decomposed into two basic subfields: \textit{construction} and \textit{assessment}. By \textit{construction} we mean designing or inventing an appropriate algorithm that learns from the input data and achieves good performance according to some optimality criterion. By \textit{assessment} we mean attributing some performance measures to the constructed ML algorithm, along with their estimators, to objectively assess this algorithm.
    MixR: Data Mixing Augmentation for Regression. (arXiv:2106.03374v3 [cs.LG] UPDATED)
    Data augmentation is becoming essential for improving regression accuracy in critical applications including manufacturing, climate prediction, and finance. Existing techniques for data augmentation largely focus on classification tasks and do not readily apply to regression tasks. In particular, the recent Mixup techniques for classification have succeeded in improving the model performance, which is reasonable due to the characteristics of the classification task, but has limitations in regression. We show that mixing examples that have large data distances using linear interpolations may have increasingly-negative effects on model performance. Hence, we use the stricter assumption that linearity only holds within certain data distances for regression where the degree may vary by each example. We then propose MixR, a data augmentation framework for regression that learns for each example how many nearest neighbors it should be mixed with for the best model performance using a validation set. Our experiments conducted both on synthetic and real datasets show that MixR significantly outperforms state-of-the-art data augmentation baselines applicable to regression. MixR can also be integrated with existing Mixup techniques to significantly improve their performances.
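    The distance-restricted mixing idea can be sketched as follows (our illustration; MixR learns the number of neighbors per example on a validation set, whereas this toy version fixes k globally):

        import numpy as np

        def knn_mixup(X, y, k, rng):
            # Mix each example only with one of its k nearest neighbors, so the
            # linear-interpolation assumption is used only at small distances.
            d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            np.fill_diagonal(d, np.inf)
            nbrs = np.argsort(d, axis=1)[:, :k]
            j = nbrs[np.arange(len(X)), rng.integers(0, k, len(X))]
            lam = rng.beta(2.0, 2.0, size=(len(X), 1))
            X_mix = lam * X + (1 - lam) * X[j]
            y_mix = lam[:, 0] * y + (1 - lam[:, 0]) * y[j]
            return X_mix, y_mix

        rng = np.random.default_rng(0)
        X, y = rng.normal(size=(100, 8)), rng.normal(size=100)
        X_aug, y_aug = knn_mixup(X, y, k=5, rng=rng)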
    Subspace clustering in high-dimensions: Phase transitions \& Statistical-to-Computational gap. (arXiv:2205.13527v1 [stat.ML])
    A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension, are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. We identify in particular the existence of a statistical-to-computational gap between algorithms that require a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.
    Verifying Learning-Based Robotic Navigation Systems. (arXiv:2205.13536v1 [cs.RO])
    Deep reinforcement learning (DRL) has become a dominant deep-learning paradigm for various tasks in which complex policies are learned within reactive systems. In parallel, there has recently been significant research on verifying deep neural networks. However, to date, there has been little work demonstrating the use of modern verification tools on real, DRL-controlled systems. In this case-study paper, we attempt to begin bridging this gap, and focus on the important task of mapless robotic navigation -- a classic robotics problem, in which a robot, usually controlled by a DRL agent, needs to efficiently and safely navigate through an unknown arena towards a desired target. We demonstrate how modern verification engines can be used for effective model selection, i.e., the process of selecting the best available policy for the robot in question from a pool of candidate policies. Specifically, we use verification to detect and rule out policies that may demonstrate suboptimal behavior, such as collisions and infinite loops. We also apply verification to identify models with overly conservative behavior, thus allowing users to choose superior policies that are better at finding an optimal, shorter path to a target. To validate our work, we conducted extensive experiments on an actual robot, and confirmed that the suboptimal policies detected by our method were indeed flawed. We also compared our verification-driven approach to state-of-the-art gradient attacks, and our results demonstrate that gradient-based methods are inadequate in this setting. Our work is the first to demonstrate the use of DNN verification backends for recognizing suboptimal DRL policies in real-world robots, and for filtering out unwanted policies. We believe that the methods presented in this work can be applied to a large range of application domains that incorporate deep-learning-based agents.
    Steerable 3D Spherical Neurons. (arXiv:2106.13863v6 [cs.CV] UPDATED)
    Emerging from low-level vision theory, steerable filters found their counterpart in prior work on steerable convolutional neural networks equivariant to rigid transformations. In our work, we propose a steerable feed-forward learning-based approach that consists of neurons with spherical decision surfaces and operates on point clouds. Such spherical neurons are obtained by conformal embedding of Euclidean space and have recently been revisited in the context of learning representations of point sets. Focusing on 3D geometry, we exploit the isometry property of spherical neurons and derive a 3D steerability constraint. After training spherical neurons to classify point clouds in a canonical orientation, we use a tetrahedron basis to quadruplicate the neurons and construct rotation-equivariant spherical filter banks. We then apply the derived constraint to interpolate the filter bank outputs and, thus, obtain a rotation-invariant network. Finally, we use a synthetic point set and real-world 3D skeleton data to verify our theoretical findings.
    Ranking the information content of distance measures. (arXiv:2104.15079v2 [stat.ML] UPDATED)
    Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Using the fewest features but still retaining sufficient information about the system is crucial in many statistical learning approaches, particularly when data are sparse. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This in turn allows finding the most informative distance measure out of a pool of candidates. The approach is applied to find the most relevant policy variables for controlling the Covid-19 epidemic and to find compact yet informative representations of atomic structures, but its potential applications are wide ranging in many branches of science.
    Mitigating barren plateaus of variational quantum eigensolvers. (arXiv:2205.13539v1 [quant-ph])
    Variational quantum algorithms (VQAs) are expected to establish valuable applications on near-term quantum computers. However, recent works have pointed out that the performance of VQAs greatly relies on the capability of the ansatzes and is seriously limited by optimization issues such as barren plateaus (i.e., vanishing gradients). This work proposes the state efficient ansatz (SEA) for accurate quantum dynamics simulations with improved trainability. First, we show that SEA can generate an arbitrary pure state with many fewer parameters than a universal ansatz, making it efficient for tasks like ground state estimation. It also offers flexibility in adjusting the entanglement of the prepared state, which can further improve the efficiency of simulating weakly entangled states. Second, we show that SEA is not a unitary 2-design even if it has universal wavefunction expressibility and thus has great potential to improve trainability by avoiding the zone of barren plateaus. We further investigate a plethora of examples in ground state estimation and notably obtain significant improvements in the variances of derivatives and the overall optimization behaviors. This result indicates that SEA can mitigate barren plateaus by sacrificing the redundant expressibility for the target problem.
    Epistemic Neural Networks. (arXiv:2107.08924v3 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of \textit{joint} predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the \textit{epistemic neural network} (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the \textit{epinet}: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
    TrustyAI Explainability Toolkit. (arXiv:2104.12717v2 [cs.AI] UPDATED)
    Artificial intelligence (AI) is becoming increasingly more popular and can be found in workplaces and homes around the world. The decisions made by such "black box" systems are often opaque; that is, so complex as to be functionally impossible to understand. How do we ensure that these systems are behaving as desired? TrustyAI is an initiative which looks into explainable artificial intelligence (XAI) solutions to address this issue of explainability in the context of both AI models and decision services. This paper presents the TrustyAI Explainability Toolkit, a Java and Python library that provides XAI explanations of decision services and predictive models for both enterprise and data science use-cases. We describe the TrustyAI implementations and extensions to techniques such as LIME, SHAP and counterfactuals, which are benchmarked against existing implementations in a variety of experiments.
    Machine Learning Assessment: implications to cybersecurity. (arXiv:1907.12851v4 [stat.ML] UPDATED)
    This chapter is dedicated to the assessment and performance estimation of machine learning (ML) algorithms, a topic that is equally important to the construction of these algorithms, in particular in the context of cyberphysical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., jackknife, bootstrap and cross validation (CV). Special statistics of great interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge about the probability distribution of the data or the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods that can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive, because they rely on repeating the training and testing of a ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to the traditional ML algorithms rather than the very computationally demanding approaches of the recent deep neural networks (DNN). In the field of cyberphysical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to the DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence, the relevance of this chapter to this field.
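    As a small illustration of the resampling methods the chapter reviews, the sketch below (ours) computes a bootstrap estimate of the AUC by repeatedly training on resampled data and evaluating on the out-of-bag examples:

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 5))
        y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

        aucs = []
        for _ in range(200):                            # bootstrap iterations
            idx = rng.integers(0, len(X), len(X))       # resample with replacement
            oob = np.setdiff1d(np.arange(len(X)), idx)  # out-of-bag test set
            clf = LogisticRegression().fit(X[idx], y[idx])
            if len(np.unique(y[oob])) == 2:             # need both classes for AUC
                aucs.append(roc_auc_score(y[oob], clf.predict_proba(X[oob])[:, 1]))
        print(f"bootstrap AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")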
    TempoRL: Temporal Priors for Exploration in Off-Policy Reinforcement Learning. (arXiv:2205.13528v1 [cs.LG])
    Efficient exploration is a crucial challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data in order to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as correlation between actions and directedness. Therefore, we introduce state-independent temporal priors, which directly model temporal consistency in demonstrated trajectories, and are capable of driving exploration in complex tasks, even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning by dynamically sampling actions from a probabilistic mixture of policy and action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous control tasks under sparse reward settings.
    Spherical Message Passing for 3D Graph Networks. (arXiv:2102.05013v3 [cs.LG] UPDATED)
    We consider representation learning from 3D graphs in which each node is associated with a spatial position in 3D. This is an underexplored area of research, and a principled framework is currently lacking. In this work, we propose a generic framework, known as the 3D graph network (3DGN), to provide a unified interface at different levels of granularity for 3D graphs. Built on 3DGN, we propose the spherical message passing (SMP) as a novel and specific scheme for realizing the 3DGN framework in the spherical coordinate system (SCS). We conduct formal analyses and show that the relative location of each node in 3D graphs is uniquely defined in the SMP scheme. Thus, our SMP represents a complete and accurate architecture for learning from 3D graphs in the SCS. We derive physically-based representations of geometric information and propose the SphereNet for learning representations of 3D graphs. We show that existing 3D deep models can be viewed as special cases of the SphereNet. Experimental results demonstrate that the use of complete and accurate 3D information in 3DGN and SphereNet leads to significant performance improvements in prediction tasks.
    Selective Classification Via Neural Network Training Dynamics. (arXiv:2205.13532v1 [cs.LG])
    Selective classification is the task of rejecting inputs on which a model would predict incorrectly, trading off input space coverage against model accuracy. Current methods for selective classification impose constraints on either the model architecture or the loss function; this inhibits their usage in practice. In contrast to prior work, we show that state-of-the-art selective classification performance can be attained solely from studying the (discretized) training dynamics of a model. We propose a general framework that, for a given test input, monitors metrics capturing the disagreement with the final predicted label over intermediate models obtained during training; we then reject data points exhibiting too much disagreement at late stages in training. In particular, we instantiate a method that tracks when the label predicted during training stops disagreeing with the final predicted label. Our experimental evaluation shows that our method achieves state-of-the-art accuracy/coverage trade-offs on typical selective classification benchmarks. For example, we improve coverage on CIFAR-10/SVHN by 10.1%/1.5% respectively at a fixed target error of 0.5%.
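    One way to instantiate the disagreement criterion (our sketch with hypothetical names; the paper's framework is more general) is to accept a test input only if its predicted label has been stable over the last few training checkpoints:

        import numpy as np

        def selective_predict(checkpoint_preds, patience):
            # checkpoint_preds: (n_checkpoints, n_examples) labels predicted by
            # intermediate models saved during training. Accept an example only
            # if its prediction stopped changing for the last `patience`
            # checkpoints; late-stage disagreement triggers rejection.
            final = checkpoint_preds[-1]
            stable = (checkpoint_preds[-patience:] == final).all(axis=0)
            return final, stable                      # predictions, accept mask

        preds = np.random.randint(0, 10, size=(20, 1000))  # stand-in history
        preds[-5:, :600] = preds[-1, :600]                 # make some examples stable
        labels, accept = selective_predict(preds, patience=5)
        print("coverage:", accept.mean())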
    Revealing the Dark Secrets of Masked Image Modeling. (arXiv:2205.13543v1 [cs.CV])
    Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. This may be the reason why MIM helps Vision Transformers, which have a very large receptive field, to optimize. Using MIM, the model can maintain a large diversity of attention heads in all layers. For supervised models, in contrast, the diversity of attention heads almost disappears in the last three layers, and this reduced diversity harms fine-tuning performance. From the experiments, we find that MIM models can perform significantly better on geometric and motion tasks with weak semantics or fine-grained classification tasks than their supervised counterparts. Without bells and whistles, a standard MIM pre-trained SwinV2-L can achieve state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). For the semantic understanding datasets where the categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction.
    Memory AMP. (arXiv:2012.10861v6 [cs.IT] UPDATED)
    Approximate message passing (AMP) is a low-cost iterative parameter-estimation technique for certain high-dimensional linear systems with non-Gaussian distributions. However, AMP only applies to independent identically distributed (IID) transform matrices and may become unreliable (e.g., perform poorly or even diverge) for other matrix ensembles, especially ill-conditioned ones. Orthogonal/vector AMP (OAMP/VAMP) was proposed for general right-unitarily-invariant matrices to handle this difficulty. However, the Bayes-optimal OAMP/VAMP (BO-OAMP/VAMP) requires a high-complexity linear minimum mean square error (MMSE) estimator. This limits the application of OAMP/VAMP to large-scale systems. To overcome the disadvantages of AMP and BO-OAMP/VAMP, this paper proposes a memory AMP (MAMP) framework under an orthogonality principle, which guarantees the asymptotic IID Gaussianity of estimation errors in MAMP. We present an orthogonalization procedure for the local memory estimators to realize the required orthogonality for MAMP. Furthermore, we propose a Bayes-optimal MAMP (BO-MAMP), in which a long-memory matched filter is proposed for interference suppression. The complexity of BO-MAMP is comparable to AMP. A state evolution is derived to asymptotically characterize the performance of BO-MAMP. Based on state evolution, the relaxation parameters and damping vector in BO-MAMP are optimized. For all right-unitarily-invariant matrices, the state evolution of the optimized BO-MAMP converges to the same fixed point as that of the high-complexity BO-OAMP/VAMP and is Bayes-optimal if its state evolution has a unique fixed point. Finally, simulations are provided to verify the validity and accuracy of the theoretical results.
    Semantic Parsing of Interpage Relations. (arXiv:2205.13530v1 [cs.LG])
    Page-level analysis of documents has been a topic of interest in digitization efforts, and multimodal approaches have been applied to both classification and page stream segmentation. In this work, we focus on capturing finer semantic relations between pages of a multi-page document. To this end, we formalize the task as semantic parsing of interpage relations and we propose an end-to-end approach for interpage dependency extraction, inspired by the dependency parsing literature. We further design a multi-task training approach to jointly optimize for page embeddings to be used in segmentation, classification, and parsing of the page dependencies using textual and visual features extracted from the pages. Moreover, we also combine the features from two modalities to obtain multimodal page embeddings. To the best of our knowledge, this is the first study to extract rich semantic interpage relations from multi-page documents. Our experimental results show that the proposed method increased LAS by 41 percentage points for semantic parsing, increased accuracy by 33 percentage points for page stream segmentation, and 45 percentage points for page classification over a naive baseline.
    Transfer learning driven design optimization for inertial confinement fusion. (arXiv:2205.13519v1 [physics.plasm-ph])
    Transfer learning is a promising approach to creating predictive models that incorporate simulation and experimental data into a common framework. In this technique, a neural network is first trained on a large database of simulations, then partially retrained on sparse sets of experimental data to adjust predictions to be more consistent with reality. Previously, this technique has been used to create predictive models of Omega and NIF inertial confinement fusion (ICF) experiments that are more accurate than simulations alone. In this work, we conduct a transfer learning driven hypothetical ICF campaign in which the goal is to maximize experimental neutron yield via Bayesian optimization. The transfer learning model achieves yields within 5% of the maximum achievable yield in a modest-sized design space in fewer than 20 experiments. Furthermore, we demonstrate that this method is more efficient at optimizing designs than traditional model calibration techniques commonly employed in ICF design. Such an approach to ICF design could enable robust optimization of experimental performance under uncertainty.
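    The freeze-and-retrain recipe behind such transfer learning can be sketched as follows (a toy version of ours with stand-in data; the actual surrogate models and the Bayesian optimization loop are not shown):

        import torch
        import torch.nn as nn

        net = nn.Sequential(nn.Linear(9, 64), nn.ReLU(),
                            nn.Linear(64, 64), nn.ReLU(),
                            nn.Linear(64, 1))        # e.g., predicts yield

        def fit(model, X, y, params, steps=500, lr=1e-3):
            opt = torch.optim.Adam(params, lr=lr)
            for _ in range(steps):
                opt.zero_grad()
                loss = nn.functional.mse_loss(model(X), y)
                loss.backward()
                opt.step()

        X_sim, y_sim = torch.randn(10_000, 9), torch.randn(10_000, 1)  # stand-ins
        X_exp, y_exp = torch.randn(15, 9), torch.randn(15, 1)          # sparse data

        fit(net, X_sim, y_sim, net.parameters())      # 1) train on simulations
        for p in net[:-1].parameters():               # 2) freeze all but the head
            p.requires_grad = False
        fit(net, X_exp, y_exp, net[-1].parameters())  # 3) partial retraining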
    Sparse Graph Learning for Spatiotemporal Time Series. (arXiv:2205.13492v1 [cs.LG])
    Outstanding achievements of graph neural networks for spatiotemporal time series prediction show that relational constraints introduce a positive inductive bias into neural forecasting architectures. Often, however, the relational information characterizing the underlying data generating process is unavailable; the practitioner is then left with the problem of inferring from data which relational graph to use in the subsequent processing stages. We propose novel, principled -- yet practical -- probabilistic methods that learn the relational dependencies by modeling distributions over graphs while maximizing, at the same time, end-to-end the forecasting accuracy. Our novel graph learning approach, based on consolidated variance reduction techniques for Monte Carlo score-based gradient estimation, is theoretically grounded and effective. We show that tailoring the gradient estimators to the graph learning problem also allows us to achieve state-of-the-art forecasting performance while controlling, at the same time, both the sparsity of the learned graph and the computational burden. We empirically assess the effectiveness of the proposed method on synthetic and real-world benchmarks, showing that the proposed solution can be used as a stand-alone graph identification procedure as well as a learned component of an end-to-end forecasting architecture.
    AutoTSG: Learning and Synthesis for Incident Troubleshooting. (arXiv:2205.13457v1 [cs.SE])
    Incident management is a key aspect of operating large-scale cloud services. To aid faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute the necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human errors. In this work, we conduct a large-scale empirical study of over 4K TSGs mapped to thousands of incidents and find that TSGs are widely used and help significantly reduce mitigation efforts. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG -- a novel framework for automating TSGs into executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows its effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.
    Pick up the PACE: Fast and Simple Domain Adaptation via Ensemble Pseudo-Labeling. (arXiv:2205.13508v1 [cs.LG])
    Domain Adaptation (DA) has received widespread attention from deep learning researchers in recent years because of its potential to improve test accuracy with out-of-distribution labeled data. Most state-of-the-art DA algorithms require an extensive amount of hyperparameter tuning and are computationally intensive due to the large batch sizes required. In this work, we propose a fast and simple DA method consisting of three stages: (1) domain alignment by covariance matching, (2) pseudo-labeling, and (3) ensembling. We call this method $\textbf{PACE}$, for $\textbf{P}$seudo-labels, $\textbf{A}$lignment of $\textbf{C}$ovariances, and $\textbf{E}$nsembles. PACE is trained on top of fixed features extracted from an ensemble of modern pretrained backbones. PACE exceeds previous state-of-the-art by $\textbf{5 - 10 \%}$ on most benchmark adaptation tasks without training a neural network. PACE reduces training time and hyperparameter tuning time by $82\%$ and $97\%$, respectively, when compared to state-of-the-art DA methods. Code is released here: https://github.com/Chris210634/PACE-Domain-Adaptation
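    The abstract specifies the alignment stage only as "covariance matching"; a standard CORAL-style whitening and re-coloring, shown below, is one plausible reading and is used here purely as an illustration of stage (1).

    ```python
    import numpy as np

    def match_covariance(source, target, eps=1e-5):
        """Sketch of covariance-matching alignment: whiten the source
        features, then re-color them with the target covariance so the
        second-order statistics of the two domains agree (CORAL-style)."""
        def cov_sqrt(x, inverse=False):
            c = np.cov(x, rowvar=False) + eps * np.eye(x.shape[1])
            vals, vecs = np.linalg.eigh(c)  # eigenvalues >= eps > 0
            power = -0.5 if inverse else 0.5
            return vecs @ np.diag(vals ** power) @ vecs.T

        src = source - source.mean(axis=0)
        src = src @ cov_sqrt(source, inverse=True)  # whiten source
        src = src @ cov_sqrt(target)                # re-color with target stats
        return src + target.mean(axis=0)
    ```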
    Training ReLU networks to high uniform accuracy is intractable. (arXiv:2205.13531v1 [cs.LG])
    Statistical learning theory provides bounds on the number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results.
    Learning to Reconstruct Missing Data from Spatiotemporal Graphs with Sparse Observations. (arXiv:2205.13479v1 [cs.LG])
    Modeling multivariate time series as temporal signals over a (possibly dynamic) graph is an effective representational framework that allows for developing models for time series analysis. In fact, discrete sequences of graphs can be processed by autoregressive graph neural networks to recursively learn representations at each discrete point in time and space. Spatiotemporal graphs are often highly sparse, with time series characterized by multiple, concurrent, and even long sequences of missing data, e.g., due to the unreliable underlying sensor network. In this context, autoregressive models can be brittle and exhibit unstable learning dynamics. The objective of this paper is, then, to tackle the problem of learning effective models to reconstruct, i.e., impute, missing data points by conditioning the reconstruction only on the available observations. In particular, we propose a novel class of attention-based architectures that, given a set of highly sparse discrete observations, learn a representation for points in time and space by exploiting a spatiotemporal diffusion architecture aligned with the imputation task. Representations are trained end-to-end to reconstruct observations w.r.t. the corresponding sensor and its neighboring nodes. Compared to the state of the art, our model handles sparse data without propagating prediction errors or requiring a bidirectional model to encode forward and backward time dependencies. Empirical results on representative benchmarks show the effectiveness of the proposed method.
    Censored Quantile Regression Neural Networks. (arXiv:2205.13496v1 [stat.ML])
    This paper considers quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
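    The uncensored building block of this approach -- the pinball (quantile) loss evaluated simultaneously over a grid of quantiles output by one network -- is easy to sketch; the paper's actual algorithm additionally handles censoring via its EM-style interpretation, which is omitted here.

    ```python
    import torch

    def pinball_loss_grid(preds, y, quantiles):
        """Simultaneous pinball loss for a grid of quantiles produced by a
        single network. preds: (batch, len(quantiles)); y: (batch,)."""
        q = torch.as_tensor(quantiles).view(1, -1)
        err = y.view(-1, 1) - preds
        # For each quantile q: q*err if err >= 0, else (q-1)*err.
        return torch.mean(torch.maximum(q * err, (q - 1) * err))
    ```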
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v1 [q-bio.NC])
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.
    FedAug: Reducing the Local Learning Bias Improves Federated Learning on Heterogeneous Data. (arXiv:2205.13462v1 [cs.LG])
    Federated Learning (FL) is a machine learning paradigm that learns from data kept locally to safeguard the privacy of clients, whereas local SGD is typically employed on the clients' devices to improve communication efficiency. However, such a scheme is currently constrained by the slow and unstable convergence induced by clients' heterogeneous data. In this work, we identify three under-explored phenomena of the biased local learning that may explain these challenges caused by local updates in supervised FL. As a remedy, we propose FedAug, a novel unified algorithm that reduces the local learning bias on features and classifiers to tackle these challenges. FedAug consists of two components: AugMean and AugCA. AugMean alleviates the bias in the local classifiers by balancing the output distribution of models. AugCA learns client-invariant features that are close to global features but considerably distinct from those learned from other input distributions. In a series of experiments, we show that FedAug consistently outperforms other SOTA FL and domain generalization (DG) baselines, in which both components (i.e., AugMean and AugCA) contribute individual performance gains.
    SigMaNet: One Laplacian to Rule Them All. (arXiv:2205.13459v1 [cs.LG])
    This paper introduces SigMaNet, a generalized Graph Convolutional Network (GCN) capable of handling both undirected and directed graphs with weights not restricted in sign and magnitude. The cornerstone of SigMaNet is the introduction of a generalized Laplacian matrix: the Sign-Magnetic Laplacian ($L^\sigma$). The adoption of such a matrix allows us to bridge a gap in the current literature by extending the theory of spectral GCNs to directed graphs with both positive and negative weights. $L^{\sigma}$ exhibits several desirable properties not enjoyed by the traditional Laplacian matrices on which several state-of-the-art architectures are based. In particular, $L^\sigma$ is completely parameter-free, which is not the case for Laplacian operators such as the Magnetic Laplacian $L^{(q)}$, where the calibration of the parameter $q$ is an essential yet problematic component of the operator. $L^\sigma$ simplifies the approach, while also allowing for a natural interpretation of the signs of the edges in terms of their directions. The versatility of the proposed approach is amply demonstrated experimentally; the proposed network SigMaNet turns out to be competitive in all the tasks we considered, regardless of the graph structure.
    SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation. (arXiv:2205.13490v1 [cs.CV])
    Conventional point cloud semantic segmentation methods usually employ an encoder-decoder architecture, where mid-level features are locally aggregated to extract geometric information. However, the over-reliance on these class-agnostic local geometric representations may raise confusion between local parts from different categories that are similar in appearance or spatially adjacent. To address this issue, we argue that mid-level features can be further enhanced with semantic information, and propose semantic-affine transformation that transforms features of mid-level points belonging to different categories with class-specific affine parameters. Based on this technique, we propose SemAffiNet for point cloud semantic segmentation, which utilizes the attention mechanism in the Transformer module to implicitly and explicitly capture global structural knowledge within local parts for overall comprehension of each category. We conduct extensive experiments on the ScanNetV2 and NYUv2 datasets, and evaluate semantic-affine transformation on various 3D point cloud and 2D image segmentation baselines, where both qualitative and quantitative results demonstrate the superiority and generalization ability of our proposed approach. Code is available at https://github.com/wangzy22/SemAffiNet.
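    One plausible reading of the semantic-affine transformation -- class-specific scale and shift parameters applied to mid-level point features -- can be sketched as follows; the module name and the use of predicted labels as the class signal are assumptions made for illustration.

    ```python
    import torch
    import torch.nn as nn

    class SemanticAffine(nn.Module):
        """Illustrative sketch: transform mid-level point features with
        class-specific affine parameters (a scale and a shift per class)."""
        def __init__(self, num_classes, channels):
            super().__init__()
            self.scale = nn.Embedding(num_classes, channels)
            self.shift = nn.Embedding(num_classes, channels)
            nn.init.ones_(self.scale.weight)   # start as identity transform
            nn.init.zeros_(self.shift.weight)

        def forward(self, feats, class_ids):
            # feats: (num_points, channels); class_ids: (num_points,) long
            # tensor of (predicted) per-point semantic labels.
            return feats * self.scale(class_ids) + self.shift(class_ids)
    ```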
    An Analytic Framework for Robust Training of Artificial Neural Networks. (arXiv:2205.13502v1 [cs.LG])
    The reliability of a learning model is key to the successful deployment of machine learning in various industries. Creating a robust model, particularly one unaffected by adversarial attacks, requires a comprehensive understanding of the adversarial examples phenomenon. However, it is difficult to describe the phenomenon due to the complicated nature of the problems in machine learning. Consequently, many studies investigate the phenomenon by proposing a simplified model of how adversarial examples occur and validate it by predicting some aspect of the phenomenon. While these studies cover many different characteristics of the adversarial examples, they have not reached a holistic approach to the geometric and analytic modeling of the phenomenon. This paper proposes a formal framework to study the phenomenon in learning theory and makes use of complex analysis and holomorphicity to offer a robust learning rule for artificial neural networks. With the help of complex analysis, we can effortlessly move between geometric and analytic perspectives of the phenomenon and offer further insights on the phenomenon by revealing its connection with harmonic functions. Using our model, we can explain some of the most intriguing characteristics of adversarial examples, including transferability of adversarial examples, and pave the way for novel approaches to mitigate the effects of the phenomenon.
    Mutual Information Divergence: A Unified Metric for Multimodal Generative Models. (arXiv:2205.13445v1 [cs.CV])
    Text-to-image generation and image captioning have recently emerged as a new experimental paradigm to assess machine intelligence. They predict continuous quantities accompanied by their sampling techniques in the generation, making evaluation complicated and marginal distributions intractable to obtain. Based on a recent trend that multimodal generative evaluations exploit a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using the CLIP features as a unified metric, coined Mutual Information Divergence (MID). To validate it, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model. We look forward to seeing the underrepresented implications of the Gaussian cross-mutual information in multimodal representation learning and future work based on this novel proposition.
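    For jointly Gaussian variables, mutual information has a closed form in terms of covariance determinants, which is the quantity MID builds on. The sketch below fits Gaussians to paired CLIP features and evaluates that closed form; the exact cross-variant used by MID may differ in detail, so treat this as an assumption-laden approximation.

    ```python
    import numpy as np

    def gaussian_mutual_information(x, y, eps=1e-6):
        """Mutual information of a Gaussian fitted to paired features
        x, y (each shape (n, d)): I = 0.5*(log det Cx + log det Cy
        - log det C_joint). Requires n comfortably larger than 2d."""
        d = x.shape[1]
        joint = np.cov(np.hstack([x, y]), rowvar=False) + eps * np.eye(2 * d)
        cx, cy = joint[:d, :d], joint[d:, d:]
        return 0.5 * (np.linalg.slogdet(cx)[1] + np.linalg.slogdet(cy)[1]
                      - np.linalg.slogdet(joint)[1])
    ```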
    DeepJoint: Robust Survival Modelling Under Clinical Presence Shift. (arXiv:2205.13481v1 [cs.LG])
    Observational data in medicine arise as a result of the complex interaction between patients and the healthcare system. The sampling process is often highly irregular and itself constitutes an informative process. When using such data to develop prediction models, this phenomenon is often ignored, leading to sub-optimal performance and generalisability of models when practices evolve. We propose a multi-task recurrent neural network which models three clinical presence dimensions -- namely the longitudinal, the inter-observation and the missingness processes -- in parallel to the survival outcome. On a prediction task using MIMIC III laboratory tests, explicit modelling of these three processes showed improved performance in comparison to state-of-the-art predictive models (C-index at 1 day horizon: 0.878). More importantly, the proposed approach was more robust to change in the clinical presence setting, demonstrated by performance comparison between patients admitted on weekdays and weekends. This analysis demonstrates the importance of studying and leveraging clinical presence to improve performance and create more transportable clinical models.
    Machine Learning Models Are Not Necessarily Biased When Constructed Properly: Evidence from Neuroimaging Studies. (arXiv:2205.13421v1 [cs.LG])
    Despite the great promise that machine learning has offered in many fields of medicine, it has also raised concerns about potential biases and poor generalization across genders, age distributions, races and ethnicities, hospitals, and data acquisition equipment and protocols. In the current study, and in the context of three brain diseases, we provide experimental data which support that, when properly trained, machine learning models can generalize well across diverse conditions and do not suffer from biases. Specifically, by using multi-study magnetic resonance imaging consortia for diagnosing Alzheimer's disease, schizophrenia, and autism spectrum disorder, we find that the accuracy of well-trained models is consistent across different subgroups pertaining to attributes such as gender, age, and racial groups, as well as across different clinical studies. We find that models that incorporate multi-source data from demographic, clinical, and genetic factors and cognitive scores are also unbiased. In some cases, these models have better predictive accuracy across subgroups than those trained only with structural measures, but there are also situations in which these additional features do not help.
    A Fair Federated Learning Framework With Reinforcement Learning. (arXiv:2205.13415v1 [cs.LG])
    Federated learning (FL) is a paradigm where many clients collaboratively train a model under the coordination of a central server, while keeping the training data locally stored. However, heterogeneous data distributions over different clients remain a challenge to mainstream FL algorithms, which may cause slow convergence, overall performance degradation and unfairness of performance across clients. To address these problems, in this study we propose a reinforcement learning framework, called PG-FFL, which automatically learns a policy to assign aggregation weights to clients. Additionally, we propose to utilize the Gini coefficient as the measure of fairness for FL. More importantly, we apply the Gini coefficient and validation accuracy of clients in each communication round to construct a reward function for reinforcement learning. Our PG-FFL is also compatible with many existing FL algorithms. We conduct extensive experiments over diverse datasets to verify the effectiveness of our framework. The experimental results show that our framework can outperform baseline methods in terms of overall performance, fairness and convergence speed.
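    The Gini coefficient used as the fairness measure is straightforward to compute from per-client validation accuracies; the reward combination in the trailing comment is an illustrative assumption, not the paper's exact reward function.

    ```python
    import numpy as np

    def gini(values):
        """Gini coefficient of per-client validation accuracies: 0 means
        perfectly equal performance across clients; larger values indicate
        more unfairness."""
        v = np.sort(np.asarray(values, dtype=float))
        n = len(v)
        index = np.arange(1, n + 1)
        return (2 * np.sum(index * v) / (n * np.sum(v))) - (n + 1) / n

    # A reward could, for instance, trade off mean accuracy and fairness:
    # reward = accuracies.mean() - lambda_ * gini(accuracies)   # hypothetical
    ```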
    Avoiding Barren Plateaus with Classical Deep Neural Networks. (arXiv:2205.13418v1 [quant-ph])
    Variational quantum algorithms (VQAs) are among the most promising algorithms in the era of Noisy Intermediate Scale Quantum Devices. The VQAs are applied to a variety of tasks, such as in chemistry simulations, optimization problems, and quantum neural networks. Such algorithms are constructed using a parameterization $U(\pmb{\theta})$ with a classical optimizer that updates the parameters $\pmb{\theta}$ in order to minimize a cost function $C$. For this task, the gradient descent method, or one of its variants, is generally used: the circuit parameters are updated iteratively using the cost function gradient. However, several works in the literature have shown that this method suffers from a phenomenon known as Barren Plateaus (BP). This phenomenon is characterized by an exponential flattening of the cost function landscape, so that the number of times the function must be evaluated to perform the optimization grows exponentially as the number of qubits and the parameterization depth increase. In this article, we report on how the use of classical neural networks on the VQA input parameters can alleviate the BP phenomenon.
    TransBoost: Improving the Best ImageNet Performance using Deep Transduction. (arXiv:2205.13331v1 [cs.CV])
    This paper deals with deep transductive learning and proposes TransBoost, a procedure for fine-tuning any deep neural model to improve its performance on any (unlabeled) test set provided at training time. TransBoost is inspired by a large margin principle and is efficient and simple to use. The ImageNet classification performance is consistently and significantly improved with TransBoost on many architectures such as ResNets, MobileNetV3-L, EfficientNetB0, ViT-S, and ConvNext-T. Additionally, we show that TransBoost is effective on a wide variety of image classification datasets.
    BppAttack: Stealthy and Efficient Trojan Attacks against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning. (arXiv:2205.13383v1 [cs.CV])
    Deep neural networks are vulnerable to Trojan attacks. Existing attacks use visible patterns (e.g., a patch or image transformations) as triggers, which are vulnerable to human inspection. In this paper, we propose BppAttack, a stealthy and efficient Trojan attack. Based on existing biology literature on human visual systems, we propose to use image quantization and dithering as the Trojan trigger, making imperceptible changes. It is a stealthy and efficient attack without training auxiliary models. Due to the small changes made to images, it is hard to inject such triggers during training. To alleviate this problem, we propose a contrastive learning based approach that leverages adversarial attacks to generate negative sample pairs so that the learned trigger is precise and accurate. The proposed method achieves high attack success rates on four benchmark datasets, including MNIST, CIFAR-10, GTSRB, and CelebA. It also effectively bypasses existing Trojan defenses and human inspection. Our code can be found at https://github.com/RU-System-Software-and-Security/BppAttack.
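    The trigger family described -- color-depth quantization followed by dithering -- can be illustrated with a standard Floyd-Steinberg pass on a grayscale image; the bit depth and grayscale setting below are assumptions, not the paper's exact configuration.

    ```python
    import numpy as np

    def quantize_with_dithering(img, bits=3):
        """Illustrative trigger: reduce color depth with Floyd-Steinberg
        error-diffusion dithering, producing small, hard-to-perceive changes.
        img: (H, W) grayscale array with values in [0, 255]."""
        img = img.astype(np.float32).copy()
        levels = 2 ** bits - 1
        h, w = img.shape
        for y in range(h):
            for x in range(w):
                old = img[y, x]
                new = np.round(old / 255 * levels) / levels * 255
                img[y, x] = new
                err = old - new  # diffuse the quantization error forward
                if x + 1 < w:               img[y, x + 1]     += err * 7 / 16
                if y + 1 < h and x > 0:     img[y + 1, x - 1] += err * 3 / 16
                if y + 1 < h:               img[y + 1, x]     += err * 5 / 16
                if y + 1 < h and x + 1 < w: img[y + 1, x + 1] += err * 1 / 16
        return np.clip(img, 0, 255).astype(np.uint8)
    ```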
    Opinion Spam Detection: A New Approach Using Machine Learning and Network-Based Algorithms. (arXiv:2205.13422v1 [cs.LG])
    E-commerce is the fastest-growing segment of the economy. Online reviews play a crucial role in helping consumers evaluate and compare products and services. As a result, fake reviews (opinion spam) are becoming more prevalent and negatively impacting customers and service providers. There are many reasons why it is hard to identify opinion spammers automatically, including the absence of reliable labeled data. This limitation precludes an off-the-shelf application of a machine learning pipeline. We propose a new method for classifying reviewers as spammers or benign, combining machine learning with a message-passing algorithm that capitalizes on the users' graph structure to compensate for the possible scarcity of labeled data. We devise a new way of sampling the labels for the training step (active learning), replacing the typical uniform sampling. Experiments on three large real-world datasets from Yelp.com show that our method outperforms state-of-the-art active learning approaches and also machine learning methods that use a much larger set of labeled data for training.
    Looking for Out-of-Distribution Environments in Critical Care: A case study with the eICU Database. (arXiv:2205.13398v1 [cs.LG])
    Generalizing to new populations and domains in machine learning is still an open problem which has seen increased interest recently. In particular, clinical models show a significant performance drop when tested in settings not seen during training, e.g., new hospitals or population demographics. Recently proposed models for domain generalisation promise to alleviate this problem by learning invariant characteristics across environments; however, there is still scepticism about whether they improve over traditional training. In this work, we take a principled approach to identifying Out of Distribution (OoD) environments, motivated by the problem of cross-hospital generalization in critical care. We propose model-based and heuristic approaches to identify OoD environments and systematically compare models with different levels of held-out information. In particular, based on the assumption that models with access to OoD data should outperform other models, we train models across a range of experimental setups that include leave-one-hospital-out training and cross-sectional feature splits. We find that access to OoD data does not translate to increased performance, pointing to inherent limitations in defining potential OoD environments in the eICU Database, potentially due to data harmonisation and sampling. Echoing similar results with other popular clinical benchmarks in the literature, our findings suggest that new approaches are required to evaluate robust models in critical care.
    Transfer and Share: Semi-Supervised Learning from Long-Tailed Data. (arXiv:2205.13358v1 [cs.LG])
    Long-Tailed Semi-Supervised Learning (LTSSL) aims to learn from class-imbalanced data where only a few samples are annotated. Existing solutions typically require substantial cost to solve complex optimization problems, or rely on class-balanced undersampling, which can result in information loss. In this paper, we present TRAS (TRAnsfer and Share) to effectively utilize long-tailed semi-supervised data. TRAS transforms the imbalanced pseudo-label distribution of a traditional SSL model via a delicate function to enhance the supervisory signals for minority classes. It then transfers the distribution to a target model such that the minority class will receive significant attention. Interestingly, TRAS shows that a more balanced pseudo-label distribution can substantially benefit minority-class training, instead of seeking to generate accurate pseudo-labels as in previous works. To simplify the approach, TRAS merges the training of the traditional SSL model and the target model into a single procedure by sharing the feature extractor, where both classifiers help improve the representation learning. According to extensive experiments, TRAS delivers much higher accuracy than state-of-the-art methods in the entire set of classes as well as minority classes.
    Feature Forgetting in Continual Representation Learning. (arXiv:2205.13359v1 [cs.LG])
    In continual and lifelong learning, good representation learning can help increase performance and reduce sample complexity when learning new tasks. There is evidence that representations do not suffer from "catastrophic forgetting" even in plain continual learning, but little more is known about their characteristics. In this paper, we aim to gain more understanding about representation learning in continual learning, especially on the feature forgetting problem. We devise a protocol for evaluating representations in continual learning, and then use it to present an overview of the basic trends of continual representation learning, showing its consistent deficiencies and potential issues. To study the feature forgetting problem, we create a synthetic dataset to identify and visualize the prevalence of feature forgetting in neural networks. Finally, we propose a simple technique using gating adapters to mitigate feature forgetting. We conclude by discussing that improving representation learning benefits both old and new tasks in continual learning.
    How Powerful are K-hop Message Passing Graph Neural Networks. (arXiv:2205.13328v1 [cs.LG])
    The most popular design paradigm for Graph Neural Networks (GNNs) is 1-hop message passing -- aggregating features from 1-hop neighbors repeatedly. However, the expressive power of 1-hop message passing is bounded by the Weisfeiler-Lehman (1-WL) test. Recently, researchers extended 1-hop message passing to K-hop message passing by aggregating information from K-hop neighbors of nodes simultaneously. However, there is no work on analyzing the expressive power of K-hop message passing. In this work, we theoretically characterize the expressive power of K-hop message passing. Specifically, we first formally differentiate two kinds of kernels of K-hop message passing which are often misused in previous works. We then characterize the expressive power of K-hop message passing by showing that it is more powerful than 1-hop message passing. Despite the higher expressive power, we show that K-hop message passing still cannot distinguish some simple regular graphs. To further enhance its expressive power, we introduce a KP-GNN framework, which improves K-hop message passing by leveraging the peripheral subgraph information in each hop. We prove that KP-GNN can distinguish almost all regular graphs including some distance regular graphs which could not be distinguished by previous distance encoding methods. Experimental results verify the expressive power and effectiveness of KP-GNN. KP-GNN achieves competitive results across all benchmark datasets.
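    A naive dense version of K-hop message passing with the shortest-path kernel (one of the two kernels the paper distinguishes) can be sketched as follows; real implementations use sparse operations and learned, per-hop feature transformations.

    ```python
    import numpy as np

    def k_hop_aggregate(adj, features, k=3):
        """Dense sketch of K-hop message passing, shortest-path kernel:
        aggregate features from nodes exactly j hops away, for j = 1..k,
        and concatenate the per-hop aggregates with the node's own features."""
        n = adj.shape[0]
        reached = np.eye(n, dtype=bool)   # nodes within j-1 hops (incl. self)
        frontier = np.eye(n, dtype=bool)
        binary_adj = (adj > 0).astype(int)
        per_hop = []
        for _ in range(k):
            nxt = (frontier.astype(int) @ binary_adj) > 0
            frontier = nxt & ~reached      # nodes exactly j hops away
            reached |= frontier
            per_hop.append(frontier.astype(float) @ features)  # sum-aggregate
        return np.concatenate([features] + per_hop, axis=1)
    ```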
    Deep Active Learning with Noise Stability. (arXiv:2205.13340v1 [cs.LG])
    Uncertainty estimation for unlabeled data is crucial to active learning. With a deep neural network employed as the backbone model, the data selection process is highly challenging due to the potential over-confidence of the model inference. Existing methods resort to special learning fashions (e.g., adversarial) or auxiliary models to address this challenge. This tends to result in complex and inefficient pipelines, which would render the methods impractical. In this work, we propose a novel algorithm that leverages noise stability to estimate data uncertainty in a Single-Training Multi-Inference fashion. The key idea is to measure the output deviation from the original observation when the model parameters are randomly perturbed by noise. We provide theoretical analyses by leveraging the small Gaussian noise theory and demonstrate that our method favors a subset with large and diverse gradients. Despite its simplicity, our method outperforms the state-of-the-art active learning baselines in various tasks, including computer vision, natural language processing, and structural data analysis.
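    The Single-Training Multi-Inference procedure is simple to sketch: perturb the trained weights with small Gaussian noise and score each unlabeled point by how far its output moves. The noise scale and trial count below are illustrative assumptions.

    ```python
    import copy
    import torch

    @torch.no_grad()
    def noise_stability_score(model, x, n_trials=10, sigma=1e-3):
        """Score unlabeled inputs by output deviation under random weight
        perturbations; model maps x: (B, ...) to logits of shape (B, C).
        Points with the largest scores are selected for labeling."""
        base = model(x)
        score = torch.zeros(x.shape[0])
        for _ in range(n_trials):
            noisy = copy.deepcopy(model)
            for p in noisy.parameters():
                # Perturbation scaled to each parameter's magnitude.
                p.add_(sigma * p.abs().mean() * torch.randn_like(p))
            score += (noisy(x) - base).norm(dim=-1)
        return score / n_trials
    ```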
    QUIC-FL: Quick Unbiased Compression for Federated Learning. (arXiv:2205.13341v1 [cs.LG])
    Distributed Mean Estimation (DME) is a fundamental building block in communication efficient federated learning. In DME, clients communicate their lossily compressed gradients to the parameter server, which estimates the average and updates the model. State-of-the-art DME techniques apply either unbiased quantization methods, resulting in large estimation errors, or biased quantization methods, where unbiasing the result requires that the server decodes each gradient individually, which markedly slows the aggregation time. In this paper, we propose QUIC-FL, a DME algorithm that achieves the best of all worlds. QUIC-FL is unbiased, offers fast aggregation time, and is competitive with the most accurate (slow aggregation) DME techniques. To achieve this, we formalize the problem in a novel way that allows us to use standard solvers to design near-optimal unbiased quantization schemes.
    Acute Lymphoblastic Leukemia Detection Using Hypercomplex-Valued Convolutional Neural Networks. (arXiv:2205.13273v1 [cs.CV])
    This paper features convolutional neural networks defined on hypercomplex algebras applied to classify lymphocytes in blood smear digital microscopic images. Such classification is helpful for the diagnosis of acute lymphoblastic leukemia (ALL), a type of blood cancer. We perform the classification task using eight hypercomplex-valued convolutional neural networks (HvCNNs) along with real-valued convolutional networks. Our results show that HvCNNs perform better than the real-valued model, showcasing higher accuracy with a much smaller number of parameters. Moreover, we found that HvCNNs based on Clifford algebras processing HSV-encoded images attained the highest observed accuracies. Specifically, our HvCNN yielded an average accuracy rate of 96.6% using the ALL-IDB2 dataset with a 50% train-test split, a value extremely close to the state-of-the-art models but using a much simpler architecture with significantly fewer parameters.
    On the Eigenvalues of Global Covariance Pooling for Fine-grained Visual Recognition. (arXiv:2205.13282v1 [cs.CV])
    The Fine-Grained Visual Categorization (FGVC) is challenging because the subtle inter-class variations are difficult to capture. One notable research line uses the Global Covariance Pooling (GCP) layer to learn powerful representations with second-order statistics, which can effectively model inter-class differences. In our previous conference paper, we show that truncating small eigenvalues of the GCP covariance can attain smoother gradients and improve the performance on large-scale benchmarks. However, on fine-grained datasets, truncating the small eigenvalues would make the model fail to converge. This observation contradicts the common assumption that the small eigenvalues merely correspond to noisy and unimportant information; consequently, ignoring them should have little influence on the performance. To diagnose this peculiar behavior, we propose two attribution methods whose visualizations demonstrate that the seemingly unimportant small eigenvalues are crucial as they are in charge of extracting the discriminative class-specific features. Inspired by this observation, we propose a network branch dedicated to magnifying the importance of small eigenvalues. Without introducing any additional parameters, this branch simply amplifies the small eigenvalues and achieves state-of-the-art performance among GCP methods on three fine-grained benchmarks. Furthermore, the performance is also competitive against other FGVC approaches on larger datasets. Code is available at \href{https://github.com/KingJamesSong/DifferentiableSVD}{https://github.com/KingJamesSong/DifferentiableSVD}.
    Fair Representation Learning through Implicit Path Alignment. (arXiv:2205.13316v1 [cs.LG])
    We consider a fair representation learning perspective, where optimal predictors, on top of the data representation, are ensured to be invariant with respect to different sub-groups. Specifically, we formulate this intuition as a bi-level optimization, where the representation is learned in the outer loop, and invariant optimal group predictors are updated in the inner loop. Moreover, the proposed bi-level objective is demonstrated to fulfill the sufficiency rule, which is desirable in various practical scenarios but has not been commonly studied in fair learning. In addition, to avoid the high computational and memory cost of differentiating in the inner loop of the bi-level objective, we propose an implicit path alignment algorithm, which relies only on the solution of the inner optimization and implicit differentiation rather than the exact optimization path. We further analyze the error gap of the implicit approach and empirically validate the proposed method in both classification and regression settings. Experimental results show a consistently better trade-off between prediction performance and fairness measurement.
    Towards Learning Universal Hyperparameter Optimizers with Transformers. (arXiv:2205.13320v1 [cs.LG])
    Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild. Our extensive experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the way for future extensions for training a Transformer-based model as a general HPO optimizer.
    Investigating classification learning curves for automatically generated and labelled plant images. (arXiv:2205.10955v2 [cs.LG] UPDATED)
    In the context of supervised machine learning, a learning curve describes how a model's performance on unseen data relates to the amount of samples used to train the model. In this paper, we present a dataset of plant images with representatives of crops and weeds common to the Manitoba prairies at different growth stages. We determine the learning curve for a classification task on this data with the ResNet architecture. Our results are in accordance with previous studies and add to the evidence that learning curves are governed by power-law relationships over large scales, applications, and models. We further investigate how label noise and the reduction of trainable parameters impact the learning curve on this dataset. Both effects lead to the model requiring disproportionately larger training sets to achieve the same classification performance as observed without these effects.
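    The power-law relationship referenced here is commonly fitted as $E(n) = a\,n^{-b} + c$, with $c$ the irreducible error floor. A minimal fitting sketch follows; the data points are illustrative placeholders, not the paper's measurements.

    ```python
    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(n, a, b, c):
        """Learning-curve model: test error decays as a power law in
        training-set size n, down to an irreducible floor c."""
        return a * np.power(n, -b) + c

    # Fit to (training-set size, test error) pairs measured for a classifier.
    sizes = np.array([100, 200, 400, 800, 1600, 3200])
    errors = np.array([0.41, 0.33, 0.26, 0.21, 0.17, 0.14])  # illustrative
    (a, b, c), _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.05))
    print(f"error(n) ~= {a:.2f} * n^(-{b:.2f}) + {c:.3f}")
    ```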
    SARS-CoV-2 Result Interpretation based on Image Analysis of Lateral Flow Devices. (arXiv:2205.13311v1 [cs.LG])
    Lateral Flow Devices (LFDs), a widely used gene quantisation technique, are now commonly used to detect the presence of SARS-CoV-2, enabling the control and prevention of the spread of the virus. Depending on the viral load, LFDs have different sensitivities, and self-testing by untrained users presents an additional challenge in interpreting the result. With the evolution of machine learning algorithms, image processing and analysis have seen unprecedented growth. In this interdisciplinary study, we employ novel image analysis methods from the computer vision and machine learning fields to study visual features of the control region of LFDs. Here, we automatically classify any image containing an LFD as positive, negative, or inconclusive. This will reduce the burden on health workers and mitigate perception bias.
    Triangular Contrastive Learning on Molecular Graphs. (arXiv:2205.13279v1 [cs.LG])
    Recent contrastive learning methods have proven effective in various tasks, learning generalizable representations invariant to data augmentation and thereby achieving state-of-the-art performance. Given the multifaceted nature of the large unlabeled data used in self-supervised learning, while the majority of real-world downstream tasks use a single format of data, a multimodal framework that can train one modality to learn diverse perspectives from other modalities is an important challenge. In this paper, we propose TriCL (Triangular Contrastive Learning), a universal framework for trimodal contrastive learning. TriCL takes advantage of Triangular Area Loss, a novel intermodal contrastive loss that learns the angular geometry of the embedding space by simultaneously contrasting the areas of positive and negative triplets. Systematic observation of the embedding space in terms of alignment and uniformity showed that Triangular Area Loss can address the line-collapsing problem by discriminating modalities by angle. Our experimental results also demonstrate that TriCL outperforms baselines on the downstream task of molecular property prediction, which implies that the advantages of the embedding space indeed benefit performance on downstream tasks.
    Gaussian Universality of Linear Classifiers with Random Labels in High-Dimension. (arXiv:2205.13303v1 [stat.ML])
    While classical in many theoretical settings, the assumption of Gaussian i.i.d. inputs is often perceived as a strong limitation in the analysis of high-dimensional learning. In this study, we redeem this line of work in the case of generalized linear classification with random labels. Our main contribution is a rigorous proof that data coming from a range of generative models in high-dimensions have the same minimum training loss as Gaussian data with corresponding data covariance. In particular, our theorem covers data created by an arbitrary mixture of homogeneous Gaussian clouds, as well as multi-modal generative neural networks. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. Finally, we show that this universality property is observed in practice with real datasets and random labels.
    Privacy-Preserving Wavelet Neural Network with Fully Homomorphic Encryption. (arXiv:2205.13265v1 [cs.LG])
    The main aim of Privacy-Preserving Machine Learning (PPML) is to protect privacy and provide security for the data used in building Machine Learning models. There are various techniques in PPML, such as Secure Multi-Party Computation, Differential Privacy, and Homomorphic Encryption (HE). These techniques are combined with various Machine Learning models and even Deep Learning networks to protect data privacy as well as the identity of the user. In this paper, we propose a fully homomorphically encrypted wavelet neural network that protects privacy without compromising the efficiency of the model. We tested the effectiveness of the proposed method on seven datasets taken from the finance and healthcare domains. The results show that our proposed model performs similarly to the unencrypted model.
    SymNMF-Net for The Symmetric NMF Problem. (arXiv:2205.13214v1 [cs.LG])
    Recently, many works have demonstrated that Symmetric Non-negative Matrix Factorization (SymNMF) enjoys great superiority in various clustering tasks. Although the state-of-the-art algorithms for SymNMF perform well on synthetic data, they cannot consistently obtain satisfactory results with desirable properties and may fail on real-world tasks like clustering. Considering the flexibility and strong representation ability of the neural network, in this paper, we propose a neural network called SymNMF-Net for the Symmetric NMF problem to overcome the shortcomings of traditional optimization algorithms. Each block of SymNMF-Net is a differentiable architecture with an inversion layer, a linear layer and ReLU, which are inspired by a traditional update scheme for SymNMF. We show that the inference of each block corresponds to a single iteration of the optimization. Furthermore, we analyze the constraints of the inversion layer to ensure the output stability of the network to a certain extent. Empirical results on real-world datasets demonstrate the superiority of our SymNMF-Net and confirm the sufficiency of our theoretical analysis.
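    The kind of traditional update scheme such blocks unroll can be sketched with a classical multiplicative rule for $\min \|A - HH^T\|_F^2$ subject to $H \ge 0$; the damping factor beta = 0.5 follows a common formulation in the SymNMF literature, and the network's inversion layers are not reproduced here.

    ```python
    import numpy as np

    def symnmf_multiplicative(A, rank, n_iters=200, beta=0.5, eps=1e-10):
        """Classical multiplicative update for SymNMF on a symmetric
        non-negative matrix A: H <- H * ((1 - beta) + beta * (AH)/(H H^T H)).
        Each SymNMF-Net block corresponds to one such iteration (unrolled)."""
        n = A.shape[0]
        H = np.abs(np.random.randn(n, rank)) * 0.1
        for _ in range(n_iters):
            numer = A @ H
            denom = H @ (H.T @ H) + eps
            H *= (1 - beta) + beta * (numer / denom)
            H = np.maximum(H, 0)  # keep the iterate non-negative
        return H
    ```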
    Penalizing Proposals using Classifiers for Semi-Supervised Object Detection. (arXiv:2205.13219v1 [cs.CV])
    Obtaining gold standard annotated data for object detection is often costly, involving human-level effort. Semi-supervised object detection algorithms solve the problem with a small amount of gold-standard labels and a large unlabelled dataset used to generate silver-standard labels. But training on the silver standard labels does not produce good results, because they are machine-generated annotations. In this work, we design a modified loss function to train on large silver standard annotated sets generated by a weak annotator. We include a confidence metric associated with the annotation as an additional term in the loss function, signifying the quality of the annotation. We test the effectiveness of our approach on various test sets and use numerous variations to compare the results with some of the current approaches to object detection. In comparison with the baseline where no confidence metric is used, we achieved a 4\% gain in mAP with 25\% labeled data and 10\% gain in mAP with 50\% labeled data by using the proposed confidence metric.
    Evaluating Multimodal Interactive Agents. (arXiv:2205.13274v1 [cs.LG])
    Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often do not correlate well with interactive evaluation. In this paper, we assess the merits of these existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast, controlled, interpretable, and representative of naturalistic interactions. Altogether, the STS consolidates much of what is desirable across many of our standard evaluation metrics, allowing us to accelerate research progress towards producing agents that can interact naturally with humans. https://youtu.be/YR1TngGORGQ
    Federated Split BERT for Heterogeneous Text Classification. (arXiv:2205.13299v1 [cs.CL])
    Pre-trained BERT models have achieved impressive performance in many natural language processing (NLP) tasks. However, in many real-world situations, textual data are usually decentralized over many clients and cannot be uploaded to a central server due to privacy protection and regulations. Federated learning (FL) enables multiple clients to collaboratively train a global model while keeping the local data private. A few studies have investigated BERT in the federated learning setting, but the performance loss caused by heterogeneous (e.g., non-IID) data across clients remains under-explored. To address this issue, we propose a framework, FedSplitBERT, which handles heterogeneous data and decreases the communication cost by splitting the BERT encoder layers into a local part and a global part. The local part parameters are trained by the local client only, while the global part parameters are trained by aggregating gradients of multiple clients. Due to the sheer size of BERT, we explore a quantization method to further reduce the communication cost with minimal performance loss. Our framework is ready-to-use and compatible with many existing federated learning algorithms, including FedAvg, FedProx and FedAdam. Our experiments verify the effectiveness of the proposed framework, which outperforms baseline methods by a significant margin, while FedSplitBERT with quantization can reduce the communication cost by $11.9\times$.
    The Effect of Task Ordering in Continual Learning. (arXiv:2205.13323v1 [cs.LG])
    We investigate the effect of task ordering on continual learning performance. We conduct an extensive series of empirical experiments on synthetic and naturalistic datasets and show that reordering tasks significantly affects the amount of catastrophic forgetting. Connecting to the field of curriculum learning, we show that the effect of task ordering can be exploited to modify continual learning performance, and present a simple approach for doing so. Our method computes the distance between all pairs of tasks, where distance is defined as the source task curvature of a gradient step toward the target task. Using statistically rigorous methods and sound experimental design, we show that task ordering is an important aspect of continual learning that can be modified for improved performance.
    DeepTechnome: Mitigating Unknown Bias in Deep Learning Based Assessment of CT Images. (arXiv:2205.13297v1 [eess.IV])
    Reliably detecting diseases using relevant biological information is crucial for real-world applicability of deep learning techniques in medical imaging. We debias deep learning models during training against unknown bias -- without preprocessing/filtering the input beforehand or assuming specific knowledge about its distribution or precise nature in the dataset. We use control regions as surrogates that carry information regarding the bias, employ the classifier model to extract features, and suppress biased intermediate features with our custom, modular DecorreLayer. We evaluate our method on a dataset of 952 lung computed tomography scans by introducing simulated biases w.r.t. reconstruction kernel and noise level and propose including an adversarial test set in evaluations of bias reduction techniques. In a moderately sized model architecture, applying the proposed method to learn from data exhibiting a strong bias, it near-perfectly recovers the classification performance observed when training with corresponding unbiased data.
    DT-SV: A Transformer-based Time-domain Approach for Speaker Verification. (arXiv:2205.13249v1 [cs.SD])
    Speaker verification (SV) aims to determine whether the speaker's identity of a test utterance is the same as the reference speech. In the past few years, extracting speaker embeddings using deep neural networks for SV systems has gone mainstream. Recently, different attention mechanisms and Transformer networks have been explored widely in SV fields. However, directly utilizing the original Transformer in SV may waste frame-level information in the output features, which could restrict the capacity and discrimination of speaker embeddings. Therefore, we propose an approach to derive utterance-level speaker embeddings via a Transformer architecture that uses a novel loss function, named diffluence loss, to integrate the feature information of different Transformer layers. The diffluence loss aims to aggregate frame-level features into an utterance-level representation, and it can be integrated into the Transformer expediently. Besides, we also introduce a learnable mel-fbank energy feature extractor, named the time-domain feature extractor, that computes the mel-fbank features more precisely and efficiently than the standard mel-fbank extractor. Combining the diffluence loss and the time-domain feature extractor, we propose a novel Transformer-based time-domain SV model (DT-SV) with faster training speed and higher accuracy. Experiments indicate that our proposed model achieves better performance in comparison with other models.
    Collaborative Distillation Meta Learning for Simulation Intensive Hardware Design. (arXiv:2205.13225v1 [cs.LG])
    This paper proposes a novel collaborative distillation meta learning (CDML) framework for simulation intensive hardware design problems. Deep reinforcement learning (DRL) has shown promising performance in various hardware design problems. However, previous works on DRL-based hardware design only dealt with problems with simplified objectives, which are not practical. In fact, the objective evaluation of real-world electrical performance through simulation is costly in terms of both time and computation, making DRL schemes that involve extensive reward calculations unsuitable. In this paper, we apply the CDML framework to the decoupling capacitor placement problem (DPP), one of the significant simulation intensive hardware design problems. The CDML framework consists of a context-based meta learner and a collaborative distillation scheme to produce a reusable solver. The context-based meta learner captures the location of the probing port (i.e., target circuit block) and improves generalization capability. The collaborative distillation scheme with equivariant label transformation imposes the action-permutation (AP)-equivariant nature of placement problems, which not only improves sample efficiency but also improves generalization capability. Extensive experimental results verified that our CDML outperforms both neural baselines and iterative conventional design methods in terms of the real-world objective, power integrity, with zero-shot transferability.
    A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning. (arXiv:2205.13218v1 [cs.LG])
    Real-world applications require the classification model to adapt to new classes without forgetting old ones. Correspondingly, Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement. Typical CIL methods tend to save representative exemplars from former classes to resist forgetting, while recent works find that storing models from history can substantially boost the performance. However, the stored models are not counted into the memory budget, which implicitly results in unfair comparisons. We find that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work, especially for the case with limited memory budgets. As a result, we need to holistically evaluate different CIL methods at different memory scales and simultaneously consider accuracy and memory size for measurement. On the other hand, we dive deeply into the construction of the memory buffer for memory efficiency. By analyzing the effect of different layers in the network, we find that shallow and deep layers have different characteristics in CIL. Motivated by this, we propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel. MEMO extends specialized layers based on the shared generalized representations, efficiently extracting diverse representations with modest cost and maintaining representative exemplars. Extensive experiments on benchmark datasets validate MEMO's competitive performance.
    Continual Feature Selection: Spurious Features in Continual Learning. (arXiv:2203.01012v2 [cs.LG] UPDATED)
    Continual Learning (CL) is the research field addressing learning without forgetting when the data distribution is not static. This paper studies spurious features' influence on continual learning algorithms. We show that continual learning algorithms solve tasks by selecting features that are not generalizable. Our experiments highlight that continual learning algorithms face two related problems: (1) spurious features and (2) local spurious features. The first one is due to a covariate shift between training and testing data, while the second is due to the limited access to data at each training step. We study (1) through a consistent set of continual learning experiments varying spurious correlation amount and data distribution support. We show that (2) is a major cause of performance decrease in continual learning along with catastrophic forgetting. This paper presents a different way of understanding performance decrease in continual learning by highlighting the influence of (local) spurious features on algorithms' capabilities.
    QSpeech: Low-Qubit Quantum Speech Application Toolkit. (arXiv:2205.13221v1 [quant-ph])
    Quantum devices with few qubits are common in the Noisy Intermediate-Scale Quantum (NISQ) era. However, running a Quantum Neural Network (QNN) on low-qubit quantum devices is difficult, since QNNs are based on the Variational Quantum Circuit (VQC), which requires many qubits. Therefore, it is critical to make QNNs with VQCs run on low-qubit quantum devices. In this study, we propose a novel VQC called the low-qubit VQC. A standard VQC requires a number of qubits determined by the input dimension; the low-qubit VQC, with a linear transformation, removes this constraint. Thus, it allows a QNN to run on low-qubit quantum devices for speech applications. Furthermore, compared to the standard VQC, our proposed low-qubit VQC stabilizes the training process. Based on the low-qubit VQC, we implement QSpeech, a library for quick prototyping of hybrid quantum-classical neural networks in the speech field. It provides numerous quantum neural layers and QNN models for speech applications. Experiments on Speech Command Recognition and Text-to-Speech show that our proposed low-qubit VQC outperforms the standard VQC and is more stable.
    Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. (arXiv:2205.11508v2 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations. Although SSL has recently reached a milestone, outperforming supervised methods in many modalities, its theoretical foundations are limited, method-specific, and fail to provide principled design guidelines to practitioners. In this paper, we propose a unifying framework, through the lens of spectral manifold learning, to address those limitations. Through the course of this study, we will rigorously demonstrate that VICReg, SimCLR, BarlowTwins et al. correspond to eponymous spectral methods such as Laplacian Eigenmaps, Multidimensional Scaling et al. This unification will then allow us to obtain (i) the closed-form optimal representation for each method, (ii) the closed-form optimal network parameters in the linear regime for each method, (iii) the impact of the pairwise relations used during training on each of those quantities and on downstream task performances, and most importantly, (iv) the first theoretical bridge between contrastive and non-contrastive methods towards global and local spectral embedding methods respectively, hinting at the benefits and limitations of each. For example, (i) if the pairwise relation is aligned with the downstream task, any SSL method can be employed successfully and will recover the supervised method, but in the low data regime, VICReg's invariance hyper-parameter should be high; (ii) if the pairwise relation is misaligned with the downstream task, VICReg with small invariance hyper-parameter should be preferred over SimCLR or BarlowTwins.
    Mask-based Latent Reconstruction for Reinforcement Learning. (arXiv:2201.12096v2 [cs.LG] UPDATED)
    For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance. However, in practice, limited experience and high-dimensional input prevent effective representation learning. To address this, motivated by the success of masked modeling in other research fields, we introduce mask-based reconstruction to promote state representation learning in RL. Specifically, we propose a simple yet effective self-supervised method, Mask-based Latent Reconstruction (MLR), to predict the complete state representations in the latent space from the observations with spatially and temporally masked pixels. MLR enables the better use of context information when learning state representations to make them more informative, which facilitates RL agent training. Extensive experiments show that our MLR significantly improves the sample efficiency in RL and outperforms the state-of-the-art sample-efficient RL methods on multiple continuous and discrete control benchmarks. The code will be released soon.
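    As a hedged sketch of the kind of input MLR-style training consumes, the helper below zeroes out random spatial patches and random whole frames of an observation stack; the masking ratios and patch size are hypothetical choices, not the paper's:

        import torch

        def spatiotemporal_mask(obs, p_space=0.5, p_time=0.25, patch=8):
            """obs: (T, C, H, W) stack of consecutive frames."""
            T, C, H, W = obs.shape
            gh, gw = H // patch, W // patch
            keep_space = (torch.rand(T, 1, gh, gw) > p_space).float()
            keep_space = keep_space.repeat_interleave(patch, dim=-2)
            keep_space = keep_space.repeat_interleave(patch, dim=-1)
            keep_time = (torch.rand(T, 1, 1, 1) > p_time).float()
            return obs * keep_space * keep_time   # masked pixels are zeroed

        masked = spatiotemporal_mask(torch.randn(8, 3, 64, 64))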
    The Shapley Value in Machine Learning. (arXiv:2202.05594v2 [cs.LG] UPDATED)
    Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning. In this paper, we first discuss fundamental concepts of cooperative game theory and axiomatic properties of the Shapley value. Then we give an overview of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation. We examine the most crucial limitations of the Shapley value and point out directions for future research.
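    For readers new to the concept, the standard Monte Carlo estimator of Shapley values, which averages each player's marginal contribution over random orderings, can be written in a few lines (a generic technique, not specific to this survey):

        import numpy as np

        def shapley_values(value, n_players, n_perms=200, seed=0):
            """value maps a frozenset of player indices to a real payoff."""
            rng = np.random.default_rng(seed)
            phi = np.zeros(n_players)
            for _ in range(n_perms):
                coalition = set()
                v_prev = value(frozenset(coalition))
                for i in rng.permutation(n_players):
                    coalition.add(i)
                    v_new = value(frozenset(coalition))
                    phi[i] += v_new - v_prev   # marginal contribution of i
                    v_prev = v_new
            return phi / n_perms

        # Additive toy game: player i's Shapley value is exactly i.
        print(shapley_values(lambda S: float(sum(S)), n_players=4))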
    Denial-of-Service Attacks on Learned Image Compression. (arXiv:2205.13253v1 [cs.CV])
    Deep learning techniques have shown promising results in image compression, with competitive bitrate and image reconstruction quality from the compressed latent. However, while image compression has progressed towards a higher peak signal-to-noise ratio (PSNR) and fewer bits per pixel (bpp), robustness to corner-case images has received little attention. In this work, we investigate, for the first time, the robustness of image compression systems, where an imperceptible perturbation of input images can precipitate a significant increase in the bitrate of their compressed latent. To characterize the robustness of state-of-the-art learned image compression, we mount white-box and black-box attacks. Our results on several image compression models with various bitrate qualities show that they are surprisingly fragile: the white-box attack achieves up to a 56.326x and the black-box a 1.947x bpp change. To improve robustness, we propose a novel model which incorporates attention modules and a basic factorized entropy model, resulting in a promising trade-off between the PSNR/bpp ratio and robustness to adversarial attacks that surpasses existing learned image compressors.
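    The white-box attack can be sketched as projected gradient ascent on a differentiable bitrate estimate. The tiny encoder and L1 "rate proxy" below are stand-ins of our own; a real attack would plug the codec's entropy model in place of rate_proxy:

        import torch
        import torch.nn as nn

        encoder = nn.Sequential(nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
                                nn.Conv2d(32, 64, 5, stride=2, padding=2))

        def rate_proxy(latent):                  # crude stand-in for -log p(latent)
            return latent.abs().mean()

        x = torch.rand(1, 3, 64, 64)
        delta = torch.zeros_like(x, requires_grad=True)
        eps, alpha = 2.0 / 255, 0.5 / 255
        for _ in range(20):                      # PGD ascent on the rate proxy
            loss = rate_proxy(encoder(x + delta))
            loss.backward()
            with torch.no_grad():
                delta += alpha * delta.grad.sign()
                delta.clamp_(-eps, eps)          # keep the perturbation imperceptible
                delta.grad.zero_()
        print("rate proxy after attack:", rate_proxy(encoder(x + delta)).item())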
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v1 [cs.LG])
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to iterate over the input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which generalizes active learning based on partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression.
    Trajectory-Constrained Deep Latent Visual Attention for Improved Local Planning in Presence of Heterogeneous Terrain. (arXiv:2112.04684v3 [cs.RO] UPDATED)
    We present a reward-predictive, model-based deep learning method featuring trajectory-constrained visual attention for local planning in visual navigation tasks. Our method learns to place visual attention at locations in latent image space which follow trajectories caused by vehicle control actions to enhance predictive accuracy during planning. The attention model is jointly optimized by the task-specific loss and an additional trajectory-constraint loss, allowing adaptability yet encouraging a regularized structure for improved generalization and reliability. Importantly, visual attention is applied in latent feature map space instead of raw image space to promote efficient planning. We validated our model in visual navigation tasks of planning low-turbulence, collision-free trajectories in off-road settings and hill climbing with locking differentials in the presence of slippery terrain. Experiments involved randomized, procedurally generated simulations and real-world environments. We found our method improved generalization and learning efficiency when compared to no-attention and self-attention alternatives.
    Orthogonal Stochastic Configuration Networks with Adaptive Construction Parameter for Data Analytics. (arXiv:2205.13191v1 [cs.LG])
    As a randomized learner model, Stochastic Configuration Networks (SCNs) are remarkable in that the random weights and biases are assigned via a supervisory mechanism to ensure universal approximation and fast learning. However, the randomness makes SCNs more likely to generate approximately linearly correlated nodes that are redundant and of low quality, resulting in a non-compact network structure. In light of the fundamental machine-learning principle that a model with fewer parameters generalizes better, this paper proposes Orthogonal SCN, termed OSCN, which filters out low-quality hidden nodes to reduce the network structure by incorporating Gram-Schmidt orthogonalization. The universal approximation property of OSCN and an adaptive setting for the key construction parameters are presented in detail. In addition, an incremental updating scheme is developed to dynamically determine the output weights, contributing to improved computational efficiency. Finally, experimental results on two numerical examples and several real-world regression and classification datasets substantiate the effectiveness and feasibility of the proposed approach.
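    A minimal sketch of the orthogonalization-based filtering idea (our own simplification of the paper's scheme): run Gram-Schmidt over candidate hidden-node output vectors and discard any node whose component orthogonal to the nodes already kept is negligible:

        import numpy as np

        def filter_nodes(H, tol=1e-3):
            """H: (n_samples, n_nodes) matrix of hidden-node outputs."""
            kept, basis = [], []
            for j in range(H.shape[1]):
                v = H[:, j].astype(float).copy()
                for q in basis:                      # Gram-Schmidt projection
                    v -= (q @ v) * q
                if np.linalg.norm(v) > tol * np.linalg.norm(H[:, j]):
                    basis.append(v / np.linalg.norm(v))
                    kept.append(j)                   # node adds a new direction
            return kept

        H = np.random.randn(100, 10)
        H[:, 5] = 2 * H[:, 1] + 0.5 * H[:, 3]        # a redundant node
        print(filter_nodes(H))                       # node 5 is filtered out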
    Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers. (arXiv:2107.03996v3 [cs.LG] UPDATED)
    We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method that leverages both proprioceptive states and visual observations for locomotion control. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We transfer our learned policy from simulation to a real robot by running it indoors and in the wild with unseen obstacles and terrain. Our method not only significantly improves over baselines, but also achieves far better generalization performance, especially when transferred to the real robot. Our project page with videos is at https://rchalyang.github.io/LocoTransformer/ .
    $O(N^2)$ Universal Antisymmetry in Fermionic Neural Networks. (arXiv:2205.13205v1 [cs.LG])
    Fermionic neural network (FermiNet) is a recently proposed wavefunction Ansatz, which is used in variational Monte Carlo (VMC) methods to solve the many-electron Schr\"odinger equation. FermiNet proposes permutation-equivariant architectures, on which a Slater determinant is applied to induce antisymmetry. FermiNet is proved to have universal approximation capability with a single determinant, namely, it suffices to represent any antisymmetric function given sufficient parameters. However, the asymptotic computational bottleneck comes from the Slater determinant, which scales with $O(N^3)$ for $N$ electrons. In this paper, we substitute the Slater determinant with a pairwise antisymmetry construction, which is easy to implement and can reduce the computational cost to $O(N^2)$. Furthermore, we formally prove that the pairwise construction built upon permutation-equivariant architectures can universally represent any antisymmetric function.
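    The abstract does not spell out the construction, but one classic pairwise antisymmetric ansatz with $O(N^2)$ factors can be sketched as below; swapping any two inputs flips the sign of an odd number of factors, so the product is antisymmetric:

        import torch

        def pairwise_antisymmetric(x, g):
            """psi(x) = prod_{i<j} (g(x_i, x_j) - g(x_j, x_i))."""
            n = x.shape[0]
            psi = torch.ones(())
            for i in range(n):
                for j in range(i + 1, n):
                    psi = psi * (g(x[i], x[j]) - g(x[j], x[i]))
            return psi

        g = lambda a, b: torch.tanh(a @ b + a.sum())   # any smooth pair function
        x = torch.randn(4, 3)                          # 4 "electrons" in 3D
        print(pairwise_antisymmetric(x, g))
        print(pairwise_antisymmetric(x[[1, 0, 2, 3]], g))  # opposite sign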
    On Learning Mixture of Linear Regressions in the Non-Realizable Setting. (arXiv:2205.13166v1 [stat.ML])
    While the mixture of linear regressions (MLR) is a well-studied topic, prior works usually do not analyze such models for prediction error. In fact, {\em prediction} and {\em loss} are not well-defined in the context of mixtures. In this paper, we first show that MLR can be used for prediction where, instead of predicting a single label, the model predicts a list of values (also known as {\em list-decoding}). The list size equals the number of components in the mixture, and the loss function is defined to be the minimum among the losses incurred by the component models. We show that with this definition, a solution of the empirical risk minimization (ERM) achieves a small probability of prediction error. This calls for an algorithm to minimize the empirical risk for MLR, which is known to be computationally hard. Prior algorithmic works in MLR focus on the {\em realizable} setting, i.e., recovery of parameters when data is probabilistically generated by a mixed linear (noisy) model. In this paper, we show that a version of the popular alternating minimization (AM) algorithm finds the best-fit lines in a dataset even when a realizable model is not assumed, under some regularity conditions on the dataset and the initial points, and thereby provides a solution for the ERM. We further provide an algorithm that runs in polynomial time in the number of datapoints and recovers a good approximation of the best-fit lines. The two algorithms are compared experimentally.
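    A hedged sketch of alternating minimization in this non-realizable, list-decoding setting: assign each point to the line with the smallest residual, refit each line by least squares, and repeat (the paper's regularity and initialization conditions are glossed over here):

        import numpy as np

        def alternating_minimization(X, y, k=2, iters=50, seed=0):
            rng = np.random.default_rng(seed)
            W = rng.normal(size=(k, X.shape[1]))       # initial lines
            for _ in range(iters):
                resid = (X @ W.T - y[:, None]) ** 2    # (n, k) squared residuals
                z = resid.argmin(axis=1)               # best-fitting line per point
                for j in range(k):
                    if (z == j).any():
                        W[j], *_ = np.linalg.lstsq(X[z == j], y[z == j], rcond=None)
            return W

        rng = np.random.default_rng(1)
        X = rng.normal(size=(500, 2))
        w_true = np.array([[1.0, 2.0], [-3.0, 0.5]])
        mix = rng.random(500) < 0.5
        y = np.where(mix, X @ w_true[0], X @ w_true[1]) + 0.05 * rng.normal(size=500)
        print(alternating_minimization(X, y))          # recovers the two lines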
    AI for Porosity and Permeability Prediction from Geologic Core X-Ray Micro-Tomography. (arXiv:2205.13189v1 [cs.LG])
    Geologic cores are rock samples that are extracted from deep underground during the well drilling process. They are used to characterize the performance of petroleum reservoirs. Traditionally, physical studies of cores are carried out by means of manual, time-consuming experiments. With the development of deep learning, scientists have actively started developing machine-learning-based approaches to identify physical properties without any manual experiments. Several previous works used machine learning to determine the porosity and permeability of rocks, but these methods were either inaccurate or computationally expensive. We propose self-supervised pretraining of a very small CNN-transformer-based model to predict the physical properties of rocks with high accuracy in a time-efficient manner. We show that this technique prevents overfitting even for extremely small datasets.
    RENs: Relevance Encoding Networks. (arXiv:2205.13061v1 [cs.LG])
    The manifold assumption for high-dimensional data assumes that the data is generated by varying a set of parameters obtained from a low-dimensional latent space. Deep generative models (DGMs) are widely used to learn data representations in an unsupervised way. DGMs parameterize the underlying low-dimensional manifold in the data space using bottleneck architectures such as variational autoencoders (VAEs). The bottleneck dimension for VAEs is treated as a hyperparameter that depends on the dataset and is fixed at design time after extensive tuning. As the intrinsic dimensionality of most real-world datasets is unknown, often, there is a mismatch between the intrinsic dimensionality and the latent dimensionality chosen as a hyperparameter. This mismatch can negatively contribute to the model performance for representation learning and sample generation tasks. This paper proposes relevance encoding networks (RENs): a novel probabilistic VAE-based framework that uses the automatic relevance determination (ARD) prior in the latent space to learn the data-specific bottleneck dimensionality. The relevance of each latent dimension is directly learned from the data along with the other model parameters using stochastic gradient descent and a reparameterization trick adapted to non-Gaussian priors. We leverage the concept of DeepSets to capture permutation invariant statistical properties in both data and latent spaces for relevance determination. The proposed framework is general and flexible and can be used for the state-of-the-art VAE models that leverage regularizers to impose specific characteristics in the latent space (e.g., disentanglement). With extensive experimentation on synthetic and public image datasets, we show that the proposed model learns the relevant latent bottleneck dimensionality without compromising the representation and generation quality of the samples.
    Evolutionary scheduling of university activities based on consumption forecasts to minimise electricity costs. (arXiv:2202.12595v2 [cs.LG] UPDATED)
    This paper presents a solution to a predict-then-optimise problem whose goal is to reduce the electricity cost of a university campus. The proposed methodology combines a multi-dimensional time series forecast and a novel approach to large-scale optimization. A gradient-boosting method is applied to forecast both the generation and consumption time series of the Monash university campus for the month of November 2020. For the consumption forecasts, we employ a log transformation to model trend and stabilize variance. Additional seasonality and trend features are added to the model inputs when applicable. The forecasts obtained are used as the base load for the schedule optimisation of university activities and battery usage. The goal of the optimisation is to minimize the electricity cost, consisting of the price of electricity and the peak electricity tariff, both altered by the load from class activities and battery use, as well as the penalty of not scheduling some optional activities. The schedule of the class activities is obtained through evolutionary optimisation using the covariance matrix adaptation evolution strategy and the genetic algorithm. This schedule is then improved through local search by testing possible times for each activity one by one. The battery schedule is formulated as a mixed-integer programming problem and solved by the Gurobi solver. This method obtains the second-lowest cost when evaluated against 6 other methods presented at an IEEE competition, all of which used mixed-integer programming and the Gurobi solver to schedule both the activities and the battery use. The code and data used for the paper are publicly available.
    Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search. (arXiv:2205.13134v1 [cs.AI])
    Nonlinear dynamics is ubiquitous in nature and commonly seen in various science and engineering disciplines. Distilling analytical expressions that govern nonlinear dynamics from limited data remains vital but challenging. To tackle this fundamental issue, we propose a novel Symbolic Physics Learner (SPL) machine to discover the mathematical structure of nonlinear dynamics. The key concept is to interpret mathematical operations and system state variables by computational rules and symbols, establish symbolic reasoning of mathematical formulas via expression trees, and employ a Monte Carlo tree search (MCTS) agent to explore optimal expression trees based on measurement data. The MCTS agent obtains an optimistic selection policy through the traversal of expression trees, favoring the one that maps to the arithmetic expression of the underlying physics. Salient features of the proposed framework include search flexibility and enforcement of parsimony for discovered equations. The efficacy and superiority of the SPL machine are demonstrated on numerical examples, in comparison with state-of-the-art baselines.
    On the Evolution of A.I. and Machine Learning: Towards Measuring and Understanding Impact, Influence, and Leadership at Premier A.I. Conferences. (arXiv:2205.13131v1 [cs.AI])
    Artificial Intelligence is now recognized as a general-purpose technology with ample impact on human life. In this work, we aim to understand the evolution of AI and Machine Learning by analyzing researchers' impact, influence, and leadership over the last decades. This work also intends to shed new light on the history and evolution of AI by exploring the dynamics involved in the field's evolution through the lenses of the papers published at AI conferences since the first International Joint Conference on Artificial Intelligence (IJCAI) in 1969. AI development and evolution have led to increasing research output, reflected in the number of articles published over the last sixty years. We construct comprehensive citation-collaboration and paper-author datasets and compute corresponding centrality measures to carry out our analyses. These analyses allow a better understanding of how AI has reached its current state of affairs in research. Throughout the process, we correlate these datasets with the work of the ACM Turing Award winners and the so-called two AI winters the field has gone through. We also look at self-citation trends and new authors' behaviors. Finally, we present a novel way to infer the country of affiliation of a paper from its organization. Therefore, this work provides a deep analysis of Artificial Intelligence history from information gathered and analyzed from large technical venues datasets and suggests novel insights that can contribute to understanding and measuring AI's evolution.
    Cost-efficient Gaussian Tensor Network Embeddings for Tensor-structured Inputs. (arXiv:2205.13163v1 [math.NA])
    This work discusses tensor network embeddings, which are random matrices ($S$) with tensor network structure. These embeddings have been used to perform dimensionality reduction of tensor network structured inputs $x$ and accelerate applications such as tensor decomposition and kernel regression. Existing works have designed embeddings for inputs $x$ with specific structures, such that the computational cost for calculating $Sx$ is efficient. We provide a systematic way to design tensor network embeddings consisting of Gaussian random tensors, such that for inputs with more general tensor network structures, both the sketch size (row size of $S$) and the sketching computational cost are low. We analyze general tensor network embeddings that can be reduced to a sequence of sketching matrices. We provide a sufficient condition to quantify the accuracy of such embeddings and derive sketching asymptotic cost lower bounds using embeddings that satisfy this condition and have a sketch size lower than any input dimension. We then provide an algorithm to efficiently sketch input data using such embeddings. The sketch size of the embedding used in the algorithm has a linear dependence on the number of sketching dimensions of the input. Assuming tensor contractions are performed with classical dense matrix multiplication algorithms, this algorithm achieves asymptotic cost within a factor of $O(\sqrt{m})$ of our cost lower bound, where $m$ is the sketch size. Further, when each tensor in the input has a dimension that needs to be sketched, this algorithm yields the optimal sketching asymptotic cost. We apply our sketching analysis to inexact tensor decomposition optimization algorithms. We provide a sketching algorithm for CP decomposition that is asymptotically faster than existing work in multiple regimes, and show optimality of an existing algorithm for tensor train rounding.
    Reliably-stabilizing piecewise-affine neural network controllers. (arXiv:2111.07183v3 [eess.SY] UPDATED)
    A common problem affecting neural network (NN) approximations of model predictive control (MPC) policies is the lack of analytical tools to assess the stability of the closed-loop system under the action of the NN-based controller. We present a general procedure to quantify the performance of such a controller, or to design minimum complexity NNs with rectified linear units (ReLUs) that preserve the desirable properties of a given MPC scheme. By quantifying the approximation error between NN-based and MPC-based state-to-input mappings, we first establish suitable conditions involving two key quantities, the worst-case error and the Lipschitz constant, guaranteeing the stability of the closed-loop system. We then develop an offline, mixed-integer optimization-based method to compute those quantities exactly. Together these techniques provide conditions sufficient to certify the stability and performance of a ReLU-based approximation of an MPC control law.
    The Neural Testbed: Evaluating Joint Predictions. (arXiv:2110.04629v3 [cs.LG] UPDATED)
    Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces \textit{The Neural Testbed}: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions. We also show that the quality of joint predictions drives performance in downstream decision tasks. We find these results are robust across a wide range of generative models, and highlight the practical importance of joint predictions to the community.
    Deep Generative Modeling for Volume Reconstruction in Cryo-Electron Microscopy. (arXiv:2201.02867v3 [eess.IV] UPDATED)
    Recent breakthroughs in high-resolution imaging of biomolecules in solution with cryo-electron microscopy (cryo-EM) have unlocked new doors for the reconstruction of molecular volumes, thereby promising further advances in biology, chemistry, and pharmacological research. Recent next-generation volume reconstruction algorithms that combine generative modeling with end-to-end unsupervised deep learning techniques have shown promising preliminary results, but still face considerable technical and theoretical hurdles when applied to experimental cryo-EM images. In light of the proliferation of such methods, we propose here a critical review of recent advances in the field of deep generative modeling for cryo-EM volume reconstruction. The present review aims to (i) unify and compare these new methods using a consistent statistical framework, (ii) present them using a terminology familiar to machine learning researchers and computational biologists with no specific background in cryo-EM, and (iii) provide the necessary perspective on current advances to highlight their relative strengths and weaknesses, along with outstanding bottlenecks and avenues for improvements in the field. This review might also raise the interest of computer vision practitioners, as it highlights significant limits of deep generative models in low signal-to-noise regimes -- therefore emphasizing a need for new theoretical and methodological developments.
    Efficient and Near-Optimal Smoothed Online Learning for Generalized Linear Functions. (arXiv:2205.13056v1 [stat.ML])
    Due to the drastic gap in complexity between sequential and batch statistical learning, recent work has studied a smoothed sequential learning setting, where Nature is constrained to select contexts with density bounded by $1/\sigma$ with respect to a known measure $\mu$. Unfortunately, for some function classes, there is an exponential gap between the statistically optimal regret and that which can be achieved efficiently. In this paper, we give a computationally efficient algorithm that is the first to enjoy the statistically optimal $\log(T/\sigma)$ regret for realizable $K$-wise linear classification. We extend our results to settings where the true classifier is linear in an over-parameterized polynomial featurization of the contexts, as well as to a realizable piecewise-regression setting assuming access to an appropriate ERM oracle. Somewhat surprisingly, standard disagreement-based analyses are insufficient to achieve regret logarithmic in $1/\sigma$. Instead, we develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers. Along the way, we develop numerous technical tools of independent interest, including a general anti-concentration bound for the determinant of certain matrix averages.
    Matryoshka Representations for Adaptive Deployment. (arXiv:2205.13147v1 [cs.LG])
    Learned representations are a central component in modern ML systems, serving a multitude of downstream tasks. When training such representations, it is often the case that computational and statistical constraints for each downstream task are unknown. In this context, rigid, fixed-capacity representations can be either over or under-accommodating to the task at hand. This leads us to ask: can we design a flexible representation that can adapt to multiple downstream tasks with varying computational resources? Our main contribution is Matryoshka Representation Learning (MRL) which encodes information at different granularities and allows a single embedding to adapt to the computational constraints of downstream tasks. MRL minimally modifies existing representation learning pipelines and imposes no additional cost during inference and deployment. MRL learns coarse-to-fine representations that are at least as accurate and rich as independently trained low-dimensional representations. The flexibility within the learned Matryoshka Representations offers: (a) up to 14x smaller embedding size for ImageNet-1K classification at the same level of accuracy; (b) up to 14x real-world speed-ups for large-scale retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations. Finally, we show that MRL extends seamlessly to web-scale datasets (ImageNet, JFT) across various modalities -- vision (ViT, ResNet), vision + language (ALIGN) and language (BERT). MRL code and pretrained models are open-sourced at https://github.com/RAIVNLab/MRL.
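    A minimal sketch of the nested-granularity training signal, with hypothetical dimensions and a separate linear head per prefix size (the open-sourced repository above is the reference implementation):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MatryoshkaHead(nn.Module):
            """Supervise one embedding at several nested prefix sizes so any
            prefix is usable on its own at deployment time."""
            def __init__(self, n_classes=10, granularities=(8, 32, 128, 256)):
                super().__init__()
                self.granularities = granularities
                self.heads = nn.ModuleList(nn.Linear(m, n_classes) for m in granularities)

            def forward(self, z, target):
                # Sum the classification losses over all nested prefixes z[:, :m].
                return sum(F.cross_entropy(head(z[:, :m]), target)
                           for m, head in zip(self.granularities, self.heads))

        z = torch.randn(16, 256)                       # embeddings from any backbone
        loss = MatryoshkaHead()(z, torch.randint(0, 10, (16,)))
        loss.backward()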
    Mitigating Memorization of Noisy Labels via Regularization between Representations. (arXiv:2110.09022v3 [cs.LG] UPDATED)
    Designing robust loss functions is popular in learning with noisy labels, but existing designs do not explicitly consider the overfitting property of deep neural networks (DNNs). As a result, applying these losses may still suffer from overfitting/memorizing noisy labels as training proceeds. In this paper, we first theoretically analyze the memorization effect and show that a lower-capacity model may perform better on noisy datasets. However, it is non-trivial to design a neural network with the best capacity given an arbitrary task. To circumvent this dilemma, instead of changing the model architecture, we decouple DNNs into an encoder followed by a linear classifier and propose to restrict the function space of a DNN by a representation regularizer. Particularly, we require the distance between two self-supervised features to be positively related to the distance between the corresponding two supervised model outputs. Our proposed framework is easily extendable and can incorporate many other robust loss functions to further improve performance. Extensive experiments and theoretical analyses support our claims. Code is available at github.com/UCSC-REAL/SelfSup_NoisyLabel.
    Joint Synthesis of Safety Certificate and Safe Control Policy using Constrained Reinforcement Learning. (arXiv:2111.07695v3 [cs.LG] UPDATED)
    Safety is the major consideration in controlling complex dynamical systems using reinforcement learning (RL), where a safety certificate can provide a provable safety guarantee. A valid safety certificate is an energy function indicating that safe states have low energy, and there exists a corresponding safe control policy that allows the energy function to always dissipate. The safety certificate and the safe control policy are closely related to each other, and both are challenging to synthesize. Therefore, existing learning-based studies treat either of them as prior knowledge to learn the other, which limits their applicability to general unknown dynamics. This paper proposes a novel approach that simultaneously synthesizes the energy-function-based safety certificate and learns the safe control policy with constrained reinforcement learning (CRL). We do not rely on prior knowledge about either an available model-based controller or a perfect safety certificate. In particular, we formulate a loss function to optimize the safety certificate parameters by minimizing the occurrence of energy increases. By adding this optimization procedure as an outer loop to Lagrangian-based CRL, we jointly update the policy and safety certificate parameters and prove that they will converge to their respective local optima, the optimal safe policy and a valid safety certificate. We evaluate our algorithms on multiple safety-critical benchmark environments. The results show that the proposed algorithm learns provably safe policies with no constraint violation. The validity and feasibility of the synthesized safety certificate are also verified numerically.
    GraphPMU: Event Clustering via Graph Representation Learning Using Locationally-Scarce Distribution-Level Fundamental and Harmonic PMU Measurements. (arXiv:2205.13116v1 [cs.LG])
    This paper is concerned with the complex task of identifying the type and cause of the events that are captured by distribution-level phasor measurement units (D-PMUs) in order to enhance situational awareness in power distribution systems. Our goal is to address two fundamental challenges in this field: a) scarcity in measurement locations due to the high cost of purchasing, installing, and streaming data from D-PMUs; b) limited prior knowledge about the event signatures due to the fact that the events are diverse, infrequent, and inherently unscheduled. To tackle these challenges, we propose an unsupervised graph-representation learning method, called GraphPMU, to significantly improve the performance in event clustering under locationally-scarce data availability by proposing the following two new directions: 1) using the topological information about the relative location of the few available phasor measurement units on the graph of the power distribution network; 2) utilizing not only the commonly used fundamental phasor measurements, but also the less explored harmonic phasor measurements in the process of analyzing the signatures of various events. Through a detailed analysis of several case studies, we show that GraphPMU considerably outperforms the prevalent methods in the literature.
    Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments. (arXiv:2205.13044v1 [cs.LG])
    We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound $\Omega((B_{\star} SAT_{\star}(\Delta_c + B_{\star}^2\Delta_P))^{1/3}K^{2/3})$, where $B_{\star}$ is the maximum expected cost of the optimal policy of any episode starting from any state, $T_{\star}$ is the maximum hitting time of the optimal policy of any episode starting from the initial state, $SA$ is the number of state-action pairs, $\Delta_c$ and $\Delta_P$ are the amount of changes of the cost and transition functions respectively, and $K$ is the number of episodes. The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of $\Delta_c$ and $\Delta_P$, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when $\Delta_c$ and $\Delta_P$ are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve $\widetilde{O}(\min\{B_{\star} S\sqrt{ALK}, (B_{\star}^2S^2AT_{\star}(\Delta_c+B_{\star}\Delta_P))^{1/3}K^{2/3}\})$ regret, where $L$ is the unknown number of changes of the environment.
    Understanding Metrics for Paraphrasing. (arXiv:2205.13119v1 [cs.CL])
    Paraphrase generation is a difficult problem. This is not only because of limitations in text generation capabilities but also due to the lack of a proper definition of what qualifies as a paraphrase, and of corresponding metrics to measure how good it is. Designing metrics to evaluate paraphrasing quality is an ongoing research problem. Most of the existing metrics in use, having been borrowed from other tasks, do not capture the complete essence of a good paraphrase and often fail at borderline cases. In this work, we propose a novel metric $ROUGE_P$ to measure the quality of paraphrases along the dimensions of adequacy, novelty and fluency. We also provide empirical evidence to show that the current natural language generation metrics are insufficient to measure these desired properties of a good paraphrase. We look at paraphrase model fine-tuning and generation through the lens of metrics to gain a deeper understanding of what it takes to generate and evaluate a good paraphrase.
    A Penalized Shared-parameter Algorithm for Estimating Optimal Dynamic Treatment Regimens. (arXiv:2107.07875v2 [stat.ML] UPDATED)
    A dynamic treatment regimen (DTR) is a set of decision rules to personalize treatments for an individual using their medical history. The Q-learning based Q-shared algorithm has been used to develop DTRs that involve decision rules shared across multiple stages of intervention. We show that the existing Q-shared algorithm can suffer from non-convergence due to the use of linear models in the Q-learning setup, and identify the condition in which Q-shared fails. Leveraging properties from expansion-constrained ordinary least-squares, we give a penalized Q-shared algorithm that not only converges in settings that violate the condition, but can outperform the original Q-shared algorithm even when the condition is satisfied. We give evidence for the proposed method in a real-world application and several synthetic simulations.
    Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model. (arXiv:2205.13085v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients even within the same diagnostic category. A few underlying root causes may nevertheless initiate the development of disease within each patient. We therefore focus on identifying patient-specific root causes of disease, which we equate to the sample-specific predictivity of the exogenous error terms in a structural equation model. We generalize from the linear setting to the heteroscedastic noise model where $Y = m(X) + \varepsilon\sigma(X)$ with non-linear functions $m(X)$ and $\sigma(X)$ representing the conditional mean and mean absolute deviation, respectively. This model preserves identifiability but introduces non-trivial challenges that require a customized algorithm called Generalized Root Causal Inference (GRCI) to extract the error terms correctly. GRCI recovers patient-specific root causes more accurately than existing alternatives.
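    A hedged sketch of the error-extraction step implied by the model $Y = m(X) + \varepsilon\sigma(X)$, using off-the-shelf regressors for the conditional mean and conditional mean absolute deviation (GRCI itself involves more than this two-stage fit):

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        def extract_errors(X, y):
            m = GradientBoostingRegressor().fit(X, y)          # conditional mean m(X)
            r = y - m.predict(X)
            s = GradientBoostingRegressor().fit(X, np.abs(r))  # conditional MAD sigma(X)
            return r / np.clip(s.predict(X), 1e-6, None)       # standardized errors

        rng = np.random.default_rng(0)
        X = rng.uniform(-2, 2, size=(2000, 1))
        eps = rng.normal(size=2000)
        y = np.sin(X[:, 0]) + eps * (0.2 + X[:, 0] ** 2)       # heteroscedastic noise
        print(np.corrcoef(eps, extract_errors(X, y))[0, 1])    # high in-sample correlation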
    Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization. (arXiv:2205.13098v1 [cs.LG])
    The computation of Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. The approximation of the Wasserstein gradient with finite samples requires solving a variational problem. We study the variational problem in the family of two-layer networks with squared-ReLU activations, towards which we derive a semi-definite programming (SDP) relaxation. This SDP can be viewed as an approximation of the Wasserstein gradient in a broader function family including two-layer networks. By solving the convex SDP, we obtain the optimal approximation of the Wasserstein gradient direction in this class of functions. Numerical experiments including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling demonstrate the effectiveness of the proposed method.
    Cali3F: Calibrated Fast Fair Federated Recommendation System. (arXiv:2205.13121v1 [cs.IR])
    The increasingly stringent regulations on privacy protection have sparked interest in federated learning. As a distributed machine learning framework, it bridges isolated data islands by training a global model over devices while keeping data localized. Specific to recommendation systems, many federated recommendation algorithms have been proposed to realize privacy-preserving collaborative recommendation. However, several constraints remain largely unexplored. One big concern is how to ensure fairness between participants of federated learning, that is, to maintain the uniformity of recommendation performance across devices. On the other hand, due to data heterogeneity and limited networks, additional challenges arise regarding convergence speed. To address these problems, in this paper, we first propose a personalized federated recommendation system training algorithm to improve recommendation performance fairness. Then we adopt a clustering-based aggregation method to accelerate the training process. Combining the two components, we propose Cali3F, a calibrated, fast and fair federated recommendation framework. Cali3F not only addresses the convergence problem by a within-cluster parameter sharing approach but also significantly boosts fairness by calibrating local models with the global model. We demonstrate the performance of Cali3F across standard benchmark datasets and explore the efficacy in comparison to traditional aggregation approaches.
    QGNN: Value Function Factorisation with Graph Neural Networks. (arXiv:2205.13005v1 [cs.LG])
    In multi-agent reinforcement learning, the use of a global objective is a powerful tool for incentivising cooperation. Unfortunately, it is not sample-efficient to train individual agents with a global reward, because it does not necessarily correlate with an agent's individual actions. This problem can be solved by factorising the global value function into local value functions. Early work in this domain performed factorisation by conditioning local value functions purely on local information. Recently, it has been shown that providing both local information and an encoding of the global state can promote cooperative behaviour. In this paper we propose QGNN, the first value factorisation method to use a graph neural network (GNN) based model. The multi-layer message passing architecture of QGNN provides more representational complexity than models in prior work, allowing it to produce a more effective factorisation. QGNN also introduces a permutation invariant mixer which is able to match the performance of other methods, even with significantly fewer parameters. We evaluate our method against several baselines, including QMIX-Att, GraphMIX, QMIX, VDN, and hybrid architectures. Our experiments include Starcraft, the standard benchmark for credit assignment; Estimate Game, a custom environment that explicitly models inter-agent dependencies; and Coalition Structure Generation, a foundational problem with real-world applications. The results show that QGNN outperforms state-of-the-art value factorisation baselines consistently.
    Urban Rhapsody: Large-scale exploration of urban soundscapes. (arXiv:2205.13064v1 [cs.CY])
    Noise is one of the primary quality-of-life issues in urban environments. In addition to annoyance, noise negatively impacts public health and educational performance. While low-cost sensors can be deployed to monitor ambient noise levels at high temporal resolutions, the amount of data they produce and the complexity of these data pose significant analytical challenges. One way to address these challenges is through machine listening techniques, which are used to extract features in attempts to classify the source of noise and understand temporal patterns of a city's noise situation. However, the overwhelming number of noise sources in the urban environment and the scarcity of labeled data make it nearly impossible to create classification models with large enough vocabularies that capture the true dynamism of urban soundscapes. In this paper, we first identify a set of requirements in the yet unexplored domain of urban soundscape exploration. To satisfy the requirements and tackle the identified challenges, we propose Urban Rhapsody, a framework that combines state-of-the-art audio representation, machine learning, and visual analytics to allow users to interactively create classification models, understand noise patterns of a city, and quickly retrieve and label audio excerpts in order to create a large high-precision annotated database of urban sound recordings. We demonstrate the tool's utility through case studies performed by domain experts using data generated over the five-year deployment of a one-of-a-kind sensor network in New York City.
    Semi-supervised Drifted Stream Learning with Short Lookback. (arXiv:2205.13066v1 [cs.LG])
    In many scenarios, 1) data streams are generated in real time; 2) labeled data are expensive and only limited labels are available in the beginning; 3) real-world data is not always i.i.d. and data drifts gradually over time; 4) the storage of historical streams is limited and model updating can only be achieved based on a very short lookback window. This learning setting limits the applicability and availability of many Machine Learning (ML) algorithms. We generalize the learning task under such a setting as a semi-supervised drifted stream learning with short lookback problem (SDSL). SDSL imposes two under-addressed challenges on existing methods in semi-supervised learning, continual learning, and domain adaptation: 1) robust pseudo-labeling under gradual shifts and 2) anti-forgetting adaptation with short lookback. To tackle these challenges, we propose a principled and generic generation-replay framework to solve SDSL. The framework accomplishes: 1) robust pseudo-labeling in the generation step; 2) anti-forgetting adaptation in the replay step. To achieve robust pseudo-labeling, we develop a novel pseudo-label classification model to leverage supervised knowledge of previously labeled data, unsupervised knowledge of new data, and structural knowledge of invariant label semantics. To achieve adaptive anti-forgetting model replay, we propose to view the anti-forgetting adaptation task as a flat region search problem. We propose a novel minimax game-based replay objective function to solve the flat region search problem and develop an effective optimization solver. Finally, we present extensive experiments to demonstrate that our framework can effectively address the task of anti-forgetting learning in drifted streams with short lookback.
    Factorized Structured Regression for Large-Scale Varying Coefficient Models. (arXiv:2205.13080v1 [stat.ML])
    Recommender Systems (RS) pervade many aspects of our everyday digital life. Proposed to work at scale, state-of-the-art RS allow the modeling of thousands of interactions and facilitate highly individualized recommendations. Conceptually, many RS can be viewed as instances of statistical regression models that incorporate complex feature effects and potentially non-Gaussian outcomes. Such structured regression models, including time-aware varying coefficients models, are, however, limited in their applicability to categorical effects and inclusion of a large number of interactions. Here, we propose Factorized Structured Regression (FaStR) for scalable varying coefficient models. FaStR overcomes limitations of general regression models for large-scale data by combining structured additive regression and factorization approaches in a neural network-based model implementation. This fusion provides a scalable framework for the estimation of statistical models in previously infeasible data settings. Empirical results confirm that the estimation of varying coefficients of our approach is on par with state-of-the-art regression techniques, while scaling notably better and also being competitive with other time-aware RS in terms of prediction performance. We illustrate FaStR's performance and interpretability on a large-scale behavioral study with smartphone user data.
    Transferable Adversarial Attack based on Integrated Gradients. (arXiv:2205.13152v1 [cs.LG])
    The vulnerability of deep neural networks to adversarial examples has drawn tremendous attention from the community. Three approaches, optimizing standard objective functions, exploiting attention maps, and smoothing decision surfaces, are commonly used to craft adversarial examples. By tightly integrating the three approaches, we propose a new and simple algorithm named Transferable Attack based on Integrated Gradients (TAIG) in this paper, which can find highly transferable adversarial examples for black-box attacks. Unlike previous methods that use multiple computational terms or combine with other methods, TAIG integrates the three approaches into one single term. Two versions of TAIG that compute their integrated gradients on a straight-line path and a random piecewise linear path are studied. Both versions offer strong transferability and can seamlessly work together with the previous methods. Experimental results demonstrate that TAIG outperforms the state-of-the-art methods. The code will be available at https://github.com/yihuang2016/TAIG
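    A hedged sketch of the straight-line-path variant: integrated gradients of the true-class logit are accumulated along the path from a zero baseline to the input, and their sign gives the perturbation direction (the toy model, step count, and budget below are hypothetical):

        import torch

        def integrated_gradients(model, x, label, steps=32):
            ig = torch.zeros_like(x)
            for alpha in torch.linspace(1.0 / steps, 1.0, steps):
                xi = (alpha * x).clone().requires_grad_(True)
                model(xi.unsqueeze(0))[0, label].backward()    # grad of the logit
                ig += xi.grad
            return x * ig / steps                              # zero baseline

        model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
        x = torch.rand(3, 32, 32)
        # One attack step that suppresses the true-class attribution.
        x_adv = x - (4.0 / 255) * integrated_gradients(model, x, label=3).sign()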
    Learning to Query Internet Text for Informing Reinforcement Learning Agents. (arXiv:2205.13079v1 [cs.LG])
    Generalization to out-of-distribution tasks in reinforcement learning is a challenging problem. One successful approach improves generalization by conditioning policies on task or environment descriptions that provide information about the current transition or reward functions. Previously, these descriptions were often expressed as generated or crowdsourced text. In this work, we begin to tackle the problem of extracting useful information from natural language found in the wild (e.g. internet forums, documentation, and wikis). These natural, pre-existing sources are noisy and large, and present novel challenges compared to previous approaches. We propose to address these challenges by training reinforcement learning agents to learn to query these sources as a human would, and we experiment with how and when an agent should query. To address the \textit{how}, we demonstrate that pretrained QA models perform well at executing zero-shot queries in our target domain. Using information retrieved by a QA model, we train an agent to learn \textit{when} it should execute queries. We show that our method correctly learns to execute queries to maximize reward in a reinforcement learning setting.
    TSEM: Temporally Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series. (arXiv:2205.13012v1 [cs.LG])
    Deep learning has become a one-size-fits-all solution for technical and business domains thanks to its flexibility and adaptability. It is, however, implemented using opaque models, which undermines trust in their outcomes. In order to better understand the behavior of a system, particularly one driven by time series, looking inside a deep learning model via so-called post-hoc eXplainable Artificial Intelligence (XAI) approaches is important. There are two major types of XAI for time series data, namely model-agnostic and model-specific; the model-specific approach is considered in this work. While other approaches employ either Class Activation Mapping (CAM) or an Attention Mechanism, we merge the two strategies into a single system, simply called the Temporally Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series (TSEM). TSEM combines the capabilities of RNN and CNN models in such a way that RNN hidden units are employed as attention weights for the temporal axis of the CNN feature maps. The results show that TSEM outperforms XCM. It is similar to STAM in terms of accuracy, while also satisfying a number of interpretability criteria, including causality, fidelity, and spatiotemporality.
    Preference Dynamics Under Personalized Recommendations. (arXiv:2205.13026v1 [cs.LG])
    Many projects (both practical and academic) have designed algorithms to match users to content they will enjoy under the assumption that users' preferences and opinions do not change with the content they see. Evidence suggests that individuals' preferences are directly shaped by what content they see -- radicalization, rabbit holes, polarization, and boredom are all examples of preferences affected by content. Polarization in particular can occur even in ecosystems with "mass media," where no personalization takes place, as recently explored in a natural model of preference dynamics by~\citet{hkazla2019geometric} and~\citet{gaitonde2021polarization}. If all users' preferences are drawn towards content they already like, or are repelled from content they already dislike, uniform consumption of media leads to a population of heterogeneous preferences converging towards only two poles. In this work, we explore whether some phenomenon akin to polarization occurs when users receive \emph{personalized} content recommendations. We use a similar model of preference dynamics, where an individual's preferences move towards content they consume and enjoy, and away from content they consume and dislike. We show that standard user reward maximization is an almost trivial goal in such an environment (a large class of simple algorithms will achieve only constant regret). A more interesting objective, then, is to understand under what conditions a recommendation algorithm can ensure stationarity of users' preferences. We show how to design content recommendations that achieve approximate stationarity, under mild conditions on the set of available content, when a user's preferences are known, and how one can learn enough about a user's preferences to implement such a strategy even when the preferences are initially unknown.
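    A minimal simulation of the underlying dynamics under uniform, non-personalized consumption, written from the abstract's description; the step size and the sign-based attract/repel rule are our own simplifications:

        import numpy as np

        rng = np.random.default_rng(0)
        users = rng.normal(size=(200, 2))
        users /= np.linalg.norm(users, axis=1, keepdims=True)   # unit preferences
        content = rng.normal(size=(50, 2))
        content /= np.linalg.norm(content, axis=1, keepdims=True)
        eta = 0.05

        for _ in range(500):
            c = content[rng.integers(len(content))]             # "mass media" item
            affinity = users @ c                                # like if > 0
            users += eta * np.sign(affinity)[:, None] * c       # attract or repel
            users /= np.linalg.norm(users, axis=1, keepdims=True)

        # Heterogeneous preferences drift toward two antipodal clusters.
        print(np.round(users[:5], 2))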
    Trainable Weight Averaging for Fast Convergence and Better Generalization. (arXiv:2205.13104v1 [cs.LG])
    Stochastic gradient descent (SGD) and its variants are commonly considered as the de-facto methods to train deep neural networks (DNNs). While recent improvements to SGD mainly focus on the descent algorithm itself, few works pay attention to utilizing the historical solutions -- as an iterative method, SGD has actually gone through substantial explorations before its final convergence. Recently, an interesting attempt is stochastic weight averaging (SWA), which significantly improves the generalization by simply averaging the solutions at the tail stage of training. In this paper, we propose to optimize the averaging coefficients, leading to our Trainable Weight Averaging (TWA), essentially a novel training method in a reduced subspace spanned by historical solutions. TWA is quite efficient and has good generalization capability as the degree of freedom for training is small. It largely reduces the estimation error from SWA, making it not only further improve the SWA solutions but also take full advantage of the solutions generated in the head of training where SWA fails. In the extensive numerical experiments, (i) TWA achieves consistent improvements over SWA with less sensitivity to learning rate; (ii) applying TWA in the head stage of training largely speeds up the convergence, resulting in over 40% time saving on CIFAR and 30% on ImageNet with improved generalization compared with regular training. The code is released at https://github.com/nblt/TWA.
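    A hedged sketch of the core idea: treat the historical checkpoints as a basis and learn the averaging coefficients by gradient descent. The softmax normalization is our assumption, not necessarily the paper's parameterization; the linked repository holds the actual method:

        import torch

        def trainable_weight_averaging(checkpoints, loss_fn, steps=100, lr=0.1):
            P = torch.stack(checkpoints)               # (k, n_params) flattened solutions
            logits = torch.zeros(P.shape[0], requires_grad=True)
            opt = torch.optim.Adam([logits], lr=lr)
            for _ in range(steps):
                w = torch.softmax(logits, dim=0)       # coefficients sum to 1
                loss = loss_fn(w @ P)                  # evaluate the averaged model
                opt.zero_grad()
                loss.backward()
                opt.step()
            return torch.softmax(logits, dim=0).detach() @ P

        # Toy quadratic "training loss" and three checkpoints near its optimum.
        target = torch.tensor([1.0, -2.0, 0.5])
        ckpts = [target + 0.3 * torch.randn(3) for _ in range(3)]
        print(trainable_weight_averaging(ckpts, lambda p: ((p - target) ** 2).sum()))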
    Designing an Efficient End-to-end Machine Learning Pipeline for Real-time Empty-shelf Detection. (arXiv:2205.13060v1 [cs.LG])
    On-Shelf Availability (OSA) of products in retail stores is a critical business criterion in the fast-moving consumer goods and retail sector. When a product is out-of-stock (OOS) and a customer cannot find it on its designated shelf, this negatively impacts the customer's behavior and future demand. Several methods are adopted by retailers today to detect empty shelves and ensure high OSA of products; however, such methods are generally ineffective or infeasible since they are either manual, expensive, or inaccurate. Recently, machine-learning-based solutions have been proposed, but they suffer from high computation cost and low accuracy due to the lack of large annotated datasets of on-shelf products. Here, we present an approach for designing an end-to-end machine learning (ML) pipeline for real-time empty-shelf detection. Considering the strong dependency between the quality of ML models and the quality of data, we focus on proper data collection, cleaning, and correct data annotation before delving into modeling. Since an empty-shelf detection solution should be computationally efficient for real-time predictions, we explore different run-time optimizations to improve the model performance. Our dataset contains 1000 images, collected and annotated by following well-defined guidelines. Our low-latency model achieves a mean average F1-score of 68.5%, and can process up to 67 images/s on Intel Xeon Gold and up to 860 images/s on an A100 GPU. Our annotated dataset is publicly available along with our optimized models.
    BRIGHT -- Graph Neural Networks in Real-Time Fraud Detection. (arXiv:2205.13084v1 [cs.LG])
    Detecting fraudulent transactions is an essential component to control risk in e-commerce marketplaces. Apart from rule-based and machine learning filters that are already deployed in production, we want to enable efficient real-time inference with graph neural networks (GNNs), which is useful to catch multihop risk propagation in a transaction graph. However, two challenges arise in the implementation of GNNs in production. First, future information in a dynamic graph should not be considered in message passing to predict the past. Second, the latency of graph query and GNN model inference is usually up to hundreds of milliseconds, which is costly for some critical online services. To tackle these challenges, we propose a Batch and Real-time Inception GrapH Topology (BRIGHT) framework to conduct an end-to-end GNN learning that allows efficient online real-time inference. The BRIGHT framework consists of a graph transformation module (Two-Stage Directed Graph) and a corresponding GNN architecture (Lambda Neural Network). The Two-Stage Directed Graph guarantees that the information passed through neighbors comes only from historical payment transactions. It consists of two subgraphs representing historical relationships and real-time links, respectively. The Lambda Neural Network decouples inference into two stages: batch inference of entity embeddings and real-time inference of transaction prediction. Our experiments show that BRIGHT outperforms the baseline models by >2% on average w.r.t. precision. Furthermore, BRIGHT is computationally efficient for real-time fraud detection. Regarding end-to-end performance (including neighbor query and inference), BRIGHT can reduce the P99 latency by >75%. For the inference stage, our speedup is on average 7.8$\times$ compared to the traditional GNN.
    EvoVGM: A Deep Variational Generative Model for Evolutionary Parameter Estimation. (arXiv:2205.13034v1 [cs.LG])
    Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences, as is done within the Bayesian phylogenetic inference framework. In this study, we propose a deep variational Bayesian generative model that jointly approximates the true posterior of local biological evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as JC69 and GTR. We train the model via a low-variance variational objective function and a gradient ascent algorithm. Here, we show the consistency and effectiveness of the method on synthetic sequence alignments simulated with several evolutionary scenarios, and on a real virus sequence alignment.
    Green Hierarchical Vision Transformer for Masked Image Modeling. (arXiv:2205.13515v1 [cs.CV])
    We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer, allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of self-attention w.r.t. the number of patches, group attention encourages a uniform partition in which the visible patches within each local window of arbitrary size are grouped into equal-sized groups, and masked self-attention is then performed within each group. Second, we further improve the grouping strategy via a Dynamic Programming algorithm that minimizes the overall computation cost of the attention on the grouped patches. As a result, MIM can now work on hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs about 2.7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and superiority on the downstream COCO object detection benchmarks. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.
    BiT: Robustly Binarized Multi-distilled Transformer. (arXiv:2205.13016v1 [cs.LG])
    Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues, but is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than was possible previously. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher precision models into lower precision students. These approaches allow, for the first time, fully binarized transformer models at a practical level of accuracy, approaching the full-precision BERT baseline on the GLUE language understanding benchmark to within as little as 5.9%.
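    As an illustration of one ingredient, the following is a minimal sketch of a binary activation with a learned scale, trained through a straight-through estimator; the paper's elastic binarization function may be parameterized differently.

```python
# Hedged sketch: hard binarization in the forward pass, identity gradient
# in the backward pass (straight-through estimator), with a learned scale.
import torch
import torch.nn as nn

class LearnedScaleBinarizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(1.0))  # learned per-layer scale

    def forward(self, x):
        b = torch.sign(x)            # forward: binarize to {-1, 0, +1}
        b = b + x - x.detach()       # STE: gradient flows as if identity
        return self.scale * b

layer = LearnedScaleBinarizer()
x = torch.randn(4, requires_grad=True)
layer(x).sum().backward()            # both x.grad and scale.grad populated
print(x.grad, layer.scale.grad)
```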
    People counting system for retail analytics using edge AI. (arXiv:2205.13020v1 [cs.LG])
    Developments in IoT applications are playing an important role in our day-to-day life, from business forecasting to self-driving cars. One of the areas most influenced by AI and IoT is retail analytics. In retail analytics, the conversion rate is the metric most often used by retail stores to measure how many people visited the store and how many purchases took place. This conversion rate informs decisions about marketing operations, stock levels, store layout, and promotions. Our project builds a cost-effective people counting system with AI at the edge: it calculates the conversion rate from the total number of people counted by the system and the number of transactions for the day, providing analytical insights for retail store optimization with minimal hardware requirements.
    Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification. (arXiv:2205.13094v1 [cs.LG])
    While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an $\textit{undersampled}$ dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or unless the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal, while in the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.
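    The undersampling baseline itself is simple to state; a minimal sketch, assuming group labels are available:

```python
# Minimal sketch of the undersampling baseline: discard excess
# majority-group samples so that all groups are equally represented.
import numpy as np

def undersample(X, groups, rng=np.random.default_rng(0)):
    """Keep an equal number of samples from every group."""
    ids = [np.flatnonzero(groups == g) for g in np.unique(groups)]
    n_min = min(len(i) for i in ids)                  # minority group size
    keep = np.concatenate([rng.choice(i, n_min, replace=False) for i in ids])
    return X[keep], groups[keep]

X = np.arange(10).reshape(-1, 1)
groups = np.array([0] * 8 + [1] * 2)                  # 8 majority, 2 minority
Xb, gb = undersample(X, groups)
print(gb)                                             # two samples per group
```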
    Improving Subgraph Representation Learning via Multi-View Augmentation. (arXiv:2205.13038v1 [cs.LG])
    Subgraph representation learning based on Graph Neural Network (GNN) has broad applications in chemistry and biology, such as molecule property prediction and gene collaborative function prediction. On the other hand, graph augmentation techniques have shown promising results in improving graph-based and node-based classification tasks but are rarely explored in the GNN-based subgraph representation learning literature. In this work, we developed a novel multiview augmentation mechanism to improve subgraph representation learning and thus the accuracy of downstream prediction tasks. The augmentation technique creates multiple variants of subgraphs and embeds these variants into the original graph to achieve both high training efficiency, scalability, and improved accuracy. Experiments on several real-world subgraph benchmarks demonstrate the superiority of our proposed multi-view augmentation techniques.
    Online Deep Equilibrium Learning for Regularization by Denoising. (arXiv:2205.13051v1 [eess.IV])
    Plug-and-Play Priors (PnP) and Regularization by Denoising (RED) are widely-used frameworks for solving imaging inverse problems by computing fixed-points of operators combining physical measurement models and learned image priors. While traditional PnP/RED formulations have focused on priors specified using image denoisers, there is a growing interest in learning PnP/RED priors that are end-to-end optimal. The recent Deep Equilibrium Models (DEQ) framework has enabled memory-efficient end-to-end learning of PnP/RED priors by implicitly differentiating through the fixed-point equations without storing intermediate activation values. However, the dependence of the computational/memory complexity of the measurement models in PnP/RED on the total number of measurements renders DEQ impractical for many imaging applications. We propose ODER as a new strategy for improving the efficiency of DEQ through stochastic approximations of the measurement models. We theoretically analyze ODER, giving insights into its convergence and its ability to approximate the traditional DEQ approach. Our numerical results suggest potential improvements in training/testing complexity due to ODER on three distinct imaging applications.
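    For orientation, a minimal sketch of the RED fixed-point iteration that DEQ and ODER build on, with a placeholder smoother standing in for a learned denoiser; ODER would additionally subsample the measurement operator at each step.

```python
# Hedged sketch of a RED fixed-point update for a linear inverse problem
# y = A x + noise. The denoiser D below is a simple moving-average
# smoother, a stand-in for a learned prior; step sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 32
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = rng.normal(size=n)
y = A @ x_true

def D(x):                      # placeholder denoiser (moving average)
    return np.convolve(x, np.ones(5) / 5, mode="same")

x, gamma, tau = np.zeros(n), 0.1, 0.1
for _ in range(200):           # fixed-point iteration of the RED operator
    grad_data = A.T @ (A @ x - y)
    x = x - gamma * (grad_data + tau * (x - D(x)))
print(np.linalg.norm(A @ x - y))   # data residual after convergence
```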
    Scalable and Low-Latency Federated Learning with Cooperative Mobile Edge Networking. (arXiv:2205.13054v1 [cs.DC])
    Federated learning (FL) enables collaborative model training without centralizing data. However, the traditional FL framework is cloud-based and suffers from high communication latency. On the other hand, the edge-based FL framework, which relies on an edge server co-located with an access point for model aggregation, has low communication latency but suffers from degraded model accuracy due to the limited coverage of the edge server. In light of high-accuracy but high-latency cloud-based FL and low-latency but low-accuracy edge-based FL, this paper proposes a new FL framework based on cooperative mobile edge networking, called cooperative federated edge learning (CFEL), to enable both high-accuracy and low-latency distributed intelligence at mobile edge networks. Considering the unique two-tier network architecture of CFEL, a novel federated optimization method dubbed cooperative edge-based federated averaging (CE-FedAvg) is further developed, wherein each edge server both coordinates collaborative model training among the devices within its own coverage and cooperates with other edge servers to learn a shared global model through decentralized consensus. Experimental results based on benchmark datasets show that CFEL can significantly accelerate convergence and reduce the training time needed to achieve a target model accuracy compared with prior FL frameworks.
    Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality. (arXiv:2205.13521v1 [cs.AI])
    Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.
    Forest Fire Clustering for Single-cell Sequencing with Iterative Label Propagation and Parallelized Monte Carlo Simulation. (arXiv:2103.11802v4 [cs.LG] UPDATED)
    In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis.
    Comparison of Traditional and Hybrid Time Series Models for Forecasting COVID-19 Cases. (arXiv:2105.03266v2 [cs.SI] UPDATED)
    Time series forecasting methods play a critical role in estimating the spread of an epidemic. The coronavirus outbreak of December 2019 has already infected millions all over the world and continues to spread. Just when the curve of the outbreak had started to flatten, many countries again started to witness a rise in cases, now referred to as the 2nd wave of the pandemic. A thorough analysis of time-series forecasting models is therefore required to equip state authorities and health officials with immediate strategies for future times. The aims of this study are three-fold: (a) to model the overall trend of the spread; (b) to generate a short-term forecast of 10 days in countries with the highest incidence of confirmed cases (USA, India and Brazil); (c) to quantitatively determine the algorithm that is best suited for precise modelling of the linear and non-linear features of the time series. The comparison of forecasting models for the total cumulative cases of each country is carried out by comparing the reported data and the predicted value, and then ranking the algorithms (Prophet, Holt-Winters, LSTM, ARIMA, and ARIMA-NARNN) based on their RMSE, MAE and MAPE values. The hybrid combination of ARIMA and NARNN (Nonlinear Auto-Regression Neural Network) gave the best result among the selected models, with an RMSE almost 35.3% lower than that of one of the most prevalent methods of time-series prediction (ARIMA). The results demonstrate the efficacy of the hybrid ARIMA-NARNN model over other forecasting methods such as Prophet, Holt-Winters, LSTM, and the ARIMA model in encapsulating the linear as well as non-linear patterns of epidemic datasets.
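    A minimal sketch of the general ARIMA-plus-neural-network hybrid pattern (ARIMA for the linear component, a small autoregressive network fit on the residuals) follows; the study's NARNN configuration, orders, and data differ, and the series below is synthetic.

```python
# Hedged sketch of the classic ARIMA + neural-network hybrid: ARIMA models
# the linear part, an MLP models the nonlinear residuals. Orders and the
# synthetic series are illustrative, not the paper's setup.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(200)
series = 0.5 * t + 10 * np.sin(t / 7) + rng.normal(scale=2, size=t.size)

arima = ARIMA(series, order=(2, 1, 2)).fit()
resid = arima.resid                        # nonlinear leftovers for the NN

p = 5                                      # residual lags fed to the network
Xr = np.column_stack([resid[i:len(resid) - p + i] for i in range(p)])
yr = resid[p:]
nn = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                  random_state=0).fit(Xr, yr)

# One-step-ahead hybrid forecast = ARIMA forecast + NN residual correction.
hybrid = arima.forecast(1)[0] + nn.predict(resid[-p:].reshape(1, -1))[0]
print(hybrid)
```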
    Kernel Ridgeless Regression is Inconsistent for Low Dimensions. (arXiv:2205.13525v1 [cs.LG])
    We show that kernel interpolation for a large class of shift-invariant kernels is inconsistent in fixed dimension, even with bandwidth adaptive to the training set.
    Continual evaluation for lifelong learning: Identifying the stability gap. (arXiv:2205.13452v1 [cs.LG])
    Introducing a time dependency in the data generating distribution has proven difficult for gradient-based training of neural networks, as the greedy updates result in catastrophic forgetting of previous timesteps. Continual learning aims to overcome this greedy optimization and enable continuous accumulation of knowledge over time. The data stream is typically divided into locally stationary distributions, called tasks, allowing task-based evaluation on held-out data from the training tasks. Contemporary evaluation protocols and metrics in continual learning are task-based and quantify the trade-off between stability and plasticity only at task transitions. However, our empirical evidence suggests that significant, temporary forgetting can occur between task transitions, which remains unidentified in task-based evaluation. Therefore, we propose a framework for continual evaluation that establishes per-iteration evaluation, and we define a new set of metrics that enable identifying the worst-case performance of the learner over its lifetime. Performing continual evaluation, we empirically identify that replay suffers from a stability gap: upon learning a new task, there is a substantial but transient decrease in performance on past tasks. Further conceptual and empirical analysis suggests that not only replay-based, but also regularization-based continual learning methods are prone to the stability gap.
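    A minimal sketch of per-iteration continual evaluation with user-supplied training and evaluation callbacks: tracking the per-task minimum accuracy over the whole run exposes the transient drops that end-of-task evaluation misses. The function names are illustrative, not the paper's API.

```python
# Hedged sketch of per-iteration continual evaluation: after every
# training step, accuracy on every past task is measured, and the
# per-task minimum over the run acts as a worst-case (stability-gap) metric.
def continual_eval(model, train_stream, eval_sets, train_step, evaluate):
    """train_step/evaluate are user-supplied callables; eval_sets maps
    task name -> held-out data. Returns worst-case accuracies and history."""
    worst = {name: 1.0 for name in eval_sets}       # min accuracy per task
    history = []
    for batch in train_stream:                      # iterates across tasks
        train_step(model, batch)
        accs = {name: evaluate(model, data) for name, data in eval_sets.items()}
        for name, acc in accs.items():
            worst[name] = min(worst[name], acc)     # stability-gap indicator
        history.append(accs)
    return worst, history
```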
    Are Transformers Effective for Time Series Forecasting?. (arXiv:2205.13504v1 [cs.AI])
    Recently, there has been a surge of Transformer-based solutions for the time series forecasting (TSF) task, especially for the challenging long-term TSF problem. The Transformer architecture relies on self-attention mechanisms to effectively extract the semantic correlations between paired elements in a long sequence, a mechanism that is permutation-invariant and anti-ordering to some extent. However, in time series modeling, we aim to extract the temporal relations among an ordered set of continuous points. Consequently, whether Transformer-based techniques are the right solutions for long-term time series forecasting is an interesting problem to investigate, despite the performance improvements shown in these studies. In this work, we question the validity of Transformer-based TSF solutions. In their experiments, the compared (non-Transformer) baselines are mainly autoregressive forecasting solutions, which usually have poor long-term prediction capability due to inevitable error accumulation effects. In contrast, we use an embarrassingly simple architecture named DLinear that conducts direct multi-step (DMS) forecasting for comparison. DLinear decomposes the time series into a trend and a remainder series and employs two one-layer linear networks to model these two series for the forecasting task. Surprisingly, it outperforms existing complex Transformer-based models in most cases by a large margin. Therefore, we conclude that the relatively higher long-term forecasting accuracy of Transformer-based TSF solutions shown in existing works has little to do with the temporal relation extraction capabilities of the Transformer architecture. Instead, it is mainly due to the non-autoregressive DMS forecasting strategy used in them. We hope this study also advocates revisiting the validity of Transformer-based solutions for other time series analysis tasks (e.g., anomaly detection) in the future.
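    DLinear is simple enough to sketch directly from the description above: a moving average extracts the trend, and one linear layer per component maps the input window to the forecast horizon. The kernel size and shapes below are illustrative.

```python
# Hedged PyTorch sketch of DLinear as described in the abstract: a moving
# average splits the window into trend and remainder, and one linear layer
# per component maps the input window directly to the forecast horizon.
import torch
import torch.nn as nn

class DLinear(nn.Module):
    def __init__(self, seq_len: int, pred_len: int, kernel: int = 25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                count_include_pad=False)
        self.trend = nn.Linear(seq_len, pred_len)
        self.remainder = nn.Linear(seq_len, pred_len)

    def forward(self, x):                             # x: (batch, seq_len)
        trend = self.avg(x.unsqueeze(1)).squeeze(1)   # smooth trend part
        season = x - trend                            # remainder part
        return self.trend(trend) + self.remainder(season)

model = DLinear(seq_len=96, pred_len=24)
print(model(torch.randn(8, 96)).shape)                # torch.Size([8, 24])
```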
    Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency. (arXiv:2205.13476v1 [cs.LG])
    Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To our best knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
    Your Transformer May Not be as Powerful as You Expect. (arXiv:2205.13401v1 [cs.LG])
    Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. One may naturally assume the answer is in the affirmative -- RPE-based Transformers are universal function approximators. However, we present a negative result by showing that there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason is that most RPEs are placed in the softmax attention, which always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With this theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications.
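    One instantiation consistent with this description is to gate the softmax attention matrix elementwise with a learned Toeplitz (relative-position) matrix, so that rows are no longer forced to be stochastic. The sketch below is hedged: the paper's exact parameterization may differ.

```python
# Hedged sketch of a URPE-style attention layer: standard softmax attention
# modulated elementwise by a learned relative-position (Toeplitz) gate.
import torch
import torch.nn as nn

class URPEAttention(nn.Module):
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.rel = nn.Parameter(torch.ones(2 * max_len - 1))  # c_{i-j}
        self.max_len = max_len

    def forward(self, x):                              # x: (batch, n, dim)
        n, d = x.shape[1], x.shape[2]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
        idx = torch.arange(n)
        C = self.rel[idx[:, None] - idx[None, :] + self.max_len - 1]
        return (attn * C) @ v                          # gated attention

layer = URPEAttention(dim=16, max_len=32)
print(layer(torch.randn(2, 10, 16)).shape)             # torch.Size([2, 10, 16])
```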
    Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback. (arXiv:2205.13451v1 [cs.LG])
    We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions change over time and are adversarially chosen, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback). While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about Follow-the-Perturbed-Leader (FTPL) methods, which are usually computationally more efficient and easier to implement, since they only require solving an offline planning problem. Motivated by this, we take a closer look at FTPL for learning AMDPs, starting from the standard episodic finite-horizon setting. We find some unique and intriguing difficulties in the analysis and propose a workaround to eventually show that FTPL is also able to achieve near-optimal regret bounds in this case. More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using FTPL, we also develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions. Our algorithm is efficient assuming access to an offline planning oracle, while even for the easier full-information setting, the only existing algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.
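    The FTPL principle itself is easy to sketch in a toy full-information setting: perturb the cumulative losses with fresh noise and play the resulting leader. In the AMDP setting the argmin over arms is replaced by a call to the offline planning oracle.

```python
# Minimal sketch of Follow-the-Perturbed-Leader in a toy full-information
# setting (for AMDPs the argmin becomes an offline planning call).
import numpy as np

rng = np.random.default_rng(0)
n_actions, T, eta = 5, 1000, 10.0
cum_loss = np.zeros(n_actions)
total = 0.0
for t in range(T):
    losses = rng.uniform(size=n_actions)          # adversary's losses
    perturb = rng.exponential(scale=eta, size=n_actions)
    action = int(np.argmin(cum_loss - perturb))   # perturbed leader
    total += losses[action]
    cum_loss += losses                            # full information here
print(total / T, cum_loss.min() / T)              # FTPL vs best fixed action
```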
    The Neuro-Symbolic Brain. (arXiv:2205.13440v1 [cs.NE])
    Neural networks promote a distributed representation with no clear place for symbols. Despite this, we propose that symbols are manufactured simply by training a sparse random noise as a self-sustaining attractor in a feedback spiking neural network. This way, we can generate many of what we shall call prime attractors, and the networks that support them are like registers holding a symbolic value; we call them registers. Like symbols, prime attractors are atomic and devoid of any internal structure. Moreover, the winner-take-all mechanism naturally implemented by spiking neurons enables registers to recover a prime attractor within a noisy signal. Using this faculty, when considering two connected registers, an input one and an output one, it is possible to bind, in one shot and using a Hebbian rule, the attractor active on the output to the attractor active on the input. Thus, whenever an attractor is active on the input, it induces its bound attractor on the output; even though the signal gets blurrier with more bindings, the winner-take-all filtering faculty can recover the bound prime attractor. However, the capacity is still limited. It is also possible to unbind in one shot, restoring the capacity taken by that binding. This mechanism serves as a basis for working memory, turning prime attractors into variables. Also, we use a random second-order network to amalgamate the prime attractors held by two registers, binding the prime attractor held by a third register to them in one shot, de facto implementing a hash table. Furthermore, we introduce the register switch box, composed of registers, to move the content of one register to another. Then, we use spiking neurons to build a toy symbolic computer based on the above. The techniques used suggest ways to design extrapolating, reusable, sample-efficient deep learning networks at the cost of structural priors.
    Principled Knowledge Extrapolation with GANs. (arXiv:2205.13444v1 [cs.LG])
    Humans extrapolate well, generalize daily knowledge to unseen scenarios, and raise and answer counterfactual questions. To imitate this ability via generative models, previous works have extensively studied explicitly encoding Structural Causal Models (SCMs) into the architectures of generator networks. This methodology, however, limits the flexibility of the generator, as it must be carefully crafted to follow the causal graph, and demands a ground-truth SCM with a strong ignorability assumption as a prior, which is nontrivial in many real scenarios. Thus, many current causal GAN methods fail to generate high-fidelity counterfactual results as they cannot easily leverage state-of-the-art generative models. In this paper, we propose to study counterfactual synthesis from a new perspective of knowledge extrapolation, where a given knowledge dimension of the data distribution is extrapolated, but the remaining knowledge is kept indistinguishable from the original distribution. We show that an adversarial game with a closed-form discriminator can be used to address the knowledge extrapolation problem, and that a novel principal knowledge descent method can efficiently estimate the extrapolated distribution through the adversarial game. Our method enjoys both elegant theoretical guarantees and superior performance in many scenarios.
    A framework for overparameterized learning. (arXiv:2205.13507v1 [cs.LG])
    An explanation for the success of deep neural networks is a central question in theoretical machine learning. According to classical statistical learning, the overparameterized nature of such models should imply a failure to generalize. Many argue that good empirical performance is due to the implicit regularization of first order optimization methods. In particular, the Polyak-{\L}ojasiewicz condition leads to gradient descent finding a global optimum that is close to initialization. In this work, we propose a framework consisting of a prototype learning problem, which is general enough to cover many popular problems and even the cases of infinitely wide neural networks and infinite data. We then perform an analysis from the perspective of the Polyak-{\L}ojasiewicz condition. We obtain theoretical results of independent interest, concerning gradient descent on a composition $(f \circ F): G \to \mathbb{R}$ of functions $F: G \to H$ and $f: H \to \mathbb{R}$ with $G, H$ being Hilbert spaces. Building on these results, we determine the properties that have to be satisfied by the components of the prototype problem for gradient descent to find a global optimum that is close to initialization. We then demonstrate that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem. Finally, we lay out a number of directions for future research.
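    For orientation, the Polyak-Lojasiewicz condition and the standard linear-rate consequence for gradient descent (textbook facts, stated here independently of the paper's prototype problem, which works with a composition $f \circ F$ on Hilbert spaces):

```latex
% PL condition for a differentiable f with L-Lipschitz gradient and mu > 0:
\[
  \tfrac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr)
  \qquad \text{for all } x,
\]
% which yields linear convergence for gradient descent with step size 1/L:
\[
  f(x_{t+1}) - f^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)\bigl(f(x_t) - f^{*}\bigr).
\]
```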
    A Rotated Hyperbolic Wrapped Normal Distribution for Hierarchical Representation Learning. (arXiv:2205.13371v1 [cs.LG])
    We present a rotated hyperbolic wrapped normal distribution (RoWN), a simple yet effective alteration of the hyperbolic wrapped normal distribution (HWN). The HWN expands the domain of probabilistic modeling from Euclidean to hyperbolic space, where a tree can, in theory, be embedded with arbitrarily low distortion. In this work, we analyze the geometric properties of the diagonal HWN, a standard choice of distribution in probabilistic modeling. The analysis shows that the distribution is inappropriate for representing data points at the same hierarchy level through their angular distance with the same norm in the Poincar\'e disk model. We then empirically verify the presence of these limitations of the HWN and show how RoWN, the newly proposed distribution, can alleviate them on various hierarchical datasets, including a noisy synthetic binary tree, WordNet, and Atari 2600 Breakout.
    Multi-fidelity power flow solver. (arXiv:2205.13362v1 [cs.LG])
    We propose a multi-fidelity neural network (MFNN) tailored for rapid high-dimensional grid power flow simulations and contingency analysis with scarce high-fidelity contingency data. The proposed model comprises two networks -- the first trained on the DC approximation as low-fidelity data and coupled to a high-fidelity neural net trained on both low- and high-fidelity power flow data. Each network features a latent module which parametrizes the model by a discrete grid topology vector for generalization (e.g., $n$ power lines with $k$ disconnections or contingencies, if any), and the targeted high-fidelity output is a weighted sum of linear and nonlinear functions. We tested the model on 14- and 118-bus test cases and evaluated its performance based on the $n-k$ power flow prediction accuracy with respect to imbalanced contingency data and the high-to-low-fidelity sample ratio. The results demonstrate the MFNN's potential and its limits, with power flow solutions up to two orders of magnitude faster and more accurate than the DC approximation.
    Constrained Reinforcement Learning for Short Video Recommendation. (arXiv:2205.13248v1 [cs.LG])
    The wide popularity of short videos on social media poses new opportunities and challenges for optimizing recommender systems on video-sharing platforms. Users provide complex and multi-faceted responses to recommendations, including watch time and various types of interactions with videos. As a result, established recommendation algorithms that target a single objective are not adequate to meet this new demand of optimizing comprehensive user experiences. In this paper, we formulate the problem of short video recommendation as a constrained Markov Decision Process (MDP), where platforms want to optimize the main goal of user watch time in the long term, under the constraint of accommodating the auxiliary responses of user interactions such as sharing/downloading videos. To solve the constrained MDP, we propose a two-stage reinforcement learning approach based on the actor-critic framework. At stage one, we learn individual policies to optimize each auxiliary response. At stage two, we learn a policy to (i) optimize the main response and (ii) stay close to the policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive simulations, we demonstrate the effectiveness of our approach over alternatives in both optimizing the main goal and balancing the others. We further show the advantage of our approach in live experiments of short video recommendation, where it significantly outperforms other baselines in terms of watch time and interactions from video views. Our approach has been fully launched in the production system to optimize user experiences on the platform.
    Learning to Accelerate by the Methods of Step-size Planning. (arXiv:2204.01705v4 [cs.LG] UPDATED)
    Gradient descent is slow to converge for ill-conditioned problems and non-convex problems. An important technique for acceleration is step-size adaptation. The first part of this paper contains a detailed review of step-size adaptation methods, including Polyak step-size, L4, LossGrad, Adam, IDBD, and Hypergradient descent, and the relation of step-size adaptation to meta-gradient methods. In the second part of this paper, we propose a new class of methods for accelerating gradient descent that are distinct from existing techniques. The new methods, which we call {\em step-size planning}, use the {\em update experience} to learn an improved way of updating the parameters. The methods organize the experience into $K$ steps away from each other to facilitate planning. From the past experience, our planning algorithm, Csawg, learns a step-size model, which is a form of multi-step machine that predicts future updates. We extend Csawg to apply step-size planning over multiple steps, which leads to further speedup. We discuss and highlight the projection power of the diagonal-matrix step-size for future large-scale applications. We show that, for a convex problem, our methods can surpass the convergence rate of Nesterov's accelerated gradient, $1 - \sqrt{\mu/L}$, where $\mu, L$ are the strong convexity factor of the loss function $F$ and the Lipschitz constant of $F'$, which is the theoretical limit for the convergence rate of first-order methods. On the well-known non-convex Rosenbrock function, our planning methods achieve zero error in fewer than 500 gradient evaluations, while gradient descent takes about 10000 gradient evaluations to reach $10^{-3}$ accuracy. We discuss the connection of step-size planning to planning in reinforcement learning, in particular, Dyna architectures. (This is a shorter abstract than in the paper because of the length requirement.)
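    As a flavor of the reviewed step-size adaptation baselines, here is a minimal sketch of hypergradient descent, which adapts the step size using the inner product of consecutive gradients. This is one of the reviewed baselines, not the proposed Csawg planner; the problem and hyperparameters are illustrative.

```python
# Hedged sketch of hypergradient descent (a reviewed baseline): the step
# size alpha is itself updated by a "hyper" step on the loss w.r.t. alpha.
import numpy as np

def f_grad(x):                         # ill-conditioned quadratic
    H = np.diag([1.0, 100.0])
    return H @ x

x = np.array([1.0, 1.0])
alpha, beta = 1e-3, 1e-7               # initial step size, hyper step size
g_prev = f_grad(x)
for _ in range(500):
    g = f_grad(x)
    alpha = alpha + beta * (g @ g_prev)   # hypergradient step-size update
    x = x - alpha * g
    g_prev = g
print(alpha, np.linalg.norm(x))
```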
    DT+GNN: A Fully Explainable Graph Neural Network using Decision Trees. (arXiv:2205.13234v1 [cs.LG])
    We propose the fully explainable Decision Tree Graph Neural Network (DT+GNN) architecture. In contrast to existing black-box GNNs and post-hoc explanation methods, the reasoning of DT+GNN can be inspected at every step. To achieve this, we first construct a differentiable GNN layer, which uses a categorical state space for nodes and messages. This allows us to convert the trained MLPs in the GNN into decision trees. These trees are pruned using our newly proposed method to ensure they are small and easy to interpret. We can also use the decision trees to compute traditional explanations. We demonstrate on both real-world datasets and synthetic GNN explainability benchmarks that this architecture works as well as traditional GNNs. Furthermore, we leverage the explainability of DT+GNNs to find interesting insights into many of these datasets, with some surprising results. We also provide an interactive web tool to inspect DT+GNN's decision making.
    Friends to Help: Saving Federated Learning from Client Dropout. (arXiv:2205.13222v1 [cs.LG])
    Federated learning (FL) is an outstanding distributed machine learning framework due to its benefits on data privacy and communication efficiency. Since full client participation in many cases is infeasible due to constrained resources, partial participation FL algorithms have been investigated that proactively select/sample a subset of clients, aiming to achieve learning performance close to the full participation case. This paper studies a passive partial client participation scenario that is much less well understood, where partial participation is a result of external events, namely client dropout, rather than a decision of the FL algorithm. We cast FL with client dropout as a special case of a larger class of FL problems where clients can submit substitute (possibly inaccurate) local model updates. Based on our convergence analysis, we develop a new algorithm FL-FDMS that discovers friends of clients (i.e., clients whose data distributions are similar) on-the-fly and uses friends' local updates as substitutes for the dropout clients, thereby reducing the substitution error and improving the convergence performance. A complexity reduction mechanism is also incorporated into FL-FDMS, making it both theoretically sound and practically useful. Experiments on MNIST and CIFAR-10 confirmed the superior performance of FL-FDMS in handling client dropout in FL.
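    A minimal sketch of the friend-substitution idea, using cosine similarity between the latest known client updates as a stand-in for data-distribution similarity; the paper's discovery rule, weighting, and complexity reduction mechanism may differ.

```python
# Hedged sketch: when a client drops out, substitute the update of its
# most similar ("friend") client before FedAvg-style aggregation.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def aggregate_with_friends(updates, dropped):
    """updates: dict client_id -> last known update; dropped: set of ids."""
    alive = {c: u for c, u in updates.items() if c not in dropped}
    used = list(alive.values())
    for c in dropped:
        # Friend = alive client whose last update best matches c's.
        friend = max(alive, key=lambda a: cosine(updates[c], updates[a]))
        used.append(updates[friend])          # substitute friend's update
    return np.mean(used, axis=0)              # average real + substituted

rng = np.random.default_rng(0)
updates = {c: rng.normal(size=4) for c in range(5)}
print(aggregate_with_friends(updates, dropped={2}))
```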
    Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. (arXiv:2205.13209v1 [cs.LG])
    Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over conventional CO solvers, as DRL-NCO is capable of learning CO solvers without supervised labels obtained from a verified solver. This paper presents a novel training scheme, Sym-NCO, that achieves significant performance improvements over existing DRL-NCO methods. Sym-NCO is a regularizer-based training scheme that leverages universal symmetricities in various CO problems and solutions. Imposing symmetricities such as rotational and reflectional invariance can greatly improve the generalization capability of DRL-NCO, as symmetricities are invariant features shared by certain CO tasks. Our experimental results verify that Sym-NCO greatly improves the performance of DRL-NCO methods on four CO tasks, including the traveling salesman problem (TSP), the capacitated vehicle routing problem (CVRP), the prize collecting TSP (PCTSP), and the orienteering problem (OP), without employing problem-specific techniques. Remarkably, Sym-NCO outperformed not only the existing DRL-NCO methods but also a competitive conventional solver, iterative local search (ILS), on PCTSP while running 240 times faster.
    Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks. (arXiv:2205.13283v1 [cs.LG])
    Unraveling the general structure underlying the loss landscapes of deep neural networks (DNNs) is important for the theoretical study of deep learning. Inspired by the embedding principle of the DNN loss landscape, we prove in this work an embedding principle in depth: the loss landscape of an NN "contains" all critical points of the loss landscapes of shallower NNs. Specifically, we propose a critical lifting operator by which any critical point of a shallower network can be lifted to a critical manifold of the target network while preserving the outputs. Through lifting, a local minimum of an NN can become a strict saddle point of a deeper NN, which can be easily escaped by first-order methods. The embedding principle in depth reveals a large family of critical points at which layer linearization happens, i.e., the computation of certain layers is effectively linear for the training inputs. We empirically demonstrate that, by suppressing layer linearization, batch normalization helps avoid the lifted critical manifolds, resulting in a faster decay of the loss. We also demonstrate that increasing the training data reduces the lifted critical manifolds and thus could accelerate training. Overall, the embedding principle in depth well complements the embedding principle (in width), resulting in a complete characterization of the hierarchical structure of critical points/manifolds of a DNN loss landscape.
    Aggregating Gradients in Encoded Domain for Federated Learning. (arXiv:2205.13216v1 [cs.CR])
    Malicious attackers and an honest-but-curious server can steal private client data from uploaded gradients in federated learning. Although current protection methods (e.g., additive homomorphic cryptosystems) can guarantee the security of the federated learning system, they bring additional computation and communication costs. To mitigate the cost, we propose the \texttt{FedAGE} framework, which enables the server to aggregate gradients in an encoded domain without accessing the raw gradients of any single client. Thus, \texttt{FedAGE} can prevent the curious server from gradient stealing while maintaining the same prediction performance without additional communication costs. Furthermore, we theoretically prove that the proposed encoding-decoding framework is a Gaussian mechanism for differential privacy. Finally, we evaluate \texttt{FedAGE} under several federated settings, and the results demonstrate the efficacy of the proposed framework.
    Federated Non-negative Matrix Factorization for Short Texts Topic Modeling with Mutual Information. (arXiv:2205.13300v1 [cs.CL])
    Non-negative matrix factorization (NMF) based topic modeling is widely used in natural language processing (NLP) to uncover hidden topics in short text documents. Usually, training a high-quality topic model requires a large amount of textual data. In many real-world scenarios, customer textual data is private and sensitive, precluding its upload to data centers. This paper proposes a Federated NMF (FedNMF) framework, which allows multiple clients to collaboratively train a high-quality NMF-based topic model with locally stored data. However, standard federated learning significantly undermines the performance of topic models in downstream tasks (e.g., text classification) when the data distribution over clients is heterogeneous. To alleviate this issue, we further propose FedNMF+MI, which simultaneously maximizes the mutual information (MI) between the count features of local texts and their topic weight vectors to mitigate the performance degradation. Experimental results show that our FedNMF+MI methods outperform Federated Latent Dirichlet Allocation (FedLDA) and FedNMF without MI by a significant margin on short texts, in terms of both coherence score and classification F1 score.
    Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost. (arXiv:2205.13170v1 [cs.LG])
    We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features. For this problem, we propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that over $T$ rounds ($NT$ actions in total) the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is at most $\tilde{\mathcal{O}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. Remarkably, we derive an information-theoretic lower bound on the communication cost of the distributed contextual linear bandit problem with stochastic contexts, and prove that our proposed algorithm is nearly minimax optimal in terms of \emph{both regret and communication cost}. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their \emph{immediate neighbors} through a carefully designed consensus procedure.
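    For orientation, a sketch of the single-agent LinUCB building block that DisBE-LUCB distributes and batches; the elimination, batching, and communication logic of the paper are omitted, and the environment below is synthetic.

```python
# Hedged sketch of the LinUCB building block: ridge-regression estimate of
# the unknown parameter plus an optimistic exploration bonus per action.
import numpy as np

d, T, lam, beta = 5, 2000, 1.0, 1.0
rng = np.random.default_rng(0)
theta_star = rng.normal(size=d)                     # unknown true parameter
A, b = lam * np.eye(d), np.zeros(d)

for t in range(T):
    contexts = rng.normal(size=(10, d))             # 10 candidate actions
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    bonus = np.sqrt(np.einsum("ad,de,ae->a", contexts, A_inv, contexts))
    x = contexts[np.argmax(contexts @ theta_hat + beta * bonus)]
    r = x @ theta_star + 0.1 * rng.normal()         # noisy linear reward
    A += np.outer(x, x)                             # ridge statistics update
    b += r * x

print(np.linalg.norm(np.linalg.inv(A) @ b - theta_star))  # estimation error
```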
    Entropy Maximization with Depth: A Variational Principle for Random Neural Networks. (arXiv:2205.13076v1 [cs.LG])
    To understand the essential role of depth in neural networks, we investigate a variational principle for depth: Does increasing depth perform an implicit optimization for the representations in neural networks? We prove that random neural networks equipped with batch normalization maximize the differential entropy of representations with depth up to constant factors, assuming that the representations are contractive. Thus, representations inherently obey the \textit{principle of maximum entropy} at initialization, in the absence of information about the learning task. Our variational formulation for neural representations characterizes the interplay between representation entropy and architectural components, including depth, width, and non-linear activations, thereby potentially inspiring the design of neural architectures.
    Fast Vision Transformers with HiLo Attention. (arXiv:2205.13213v1 [cs.CV])
    Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which, however, has a clear gap with direct metrics such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. In particular, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details while low frequencies focus on global structures, whereas a multi-head self-attention layer neglects the characteristics of different frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and the other group models the global relationship between the average-pooled low-frequency keys from each window and each query position in the input feature map. Benefiting from the efficient design of both groups, we show that HiLo is superior to the existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/zip-group/LITv2.
    More Recent Advances in (Hyper)Graph Partitioning. (arXiv:2205.13202v1 [cs.DS])
    In recent years, significant advances have been made in the design and evaluation of balanced (hyper)graph partitioning algorithms. We survey trends of the last decade in practical algorithms for balanced (hyper)graph partitioning together with future research directions. Our work serves as an update to a previous survey on the topic. In particular, the survey extends the previous survey by also covering hypergraph partitioning and streaming algorithms, and has an additional focus on parallel algorithms.
    Unsupervised Learning From Incomplete Measurements for Inverse Problems. (arXiv:2201.12151v3 [stat.ML] UPDATED)
    In many real-world inverse problems, only incomplete measurement data are available for training, which can pose a problem for learning a reconstruction function. Indeed, unsupervised learning using a fixed incomplete measurement process is impossible in general, as there is no information in the nullspace of the measurement operator. This limitation can be overcome by using measurements from multiple operators. While this idea has been successfully applied in various applications, a precise characterization of the conditions for learning is still lacking. In this paper, we fill this gap by presenting necessary and sufficient conditions for learning the underlying signal model needed for reconstruction, which indicate the interplay between the number of distinct measurement operators, the number of measurements per operator, the dimension of the model and the dimension of the signals. Furthermore, we propose a novel and conceptually simple unsupervised learning loss which only requires access to incomplete measurement data and achieves a performance on par with supervised learning when the sufficient condition is verified. We validate our theoretical bounds and demonstrate the advantages of the proposed unsupervised loss compared to previous methods via a series of experiments on various imaging inverse problems, such as accelerated magnetic resonance imaging, compressed sensing and image inpainting.
    Forecasting Patient Demand at Urgent Care Clinics using Machine Learning. (arXiv:2205.13067v1 [cs.LG])
    Urgent care clinics and emergency departments around the world periodically suffer from extended wait times beyond patient expectations due to inadequate staffing levels. These delays have been linked with adverse clinical outcomes. Previous research into forecasting demand in this domain has mostly used a collection of statistical techniques, with machine learning approaches only now beginning to emerge in recent literature. The forecasting problem for this domain is difficult, and it has been further complicated by the COVID-19 pandemic, which disrupted typical demand patterns. This study explores the ability of machine learning methods to generate accurate forecasts of patient presentations at two large urgent care clinics located in Auckland, New Zealand. A number of machine learning algorithms were explored in order to determine the most effective technique for this problem domain, with the task of making forecasts of daily patient demand three months in advance. The study also performed an in-depth analysis of model behaviour with respect to which features are most effective at predicting demand and which features are capable of adapting to the volatility caused by the COVID-19 pandemic lockdowns. The results showed that ensemble-based methods delivered the most accurate and consistent solutions on average, generating improvements in the range of 23%-27% over the existing in-house methods for estimating daily demand.
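    As a flavor of the ensemble approach, a minimal sketch of a gradient-boosted daily-demand forecaster with simple calendar features follows; the data and feature set here are synthetic and illustrative, not the study's clinic data or actual features.

```python
# Hedged sketch: gradient boosting on calendar features for daily demand,
# with a ~3-month holdout. All data below is synthetic and illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
dates = pd.date_range("2019-01-01", periods=730, freq="D")
demand = (120 + 25 * (dates.dayofweek >= 5)           # weekend bump
          + 10 * np.sin(2 * np.pi * dates.dayofyear / 365)
          + rng.normal(scale=8, size=len(dates)))

X = pd.DataFrame({"dow": dates.dayofweek, "doy": dates.dayofyear,
                  "month": dates.month})
split = len(X) - 90                                   # hold out ~3 months
model = GradientBoostingRegressor().fit(X[:split], demand[:split])
pred = model.predict(X[split:])
print(np.mean(np.abs(pred - demand[split:])))         # MAE on the holdout
```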
    Tight Lower Bounds on Worst-Case Guarantees for Zero-Shot Learning with Attributes. (arXiv:2205.13068v1 [cs.LG])
    We develop a rigorous mathematical analysis of zero-shot learning with attributes. In this setting, the goal is to label novel classes with no training data, only detectors for attributes and a description of how those attributes are correlated with the target classes, called the class-attribute matrix. We develop the first non-trivial lower bound on the worst-case error of the best map from attributes to classes for this setting, even with perfect attribute detectors. The lower bound characterizes the theoretical intrinsic difficulty of the zero-shot problem based on the available information -- the class-attribute matrix -- and the bound is practically computable from it. Our lower bound is tight, as we show that we can always find a randomized map from attributes to classes whose expected error is upper bounded by the value of the lower bound. We show that our analysis can be predictive of how standard zero-shot methods behave in practice, including which classes will likely be confused with others.
    Grammar Detection for Sentiment Analysis through Improved Viterbi Algorithm. (arXiv:2205.13148v1 [cs.CL])
    Grammar detection, also referred to as parts-of-speech tagging of raw text, is an underlying building block of various natural language processing pipelines such as named entity recognition, question answering, and sentiment analysis. In short, for a given sentence, parts-of-speech tagging is the task of specifying and tagging each word of the sentence as a noun, verb, adjective, adverb, and so on. Sentiment analysis is a procedure used to determine whether a given sentence's emotional tone is neutral, positive or negative. To assign polarity scores to topics or entities within a phrase, approaches from in-text analysis and analytics, machine learning and natural language processing are incorporated. Sentiment analysis using a POS tagger helps us get a summary of public opinion on a specific topic. For this, we use the Viterbi algorithm, a Hidden Markov Model, and a constraint-based Viterbi algorithm for POS tagging. By comparing the accuracies, we select the most accurate model for sentiment analysis to determine the character of the sentence.
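    For reference, a minimal sketch of the standard Viterbi decoder for an HMM POS tagger; the toy transition and emission probabilities below are illustrative only.

```python
# Minimal Viterbi decoding for a two-tag HMM, in log space.
import numpy as np

tags = ["NOUN", "VERB"]
start = np.log([0.6, 0.4])
trans = np.log([[0.3, 0.7],          # P(next tag | current tag)
                [0.8, 0.2]])
emit = {"dogs": np.log([0.9, 0.1]),  # P(word | tag)
        "bark": np.log([0.2, 0.8])}

def viterbi(words):
    V = [start + emit[words[0]]]                       # log-prob lattice
    back = []
    for w in words[1:]:
        scores = V[-1][:, None] + trans + emit[w][None, :]
        back.append(scores.argmax(axis=0))             # best predecessor
        V.append(scores.max(axis=0))
    path = [int(V[-1].argmax())]
    for bp in reversed(back):                          # backtrack
        path.append(int(bp[path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(["dogs", "bark"]))                       # ['NOUN', 'VERB']
```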
    Learning to segment with limited annotations: Self-supervised pretraining with regression and contrastive loss in MRI. (arXiv:2205.13109v1 [cs.CV])
    Obtaining manual annotations for large datasets for supervised training of deep learning (DL) models is challenging. The availability of large unlabeled datasets compared to labeled ones motivates the use of self-supervised pretraining to initialize DL models for subsequent segmentation tasks. In this work, we consider two pre-training approaches for driving a DL model to learn different representations using: a) a regression loss that exploits spatial dependencies within an image and b) a contrastive loss that exploits semantic similarity between pairs of images. The effect of the pretraining techniques is evaluated in two downstream segmentation applications using Magnetic Resonance (MR) images: a) liver segmentation in abdominal T2-weighted MR images and b) prostate segmentation in T2-weighted MR images of the prostate. We observed that DL models pretrained using self-supervision can be finetuned for comparable performance with fewer labeled datasets. Additionally, we also observed that initializing the DL model using contrastive-loss-based pretraining performed better than using the regression loss.
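    A minimal sketch of a contrastive (InfoNCE-style) pretraining loss of the kind described above; the study's exact loss formulation and pairing strategy for MR images may differ.

```python
# Hedged sketch of an InfoNCE-style contrastive loss: paired embeddings of
# the same image should be similar, all other pairs in the batch dissimilar.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1[i] and z2[i] are embeddings of two views of the same image."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # cosine similarity matrix
    targets = torch.arange(z1.size(0))        # positives on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(z1, z2))
```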
    Independent Asymmetric Embedding for Information Diffusion Prediction on Social Networks. (arXiv:2105.08291v6 [cs.LG] UPDATED)
    The prediction of information diffusion on social networks has great practical significance in marketing and public opinion control. It aims to predict the individuals who will potentially repost a message on the social network. One type of method is based on demographics, complex networks and other prior knowledge to establish an interpretable model that simulates and predicts the propagation process, while the other type is completely data-driven and maps the nodes to a latent space for propagation prediction. Existing latent space designs and embedding methods lack consideration of the interactions among users. In this paper, we propose an independent asymmetric embedding method that embeds each individual into one latent influence space and multiple latent susceptibility spaces. Based on the similarity between information diffusion and the heat diffusion phenomenon, the heat diffusion kernel is exploited in our model to establish the embedding rules. Furthermore, our method captures the co-occurrence regularity of user combinations in cascades to improve its computational effectiveness. The results of extensive experiments conducted on real-world datasets verify both the predictive accuracy and the cost-effectiveness of our approach.
    Deep-XFCT: Deep learning 3D-mineral liberation analysis with micro X-ray fluorescence and computed tomography. (arXiv:2205.13102v1 [cs.LG])
    The rapid development of X-ray micro-computed tomography (micro-CT) opens new opportunities for 3D analysis of particle and grain-size characterisation, determination of particle densities and shape factors, estimation of mineral associations and liberation and locking. Current practices in mineral liberation analysis are based on 2D representations leading to systematic errors in the extrapolation to volumetric properties. New quantitative methods based on tomographic data are therefore urgently required for characterisation of mineral deposits, mineral processing, characterisation of tailings, rock typing, stratigraphic refinement, reservoir characterisation for applications in the resource industry, environmental and material sciences. To date, no simple non-destructive method exists for 3D mineral liberation analysis. We present a new development based on combining micro-CT with micro-X-ray fluorescence (micro-XRF) using deep learning. We demonstrate successful semi-automated multi-modal analysis of a crystalline magmatic rock where the new technique overcomes the difficult task of differentiating feldspar from quartz in micro-CT data set. The approach is universal and can be extended to any multi-modal and multi-instrument analysis for further refinement. We conclude that the combination of micro-CT and micro-XRF already provides a new opportunity for robust 3D mineral liberation analysis in both field and laboratory applications.
    Contextual Pandora's Box. (arXiv:2205.13114v1 [cs.LG])
    Pandora's Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate priors are given for the values of all the alternatives, while recent work studies the online variant of Pandora's Box, where the priors are originally unknown. In this work, we extend Pandora's Box to the online setting while incorporating context. At every round, we are presented with a number of alternatives, each having a context, an exploration cost, and an unknown value drawn from an unknown prior distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably to the optimal algorithm that knows all the prior distributions exactly. Our algorithm works even in the bandit setting, where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to the reservation value of the corresponding distribution rather than to its mean.
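    For context, the classical offline policy that the contextual/online setting generalizes is Weitzman's index rule: compute each box's reservation value and open boxes in decreasing order of it, stopping once the best observed value exceeds the next reservation value. A sketch of the index computation on sampled priors:

```python
# Hedged sketch of Weitzman's reservation values for Pandora's Box: sigma
# solves E[(V - sigma)+] = cost, found here by bisection on prior samples.
import numpy as np

rng = np.random.default_rng(0)

def reservation_value(samples, cost):
    lo, hi = samples.min() - 1.0, samples.max()
    for _ in range(60):                       # bisection on a decreasing map
        mid = (lo + hi) / 2
        if np.mean(np.maximum(samples - mid, 0)) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

priors = [rng.uniform(0, 1, 10000), rng.uniform(0, 2, 10000)]  # box priors
costs = [0.05, 0.3]
sigmas = [reservation_value(p, c) for p, c in zip(priors, costs)]
order = np.argsort(sigmas)[::-1]              # open highest index first
print(sigmas, order)
```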
    Unsupervised Reinforcement Adaptation for Class-Imbalanced Text Classification. (arXiv:2205.13139v1 [cs.CL])
    Class imbalance naturally arises when models are trained and tested in different domains. Unsupervised domain adaptation (UDA) augments model performance using only annotations accessible from the source domain and unlabeled data from the target domain. However, existing state-of-the-art UDA models learn domain-invariant representations and evaluate primarily on class-balanced data across domains. In this work, we propose an unsupervised domain adaptation approach via reinforcement learning that jointly leverages feature variants and imbalanced labels across domains. We experiment with the text classification task for its easily accessible datasets and compare the proposed method with five baselines. Experiments on three datasets show that our proposed method can effectively learn robust domain-invariant representations and successfully adapt text classifiers on imbalanced classes over domains. The code is available at https://github.com/woqingdoua/ImbalanceClass.
    Leveraging Dependency Grammar for Fine-Grained Offensive Language Detection using Graph Convolutional Networks. (arXiv:2205.13164v1 [cs.CL])
    The last few years have witnessed an exponential rise in the propagation of offensive text on social media. Identification of this text with high precision is crucial for the well-being of society. Most of the existing approaches tend to give high toxicity scores to innocuous statements (e.g., "I am a gay man"). These false positives result from over-generalization on the training data where specific terms in the statement may have been used in a pejorative sense (e.g., "gay"). Emphasis on such words alone can lead to discrimination against the classes these systems are designed to protect. In this paper, we address the problem of offensive language detection on Twitter, while also detecting the type and the target of the offence. We propose a novel approach called SyLSTM, which integrates syntactic features in the form of the dependency parse tree of a sentence and semantic features in the form of word embeddings into a deep learning architecture using a Graph Convolutional Network. Results show that the proposed approach significantly outperforms the state-of-the-art BERT model with orders of magnitude fewer parameters.
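    As a rough sketch of how syntactic structure can be folded into such a model, the snippet below runs one standard graph-convolution pass (Kipf-Welling symmetric normalisation) over a toy dependency graph; the exact SyLSTM wiring, the dimensions, and the parse edges are illustrative assumptions, not the paper's architecture.

        import torch

        def gcn_layer(X, A, W):
            # One graph-convolution pass over a sentence's dependency graph.
            # X: (n, d) word embeddings, A: (n, n) parse-tree adjacency, W: (d, h).
            A_hat = A + torch.eye(A.size(0))          # add self-loops
            d = A_hat.sum(dim=1)
            D_inv_sqrt = torch.diag(d.pow(-0.5))      # symmetric normalisation
            return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)

        n, d, h = 5, 16, 8                            # 5 tokens, toy dimensions
        X = torch.randn(n, d)                         # e.g. pretrained word embeddings
        A = torch.zeros(n, n)
        for head, dep in [(1, 0), (1, 2), (2, 3), (2, 4)]:  # hypothetical parse edges
            A[head, dep] = A[dep, head] = 1.0
        W = torch.randn(d, h)
        print(gcn_layer(X, A, W).shape)               # torch.Size([5, 8])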
    Towards Green AI with tensor networks -- Sustainability and innovation enabled by efficient algorithms. (arXiv:2205.12961v1 [cs.LG])
    The current standard to compare the performance of AI algorithms is mainly based on one criterion: the model's accuracy. In this context, algorithms with a higher accuracy (or similar measures) are considered better. To achieve new state-of-the-art results, algorithmic development is accompanied by an exponentially increasing amount of compute. While this has enabled AI research to achieve remarkable results, AI progress comes at a cost: it is unsustainable. In this paper, we present a promising tool for sustainable and thus Green AI: tensor networks (TNs). Being an established tool from multilinear algebra, TNs have the capability to improve efficiency without compromising accuracy. Since they can reduce compute significantly, we would like to highlight their potential for Green AI. We elaborate in both a kernel machine and deep learning setting how efficiency gains can be achieved with TNs. Furthermore, we argue that better algorithms should be evaluated in terms of both accuracy and efficiency. To that end, we discuss different efficiency criteria and analyze efficiency in an exemplifying experimental setting for kernel ridge regression. With this paper, we want to raise awareness about Green AI and showcase its positive impact on sustainability and AI research. Our key contribution is to demonstrate that TNs enable efficient algorithms and therefore contribute towards Green AI. In this sense, TNs pave the way for better algorithms in AI.
    Formalizing Preferences Over Runtime Distributions. (arXiv:2205.13028v1 [cs.AI])
    When trying to solve a computational problem we are often faced with a choice among algorithms that are all guaranteed to return the right answer but that differ in their runtime distributions (e.g., SAT solvers, sorting algorithms). This paper aims to lay theoretical foundations for such choices by formalizing preferences over runtime distributions. It might seem that we should simply prefer the algorithm that minimizes expected runtime. However, such preferences would be driven by exactly how slow our algorithm is on bad inputs, whereas in practice we are typically willing to cut off occasional, sufficiently long runs before they finish. We propose a principled alternative, taking a utility-theoretic approach to characterize the scoring functions that describe preferences over algorithms. These functions depend on the way our value for solving our problem decreases with time and on the distribution from which captimes are drawn. We describe examples of realistic utility functions and show how to leverage a maximum-entropy approach for modeling underspecified captime distributions. Finally, we show how to efficiently estimate an algorithm's expected utility from runtime samples.
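    The last step, estimating an algorithm's expected utility from runtime samples, can be sketched under simplifying assumptions: a single fixed captime and zero utility for capped runs (the paper also models distributions over captimes), with a hypothetical exponentially decaying utility function.

        import numpy as np

        def expected_utility(runtimes, utility, captime):
            # Runs exceeding `captime` are cut off and assumed to yield zero utility.
            runtimes = np.asarray(runtimes, dtype=float)
            u = np.where(runtimes <= captime, utility(runtimes), 0.0)
            return u.mean()

        # Hypothetical value of a solution, decaying with solve time.
        utility = lambda t: np.exp(-0.1 * t)
        samples_a = np.array([1.0, 2.0, 300.0, 1.5])   # fast, but one very long run
        samples_b = np.array([10.0, 12.0, 11.0, 9.0])  # consistently moderate
        for name, s in [("A", samples_a), ("B", samples_b)]:
            print(name, expected_utility(s, utility, captime=60.0))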
    Uniform Generalization Bound on Time and Inverse Temperature for Gradient Descent Algorithm and its Application to Analysis of Simulated Annealing. (arXiv:2205.12959v1 [cs.LG])
    In this paper, we propose a novel uniform generalization bound on the time and inverse temperature for stochastic gradient Langevin dynamics (SGLD) in a non-convex setting. While previous works derive their generalization bounds via uniform stability, we use Rademacher complexity to make our generalization bound independent of the time and inverse temperature. Using Rademacher complexity, we can reduce the problem of deriving a generalization bound on the whole space to that of deriving one on a bounded region, and can therefore remove the effect of the time and inverse temperature from our generalization bound. As an application of our generalization bound, an evaluation of the effectiveness of simulated annealing in a non-convex setting is also described. For the sample size $n$ and time $s$, we derive evaluations with orders $\sqrt{n^{-1} \log (n+1)}$ and $|(\log)^4(s)|^{-1}$, respectively. Here, $(\log)^4$ denotes the $4$ times composition of the logarithmic function.
    Concurrent Neural Tree and Data Preprocessing AutoML for Image Classification. (arXiv:2205.13033v1 [cs.LG])
    Deep Neural Networks (DNNs) are a widely-used solution for a variety of machine learning problems. However, it is often necessary to invest a significant amount of a data scientist's time to pre-process input data, test different neural network architectures, and tune hyper-parameters for optimal performance. Automated machine learning (autoML) methods automatically search the architecture and hyper-parameter space for optimal neural networks. However, current state-of-the-art (SOTA) methods do not include traditional methods for manipulating input data as part of the algorithmic search space. We adapt the Evolutionary Multi-objective Algorithm Design Engine (EMADE), a multi-objective evolutionary search framework for traditional machine learning methods, to perform neural architecture search. We also integrate EMADE's signal processing and image processing primitives. These primitives allow EMADE to manipulate input data before ingestion into the simultaneously evolved DNN. We show that including these methods in the search space has the potential to improve performance on the CIFAR-10 image classification benchmark dataset.
    Variance-Aware Sparse Linear Bandits. (arXiv:2205.13450v1 [cs.LG])
    It is well-known that the worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of time steps (ignoring the dependency on sparsity). On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th time step. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime ($\sigma_t = \Omega(1)$) and the benign deterministic regimes ($\sigma_t = 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a ``black-box'' manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.
    Learning the spatio-temporal relationship between wind and significant wave height using deep learning. (arXiv:2205.13325v1 [stat.ML])
    Ocean wave climate has a significant impact on near-shore and off-shore human activities, and its characterisation can help in the design of ocean structures such as wave energy converters and sea dikes. Therefore, engineers need long time series of ocean wave parameters. Numerical models are a valuable source of ocean wave data; however, they are computationally expensive. Consequently, statistical and data-driven approaches have gained increasing interest in recent decades. This work investigates the spatio-temporal relationship between North Atlantic wind and significant wave height (Hs) at an off-shore location in the Bay of Biscay, using a two-stage deep learning model. The first step uses convolutional neural networks (CNNs) to extract the spatial features that contribute to Hs. Then, long short-term memory (LSTM) is used to learn the long-term temporal dependencies between wind and waves.
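    A minimal sketch of such a two-stage model, assuming PyTorch and illustrative sizes (two wind components on a 32x32 grid, 12 time steps); the paper's actual architecture and hyperparameters may differ.

        import torch
        import torch.nn as nn

        class WindToHs(nn.Module):
            # A CNN summarises each wind field into a feature vector,
            # and an LSTM models the sequence of those vectors over time.
            def __init__(self, feat=32, hidden=64):
                super().__init__()
                self.cnn = nn.Sequential(
                    nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, feat, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
                self.lstm = nn.LSTM(feat, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)      # predicted Hs

            def forward(self, x):                     # x: (batch, time, 2, H, W) u/v wind
                b, t = x.shape[:2]
                z = self.cnn(x.flatten(0, 1)).view(b, t, -1)
                out, _ = self.lstm(z)
                return self.head(out[:, -1])          # Hs at the last time step

        model = WindToHs()
        print(model(torch.randn(4, 12, 2, 32, 32)).shape)  # torch.Size([4, 1])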
    Towards Symbolic Time Series Representation Improved by Kernel Density Estimators. (arXiv:2205.12960v1 [cs.LG])
    This paper deals with symbolic time series representation. It builds on the popular mapping technique Symbolic Aggregate approXimation algorithm (SAX), which is extensively utilized in sequence classification, pattern mining, anomaly detection, time series indexing and other data mining tasks. However, the disadvantage of this method is that it works reliably only for time series with a Gaussian-like distribution. In our previous work we proposed an improvement of SAX, called dwSAX, which can deal with Gaussian as well as non-Gaussian data distributions. Recently we have made further progress with our solution, edwSAX. Our goal was to optimally cover the information space by means of sufficient alphabet utilization, and to satisfy the lower bounding criterion as tightly as possible. We describe here our approach, including evaluation on commonly employed tasks such as time series reconstruction error and Euclidean distance lower bounding, with promising improvements over SAX.  ( 2 min )
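    For reference, a minimal sketch of the plain SAX baseline that dwSAX/edwSAX build on: z-normalisation, PAA reduction, then discretisation against Gaussian breakpoints (the Gaussian assumption being precisely what the improved variants relax). Segment count and alphabet size are illustrative, and the toy series length is assumed divisible by the number of segments.

        import numpy as np
        from scipy.stats import norm

        def sax(series, n_segments=8, alphabet_size=4):
            # z-normalise, reduce with Piecewise Aggregate Approximation (PAA),
            # then discretise against equiprobable Gaussian breakpoints.
            x = (series - series.mean()) / series.std()
            paa = x.reshape(n_segments, -1).mean(axis=1)  # assumes len divisible
            breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
            return np.digitize(paa, breakpoints)          # symbols 0..alphabet-1

        rng = np.random.default_rng(1)
        ts = rng.normal(size=64)
        print(sax(ts))   # e.g. an array of 8 symbols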
    QADAM: Quantization-Aware DNN Accelerator Modeling for Pareto-Optimality. (arXiv:2205.13045v1 [cs.AR])
    As the machine learning and systems communities strive to achieve higher energy-efficiency through custom deep neural network (DNN) accelerators, varied bit precision or quantization levels, there is a need for design space exploration frameworks that incorporate quantization-aware processing elements (PE) into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QADAM, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can facilitate future research on design space exploration and Pareto-efficiency of DNN accelerators for various design choices such as bit precision, PE type, scratchpad sizes of PEs, global buffer size, number of total PEs, and DNN configurations. Our results show that different bit precisions and PE types lead to significant differences in terms of performance per area and energy. Specifically, our framework identifies a wide range of design points where performance per area and energy varies more than 5x and 35x, respectively. We also show that the proposed lightweight processing elements (LightPEs) consistently achieve Pareto-optimal results in terms of accuracy and hardware-efficiency. With the proposed framework, we show that LightPEs achieve on-par accuracy and up to 5.7x improvements in performance per area and energy compared to the best INT16-based design.
    RACE: A Reinforcement Learning Framework for Improved Adaptive Control of NoC Channel Buffers. (arXiv:2205.13130v1 [cs.AR])
    Network-on-chip (NoC) architectures rely on buffers to store flits to cope with contention for router resources during packet switching. Recently, reversible multi-function channel (RMC) buffers have been proposed to simultaneously reduce power and enable adaptive NoC buffering between adjacent routers. While adaptive buffering can improve NoC performance by maximizing buffer utilization, controlling the RMC buffer allocations requires a congestion-aware, scalable, and proactive policy. In this work, we present RACE, a novel reinforcement learning (RL) framework that utilizes better awareness of network congestion and a new reward metric ("falsefulls") to help guide the RL agent towards better RMC buffer control decisions. We show that RACE reduces NoC latency by up to 48.9%, and energy consumption by up to 47.1% against state-of-the-art NoC buffer control policies.
  • Open

    Ranking the information content of distance measures. (arXiv:2104.15079v2 [stat.ML] UPDATED)
    Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Using the fewest features but still retaining sufficient information about the system is crucial in many statistical learning approaches, particularly when data are sparse. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This in turn allows finding the most informative distance measure out of a pool of candidates. The approach is applied to find the most relevant policy variables for controlling the Covid-19 epidemic and to find compact yet informative representations of atomic structures, but its potential applications are wide ranging in many branches of science.  ( 2 min )
    Machine Learning Construction: implications to cybersecurity. (arXiv:1906.10019v3 [cs.LG] UPDATED)
    Statistical learning is the process of estimating an unknown probabilistic input-output relationship of a system using a limited number of observations; a statistical learning machine (SLM) is the algorithm, function, model, or rule, that learns such a process; and machine learning (ML) is the conventional name of this field. ML and its applications are ubiquitous in the modern world. Cyberphysical systems such as Automatic target recognition (ATR) in military applications, computer aided diagnosis (CAD) in medical imaging, DNA microarrays in genomics, optical character recognition (OCR), speech recognition (SR), spam email filtering, stock market prediction, etc., are a few examples of ML applications: diverse fields but one theory. In particular, ML has gained a lot of attention in the field of cyberphysical security, especially in the last decade. It is of great importance to this field to design detection algorithms that have the capability of learning from security data to be able to hunt threats, achieve better monitoring, master the complexity of the threat intelligence feeds, and achieve timely remediation of security incidents. The field of ML can be decomposed into two basic subfields: \textit{construction} and \textit{assessment}. We mean by \textit{construction} designing or inventing an appropriate algorithm that learns from the input data and achieves a good performance according to some optimality criterion. We mean by \textit{assessment} attributing some performance measures to the constructed ML algorithm, along with their estimators, to objectively assess this algorithm.  ( 2 min )
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v2 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.  ( 2 min )
    A Penalized Shared-parameter Algorithm for Estimating Optimal Dynamic Treatment Regimens. (arXiv:2107.07875v2 [stat.ML] UPDATED)
    A dynamic treatment regimen (DTR) is a set of decision rules to personalize treatments for an individual using their medical history. The Q-learning based Q-shared algorithm has been used to develop DTRs that involve decision rules shared across multiple stages of intervention. We show that the existing Q-shared algorithm can suffer from non-convergence due to the use of linear models in the Q-learning setup, and identify the condition in which Q-shared fails. Leveraging properties from expansion-constrained ordinary least-squares, we give a penalized Q-shared algorithm that not only converges in settings that violate the condition, but can outperform the original Q-shared algorithm even when the condition is satisfied. We give evidence for the proposed method in a real-world application and several synthetic simulations.  ( 2 min )
    Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations. (arXiv:2204.04773v2 [stat.ML] UPDATED)
    Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm consists of the inner product of an unknown parameter with the context vector of that arm. The classical bandit settings heavily rely on assuming that the contexts are fully observed, while the study of the richer model of imperfectly observed contextual bandits remains immature. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analysis showcasing the above efficiency of Greedy policies is also provided.  ( 2 min )
    Improved Fine-Tuning by Better Leveraging Pre-Training Data. (arXiv:2111.12292v2 [cs.CV] UPDATED)
    As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets. However, recent studies have empirically shown that in some vision tasks, training from scratch can match the final performance of this pre-training strategy once the number of training samples is increased. In this work, we revisit this phenomenon from the perspective of generalization analysis by using the excess risk bound, which is popular in learning theory. The result reveals that the excess risk bound may have a weak dependency on the pre-trained model. This observation inspires us to leverage the pre-training data during fine-tuning, since this data is typically also available at that stage. The generalization result of using pre-training data shows that the excess risk bound on a target task can be improved when appropriate pre-training data is included in fine-tuning. With this theoretical motivation, we propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task. Extensive experimental results for image classification tasks on 8 benchmark data sets verify the effectiveness of the proposed data selection based fine-tuning pipeline.  ( 2 min )
    Amplitude Mean of Functional Data on $\mathbb{S}^2$. (arXiv:2107.13721v4 [stat.ML] UPDATED)
    Manifold-valued functional data analysis (FDA) has recently become an active area of research, motivated by the rising availability of trajectories or longitudinal data observed on non-linear manifolds. The challenges of analyzing such data come from many aspects, including infinite dimensionality and nonlinearity, as well as time-domain or phase variability. In this paper, we study the amplitude part of manifold-valued functions on $\mathbb{S}^2$, which is invariant to random time warping or re-parameterization. Utilizing the nice geometry of $\mathbb{S}^2$, we develop a set of efficient and accurate tools for temporal alignment of functions, geodesic computing, and sample mean calculation. At their heart, these tools rely on gradient descent algorithms with carefully derived gradients. We show the advantages of these newly developed tools over their competitors with extensive simulations and real data, and demonstrate the importance of considering the amplitude part of functions instead of mixing it with phase variability in manifold-valued FDA.  ( 2 min )
    Lorentzian Fully Hyperbolic Generative Adversarial Network. (arXiv:2201.12825v2 [cs.LG] UPDATED)
    With the recent advance of deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. While a variety of hyperbolic neural network structures have been proposed, they mainly focus on discriminative tasks, and generative models in the hyperbolic space have scarcely been studied. In this work, we propose a hyperbolic generative adversarial network (GAN) within the Lorentz model for generating hyperbolic data. In addition to existing hyperbolic operations, we design novel hyperbolic layers to guarantee stable training. We first use synthetic data to show that our network is able to learn a simple distribution in the hyperbolic space. Moreover, by virtue of an autoencoder, we construct a neural network model, named HAEGAN, for generating more complex data in the hyperbolic space. HAEGAN contains three parts: first, a hyperbolic autoencoder; second, a hyperbolic GAN for generating the latent embedding of the autoencoder; third, a generator that inherits the decoder from the autoencoder and the generator from the GAN. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.  ( 2 min )
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v1 [q-bio.NC])
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.  ( 2 min )
    Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap. (arXiv:2205.13527v1 [stat.ML])
    A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. We identify in particular the existence of a statistical-to-computational gap between algorithms that require a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha} $ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.  ( 2 min )
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v3 [cs.LG] UPDATED)
    Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model so that it satisfies the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization.  ( 2 min )
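    A minimal sketch of the sample-and-weight idea, not the paper's exact transformation: draw GP posterior samples and weight each by how well it respects the approximately-known bounds (scikit-learn, toy data, illustrative bounds).

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        rng = np.random.default_rng(2)
        X = rng.uniform(0, 1, size=(6, 1))
        y = np.sin(2 * np.pi * X).ravel()
        gp = GaussianProcessRegressor(kernel=RBF(0.2)).fit(X, y)

        X_test = np.linspace(0, 1, 50)[:, None]
        samples = gp.sample_y(X_test, n_samples=200, random_state=3)  # (50, 200)

        lower, upper = -1.0, 1.0          # approximately-known output bounds
        inside = ((samples >= lower) & (samples <= upper)).mean(axis=0)
        weights = inside / inside.sum()   # soft weighting by fraction in bounds
        weighted_mean = samples @ weights # bound-respecting posterior mean estimate
        print(weighted_mean.shape)        # (50,)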
    Coherent Probabilistic Aggregate Queries on Long-horizon Forecasts. (arXiv:2111.03394v2 [cs.LG] UPDATED)
    Long range forecasts are the starting point of many decision support systems that need to draw inference from high-level aggregate patterns on forecasted values. State-of-the-art time-series forecasting methods are either subject to concept drift on long-horizon forecasts, or fail to accurately predict coherent and accurate high-level aggregates. In this work, we present a novel probabilistic forecasting method that produces forecasts that are coherent in terms of base level and predicted aggregate statistics. We achieve the coherency between predicted base-level and aggregate statistics using a novel inference method based on KL-divergence that can be solved efficiently in closed form. We show that our method improves forecast performance across both base level and unseen aggregates post inference on real datasets spanning three diverse domains. (\href{https://github.com/pratham16cse/AggForecaster}{Project URL})  ( 2 min )
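    To illustrate what a closed-form reconciliation between base-level and aggregate forecasts can look like, here is one classical Gaussian version, assumed for illustration (the paper derives its own KL-divergence-based closed form): shift the base means toward consistency with an independently forecasted sum, spreading the correction in proportion to base variances.

        import numpy as np

        def reconcile_sum(mu, var, mu_agg, var_agg):
            # Standard Gaussian conditioning on a noisy observation of the sum:
            # each base mean absorbs a variance-proportional share of the gap.
            gap = mu_agg - mu.sum()
            adjust = var / (var.sum() + var_agg)
            return mu + adjust * gap

        mu = np.array([10.0, 20.0, 30.0])       # base-level forecast means
        var = np.array([1.0, 4.0, 1.0])         # base-level forecast variances
        mu_rec = reconcile_sum(mu, var, mu_agg=66.0, var_agg=2.0)
        print(mu_rec, mu_rec.sum())             # the sum moves toward 66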
    The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks. (arXiv:2108.11489v2 [stat.ML] UPDATED)
    The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk.  ( 2 min )
    Factor selection in screening experiments by aggregation over random models. (arXiv:2205.13497v1 [stat.ME])
    Screening experiments are useful for screening out a small number of truly important factors from a large number of potentially important factors. The Gauss-Dantzig Selector (GDS) is often the preferred analysis method for screening experiments. Just considering main-effects models can result in erroneous conclusions, but including interaction terms, even if restricted to two-factor interactions, increases the number of model terms dramatically and challenges the GDS analysis. We propose a new analysis method, called Gauss-Dantzig Selector Aggregation over Random Models (GDS-ARM), which performs a GDS analysis on multiple models that include only some randomly selected interactions. Results from these different analyses are then aggregated to identify the important factors. We discuss the proposed method, suggest choices for the tuning parameters, and study its performance on real and simulated data.  ( 2 min )
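    A rough sketch of the aggregation-over-random-models idea, with scikit-learn's Lasso standing in for the Gauss-Dantzig Selector (both are sparse linear selectors; the substitution, the tuning values and the toy design are assumptions): fit main effects plus a random subset of two-factor interactions many times, then report how often each factor is selected.

        import numpy as np
        from itertools import combinations
        from sklearn.linear_model import Lasso

        def gds_arm_like(X, y, n_models=50, n_interactions=10, seed=0):
            # Aggregate sparse fits over random interaction subsets and
            # return the selection frequency of each factor.
            rng = np.random.default_rng(seed)
            n, k = X.shape
            pairs = list(combinations(range(k), 2))
            votes = np.zeros(k)
            for _ in range(n_models):
                chosen = [pairs[i] for i in rng.choice(len(pairs), n_interactions,
                                                       replace=False)]
                Z = np.hstack([X] + [X[:, [i]] * X[:, [j]] for i, j in chosen])
                coef = Lasso(alpha=0.1).fit(Z, y).coef_
                active = set(np.flatnonzero(np.abs(coef[:k]) > 1e-8))
                for m, (i, j) in enumerate(chosen):
                    if abs(coef[k + m]) > 1e-8:
                        active |= {i, j}       # an active interaction implicates both factors
                votes[list(active)] += 1
            return votes / n_models

        rng = np.random.default_rng(1)
        X = rng.choice([-1.0, 1.0], size=(40, 12))   # toy two-level screening design
        y = 2 * X[:, 0] - 1.5 * X[:, 3] + X[:, 0] * X[:, 3] + rng.normal(0, 0.3, 40)
        print(np.round(gds_arm_like(X, y), 2))       # factors 0 and 3 should dominate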
    Unsupervised Learning From Incomplete Measurements for Inverse Problems. (arXiv:2201.12151v3 [stat.ML] UPDATED)
    In many real-world inverse problems, only incomplete measurement data are available for training which can pose a problem for learning a reconstruction function. Indeed, unsupervised learning using a fixed incomplete measurement process is impossible in general, as there is no information in the nullspace of the measurement operator. This limitation can be overcome by using measurements from multiple operators. While this idea has been successfully applied in various applications, a precise characterization of the conditions for learning is still lacking. In this paper, we fill this gap by presenting necessary and sufficient conditions for learning the underlying signal model needed for reconstruction which indicate the interplay between the number of distinct measurement operators, the number of measurements per operator, the dimension of the model and the dimension of the signals. Furthermore, we propose a novel and conceptually simple unsupervised learning loss which only requires access to incomplete measurement data and achieves a performance on par with supervised learning when the sufficient condition is verified. We validate our theoretical bounds and demonstrate the advantages of the proposed unsupervised loss compared to previous methods via a series of experiments on various imaging inverse problems, such as accelerated magnetic resonance imaging, compressed sensing and image inpainting.  ( 2 min )
    When Doubly Robust Methods Meet Machine Learning for Estimating Treatment Effects from Real-World Data: A Comparative Study. (arXiv:2204.10969v2 [stat.ME] UPDATED)
    Observational cohort studies are increasingly being used for comparative effectiveness research to assess the safety of therapeutics. Recently, various doubly robust methods have been proposed for average treatment effect estimation by combining the treatment model and the outcome model via different vehicles, such as matching, weighting, and regression. The key advantage of doubly robust estimators is that they require either the treatment model or the outcome model to be correctly specified to obtain a consistent estimator of average treatment effects, and therefore lead to a more accurate and often more precise inference. However, little work has been done to understand how doubly robust estimators differ due to their unique strategies of using the treatment and outcome models and how machine learning techniques can be combined with these estimators to boost their performance. Also, little is understood about how the challenges of covariate selection, overlap of covariate distributions, and treatment effect heterogeneity affect the performance of these doubly robust estimators. Here we examine multiple popular doubly robust methods in the categories of matching, weighting, or regression, and compare their performance using different treatment and outcome modeling via extensive simulations and a real-world application. We found that incorporating machine learning with doubly robust estimators such as the targeted maximum likelihood estimator gives the best overall performance. Practical guidance on how to apply doubly robust estimators is provided.  ( 2 min )
    Selective Classification Via Neural Network Training Dynamics. (arXiv:2205.13532v1 [cs.LG])
    Selective classification is the task of rejecting inputs a model would predict incorrectly on through a trade-off between input space coverage and model accuracy. Current methods for selective classification impose constraints on either the model architecture or the loss function; this inhibits their usage in practice. In contrast to prior work, we show that state-of-the-art selective classification performance can be attained solely from studying the (discretized) training dynamics of a model. We propose a general framework that, for a given test input, monitors metrics capturing the disagreement with the final predicted label over intermediate models obtained during training; we then reject data points exhibiting too much disagreement at late stages in training. In particular, we instantiate a method that tracks when the label predicted during training stops disagreeing with the final predicted label. Our experimental evaluation shows that our method achieves state-of-the-art accuracy/coverage trade-offs on typical selective classification benchmarks. For example, we improve coverage on CIFAR-10/SVHN by 10.1%/1.5% respectively at a fixed target error of 0.5%.  ( 2 min )
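    A minimal sketch of one disagreement rule of this kind, assuming the labels predicted by intermediate checkpoints have been stored as an array; the "stable over the last fraction of training" rule, its threshold, and the toy data are illustrative, and the paper's actual disagreement metrics are richer.

        import numpy as np

        def accept_mask(checkpoint_preds, last_stable_frac=0.25):
            # checkpoint_preds: (n_checkpoints, n_examples) predicted labels.
            # Accept an input only if its final label already held, unchanged,
            # over the last fraction of training checkpoints.
            final = checkpoint_preds[-1]
            start = int(len(checkpoint_preds) * (1 - last_stable_frac))
            return (checkpoint_preds[start:] == final).all(axis=0)

        preds = np.array([[0, 1, 2],
                          [0, 1, 2],
                          [0, 1, 2],
                          [0, 2, 2]])   # 4 checkpoints, 3 test inputs
        print(accept_mask(preds, last_stable_frac=0.5))  # [True, False, True]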
    Forest Fire Clustering for Single-cell Sequencing with Iterative Label Propagation and Parallelized Monte Carlo Simulation. (arXiv:2103.11802v4 [cs.LG] UPDATED)
    In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis.  ( 2 min )
    Training ReLU networks to high uniform accuracy is intractable. (arXiv:2205.13531v1 [cs.LG])
    Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results.  ( 2 min )
    Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. (arXiv:2205.11508v2 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations. Although SSL has recently reached a milestone, outperforming supervised methods in many modalities, its theoretical foundations remain limited, method-specific, and fail to provide principled design guidelines to practitioners. In this paper, we propose a unifying framework under the helm of spectral manifold learning to address those limitations. Through the course of this study, we will rigorously demonstrate that VICReg, SimCLR, BarlowTwins et al. correspond to eponymous spectral methods such as Laplacian Eigenmaps, Multidimensional Scaling et al. This unification will then allow us to obtain (i) the closed-form optimal representation for each method, (ii) the closed-form optimal network parameters in the linear regime for each method, (iii) the impact of the pairwise relations used during training on each of those quantities and on downstream task performances, and most importantly, (iv) the first theoretical bridge between contrastive and non-contrastive methods towards global and local spectral embedding methods respectively, hinting at the benefits and limitations of each. For example, (i) if the pairwise relation is aligned with the downstream task, any SSL method can be employed successfully and will recover the supervised method, but in the low data regime, VICReg's invariance hyper-parameter should be high; (ii) if the pairwise relation is misaligned with the downstream task, VICReg with small invariance hyper-parameter should be preferred over SimCLR or BarlowTwins.
    The Neural Testbed: Evaluating Joint Predictions. (arXiv:2110.04629v3 [cs.LG] UPDATED)
    Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces \textit{The Neural Testbed}: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions. We also show that the quality of joint predictions drives performance in downstream decision tasks. We find these results are robust across a wide range of generative models, and highlight the practical importance of joint predictions to the community.
    Censored Quantile Regression Neural Networks. (arXiv:2205.13496v1 [stat.ML])
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
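    The uncensored building block of the major contribution, a single network trained on a whole grid of quantiles at once, can be sketched with the standard pinball loss (PyTorch, toy data); the paper's EM-style handling of censoring is not shown here.

        import torch

        def grid_pinball_loss(pred, y, quantiles):
            # pred: (batch, n_quantiles) network outputs, y: (batch,) targets.
            # Standard pinball loss, averaged jointly over the quantile grid.
            diff = y[:, None] - pred
            q = quantiles[None, :]
            return torch.maximum(q * diff, (q - 1) * diff).mean()

        torch.manual_seed(0)
        quantiles = torch.tensor([0.1, 0.5, 0.9])
        net = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(),
                                  torch.nn.Linear(32, len(quantiles)))
        opt = torch.optim.Adam(net.parameters(), lr=1e-2)
        x, y = torch.randn(128, 4), torch.randn(128)
        for _ in range(200):
            opt.zero_grad()
            loss = grid_pinball_loss(net(x), y, quantiles)
            loss.backward()
            opt.step()
        print(loss.item())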
    A framework for overparameterized learning. (arXiv:2205.13507v1 [cs.LG])
    An explanation for the success of deep neural networks is a central question in theoretical machine learning. According to classical statistical learning, the overparameterized nature of such models should imply a failure to generalize. Many argue that good empirical performance is due to the implicit regularization of first order optimization methods. In particular, the Polyak-{\L}ojasiewicz condition leads to gradient descent finding a global optimum that is close to initialization. In this work, we propose a framework consisting of a prototype learning problem, which is general enough to cover many popular problems and even the cases of infinitely wide neural networks and infinite data. We then perform an analysis from the perspective of the Polyak-{\L}ojasiewicz condition. We obtain theoretical results of independent interest, concerning gradient descent on a composition $(f \circ F): G \to \mathbb{R}$ of functions $F: G \to H$ and $f: H \to \mathbb{R}$ with $G, H$ being Hilbert spaces. Building on these results, we determine the properties that have to be satisfied by the components of the prototype problem for gradient descent to find a global optimum that is close to initialization. We then demonstrate that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem. Finally, we lay out a number of directions for future research.
    RKHS-SHAP: Shapley Values for Kernel Methods. (arXiv:2110.09167v2 [stat.ML] UPDATED)
    Feature attribution for kernel methods is often heuristic and not individualised for each prediction. To address this, we turn to the concept of Shapley values~(SV), a coalition game theoretical framework that has previously been applied to different machine learning model interpretation tasks, such as linear models, tree ensembles and deep networks. By analysing SVs from a functional perspective, we propose \textsc{RKHS-SHAP}, an attribution method for kernel machines that can efficiently compute both \emph{Interventional} and \emph{Observational Shapley values} using kernel mean embeddings of distributions. We show theoretically that our method is robust with respect to local perturbations - a key yet often overlooked desideratum for consistent model interpretation. Further, we propose \emph{Shapley regulariser}, applicable to a general empirical risk minimisation framework, allowing learning while controlling the level of specific feature's contributions to the model. We demonstrate that the Shapley regulariser enables learning which is robust to covariate shift of a given feature and fair learning which controls the SVs of sensitive features.
    Towards the Generalization of Contrastive Self-Supervised Learning. (arXiv:2111.00743v3 [cs.LG] UPDATED)
    Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for training. Contrastive learning is one popular method for self-supervised learning and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a kind of $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound of the downstream classification error based on the measure. We show that the generalization ability of contrastive self-supervised learning depends on three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors can be optimized by contrastive algorithms, while the third is determined a priori by the pre-defined data augmentation. With the above theoretical findings, we further study two canonical contrastive losses, InfoNCE and cross-correlation loss, and prove that both of them are indeed able to satisfy the first two factors. Moreover, we empirically verify the third factor by conducting various experiments on the real-world dataset, and show that our theoretical inferences on the relationship between the data augmentation and the generalization of contrastive self-supervised learning agree with the empirical observations.
    Machine Learning Assessment: implications to cybersecurity. (arXiv:1907.12851v4 [stat.ML] UPDATED)
    This chapter is dedicated to the assessment and performance estimation of machine learning (ML) algorithms, a topic that is equally important to the construction of these algorithms, in particular in the context of cyberphysical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., jackknife, bootstrap and cross validation (CV). Special statistics of great interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge about the probability distribution of the data or the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods that can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive, because they rely on repeating the training and testing of a ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to the traditional ML algorithms rather than the very computationally demanding approaches of the recent deep neural networks (DNN). In the field of cyberphysical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to the DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence, the relevance of this chapter to this field.
    Epistemic Neural Networks. (arXiv:2107.08924v3 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of \textit{joint} predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the \textit{epistemic neural network} (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the \textit{epinet}: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
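    A rough sketch of the epinet idea as described above, with illustrative sizes and wiring (not the paper's exact architecture): a small MLP consumes the base network's detached features together with a sampled epistemic index z and perturbs the base logits, so that variation over z expresses uncertainty and yields joint predictions.

        import torch
        import torch.nn as nn

        class Epinet(nn.Module):
            # Supplements a frozen-feature base network with an index-conditioned
            # correction to its logits; sizes here are toy assumptions.
            def __init__(self, feat_dim, index_dim, n_classes, hidden=32):
                super().__init__()
                self.mlp = nn.Sequential(
                    nn.Linear(feat_dim + index_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, n_classes))

            def forward(self, features, base_logits, z):
                corr = self.mlp(torch.cat([features.detach(), z], dim=-1))
                return base_logits + corr

        base = nn.Sequential(nn.Linear(10, 16), nn.ReLU())
        head = nn.Linear(16, 3)
        epinet = Epinet(feat_dim=16, index_dim=4, n_classes=3)

        x = torch.randn(5, 10)
        feats = base(x)
        logits = head(feats)
        # Joint predictions: sample several epistemic indices per batch.
        zs = torch.randn(8, 1, 4).expand(8, 5, 4)
        joint = torch.stack([epinet(feats, logits, z) for z in zs])
        print(joint.shape)   # (8 index samples, 5 inputs, 3 classes)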
    Gaussian Universality of Linear Classifiers with Random Labels in High-Dimension. (arXiv:2205.13303v1 [stat.ML])
    While classical in many theoretical settings, the assumption of Gaussian i.i.d. inputs is often perceived as a strong limitation in the analysis of high-dimensional learning. In this study, we redeem this line of work in the case of generalized linear classification with random labels. Our main contribution is a rigorous proof that data coming from a range of generative models in high-dimensions have the same minimum training loss as Gaussian data with corresponding data covariance. In particular, our theorem covers data created by an arbitrary mixture of homogeneous Gaussian clouds, as well as multi-modal generative neural networks. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. Finally, we show that this universality property is observed in practice with real datasets and random labels.  ( 2 min )
    Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost. (arXiv:2205.13170v1 [cs.LG])
    We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features. For this problem, we propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that over $T$ rounds ($NT$ actions in total) the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is at most $\tilde{\mathcal{O}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. Remarkably, we derive an information-theoretic lower bound on the communication cost of the distributed contextual linear bandit problem with stochastic contexts, and prove that our proposed algorithm is nearly minimax optimal in terms of \emph{both regret and communication cost}. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their \emph{immediate neighbors} through a carefully designed consensus procedure.  ( 2 min )
    Towards Learning Universal Hyperparameter Optimizers with Transformers. (arXiv:2205.13320v1 [cs.LG])
    Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild. Our extensive experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the path to future extensions for training a Transformer-based model as a general HPO optimizer.  ( 2 min )
    Your Transformer May Not be as Powerful as You Expect. (arXiv:2205.13401v1 [cs.LG])
    Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of the RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. One may naturally assume the answer is in the affirmative -- RPE-based Transformers are universal function approximators. However, we present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason lies in that most RPEs are placed in the softmax attention that always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With the theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications.  ( 2 min )
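    A sketch of one way to act on the abstract's diagnosis, assumed for illustration rather than taken from the paper: multiply the row-stochastic softmax attention elementwise by a trainable relative-position (Toeplitz) gate, so that attention rows are no longer forced to sum to one.

        import torch

        def urpe_attention(q, k, v, pos_gate):
            # Softmax attention gated elementwise by a learnable function of
            # relative position; the gate breaks the right-stochastic constraint.
            n, d = q.shape
            attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)   # right stochastic
            idx = torch.arange(n)
            rel = idx[:, None] - idx[None, :] + n - 1          # relative offsets
            gated = attn * pos_gate[rel]                       # rows no longer sum to 1
            return gated @ v

        n, d = 6, 16
        q, k, v = (torch.randn(n, d) for _ in range(3))
        pos_gate = torch.nn.Parameter(torch.ones(2 * n - 1))   # one scalar per offset
        print(urpe_attention(q, k, v, pos_gate).shape)         # torch.Size([6, 16])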
    Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback. (arXiv:2205.13451v1 [cs.LG])
    We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions are changing over time and adversarially chosen, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback). While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about the Follow-the-Perturbed-Leader (FTPL) methods, which are usually computationally more efficient and also easier to implement since they only require solving an offline planning problem. Motivated by this, we take a closer look at FTPL for learning AMDPs, starting from the standard episodic finite-horizon setting. We find some unique and intriguing difficulties in the analysis and propose a workaround to eventually show that FTPL is also able to achieve near-optimal regret bounds in this case. More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using FTPL, we also develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions. Our algorithm is efficient assuming access to an offline planning oracle, while even for the easier full-information setting, the only existing algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.  ( 2 min )
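    To convey the FTPL principle in its simplest form, here is a sketch for the classic full-information experts setting rather than the paper's AMDP bandit setting; the perturbation scale is a placeholder:

        import numpy as np

        rng = np.random.default_rng(0)
        K, T, eta = 5, 1000, 10.0                   # arms, rounds, perturbation scale
        cum_loss = np.zeros(K)
        total = 0.0
        for t in range(T):
            losses = rng.uniform(size=K)            # losses the adversary reveals this round
            perturb = rng.exponential(scale=eta, size=K)
            a = np.argmin(cum_loss - perturb)       # follow the perturbed leader
            total += losses[a]
            cum_loss += losses                      # full information, for simplicity
        print(total / T, cum_loss.min() / T)        # average loss vs. best fixed arm
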
    Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency. (arXiv:2205.13476v1 [cs.LG])
    Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.  ( 2 min )
    Fair Representation Learning through Implicit Path Alignment. (arXiv:2205.13316v1 [cs.LG])
    We consider a fair representation learning perspective, where optimal predictors, on top of the data representation, are ensured to be invariant with respect to different sub-groups. Specifically, we formulate this intuition as a bi-level optimization, where the representation is learned in the outer loop, and invariant optimal group predictors are updated in the inner loop. Moreover, the proposed bi-level objective is demonstrated to fulfill the sufficiency rule, which is desirable in various practical scenarios but has not been commonly studied in fair learning. Furthermore, to avoid the high computational and memory cost of differentiating through the inner loop of the bi-level objective, we propose an implicit path alignment algorithm, which relies only on the solution of the inner optimization and implicit differentiation rather than the exact optimization path. We further analyze the error gap of the implicit approach and empirically validate the proposed method in both classification and regression settings. Experimental results show a consistently better trade-off between prediction performance and fairness measures.  ( 2 min )
    Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. (arXiv:2205.13209v1 [cs.LG])
    Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over conventional CO solvers, as DRL-NCO is capable of learning CO solvers without supervised labels obtained from a verified solver. This paper presents a novel training scheme, Sym-NCO, that achieves significant performance gains over existing DRL-NCO methods. Sym-NCO is a regularizer-based training scheme that leverages universal symmetries present in various CO problems and solutions. Imposing symmetries such as rotational and reflectional invariance can greatly improve the generalization capability of DRL-NCO, as symmetries are invariant features shared by certain CO tasks. Our experimental results verify that Sym-NCO greatly improves the performance of DRL-NCO methods in four CO tasks, including the traveling salesman problem (TSP), capacitated vehicle routing problem (CVRP), prize collecting TSP (PCTSP), and orienteering problem (OP), without employing problem-specific techniques. Remarkably, Sym-NCO outperformed not only existing DRL-NCO methods but also a competitive conventional solver, iterative local search (ILS), on PCTSP while running 240 times faster.  ( 2 min )
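    The symmetry being exploited is easy to see concretely: a random rotation of the city coordinates leaves every tour length unchanged, so a solver's output should be invariant too. A small numpy check (our illustration, not the paper's regularizer):

        import numpy as np

        rng = np.random.default_rng(0)
        cities = rng.uniform(size=(10, 2))
        theta = rng.uniform(0, 2 * np.pi)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        center = cities.mean(axis=0)
        rotated = (cities - center) @ R.T + center   # rotate the instance about its centroid

        def tour_length(xy, tour):
            return np.linalg.norm(xy[tour] - xy[np.roll(tour, 1)], axis=1).sum()

        tour = rng.permutation(10)
        print(tour_length(cities, tour), tour_length(rotated, tour))  # identical lengths
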
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v1 [cs.LG])
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to iterate over input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which generalizes active learning based on partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression.  ( 2 min )
    Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model. (arXiv:2205.13085v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients even within the same diagnostic category. A few underlying root causes may nevertheless initiate the development of disease within each patient. We therefore focus on identifying patient-specific root causes of disease, which we equate to the sample-specific predictivity of the exogenous error terms in a structural equation model. We generalize from the linear setting to the heteroscedastic noise model where $Y = m(X) + \varepsilon\sigma(X)$ with non-linear functions $m(X)$ and $\sigma(X)$ representing the conditional mean and mean absolute deviation, respectively. This model preserves identifiability but introduces non-trivial challenges that require a customized algorithm called Generalized Root Causal Inference (GRCI) to extract the error terms correctly. GRCI recovers patient-specific root causes more accurately than existing alternatives.  ( 2 min )
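    The core quantity is easy to state: once estimates of $m(X)$ and $\sigma(X)$ are available, the exogenous error is recovered by standardizing the residual. A toy numpy sketch with the true functions standing in for the estimates (GRCI itself is considerably more involved):

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1000
        X = rng.uniform(-2, 2, size=n)
        m = lambda x: np.sin(x)          # conditional mean (assumed known in this toy)
        s = lambda x: 0.5 + x ** 2       # conditional scale (assumed known in this toy)
        eps = rng.normal(size=n)
        Y = m(X) + eps * s(X)

        # With m and sigma in hand, the exogenous error term is recovered by standardizing:
        eps_hat = (Y - m(X)) / s(X)
        print(np.corrcoef(eps, eps_hat)[0, 1])   # ~1.0: the errors are extracted exactly
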
    Factorized Structured Regression for Large-Scale Varying Coefficient Models. (arXiv:2205.13080v1 [stat.ML])
    Recommender Systems (RS) pervade many aspects of our everyday digital life. Proposed to work at scale, state-of-the-art RS allow the modeling of thousands of interactions and facilitate highly individualized recommendations. Conceptually, many RS can be viewed as instances of statistical regression models that incorporate complex feature effects and potentially non-Gaussian outcomes. Such structured regression models, including time-aware varying coefficients models, are, however, limited in their applicability to categorical effects and inclusion of a large number of interactions. Here, we propose Factorized Structured Regression (FaStR) for scalable varying coefficient models. FaStR overcomes limitations of general regression models for large-scale data by combining structured additive regression and factorization approaches in a neural network-based model implementation. This fusion provides a scalable framework for the estimation of statistical models in previously infeasible data settings. Empirical results confirm that the estimation of varying coefficients of our approach is on par with state-of-the-art regression techniques, while scaling notably better and also being competitive with other time-aware RS in terms of prediction performance. We illustrate FaStR's performance and interpretability on a large-scale behavioral study with smartphone user data.  ( 2 min )
    On Learning Mixture of Linear Regressions in the Non-Realizable Setting. (arXiv:2205.13166v1 [stat.ML])
    While mixture of linear regressions (MLR) is a well-studied topic, prior works usually do not analyze such models for prediction error. In fact, {\em prediction} and {\em loss} are not well-defined in the context of mixtures. In this paper, first we show that MLR can be used for prediction where instead of predicting a label, the model predicts a list of values (also known as {\em list-decoding}). The list size is equal to the number of components in the mixture, and the loss function is defined as the minimum among the losses incurred by the component models. We show that with this definition, a solution of the empirical risk minimization (ERM) achieves a small probability of prediction error. This calls for an algorithm to minimize the empirical risk for MLR, which is known to be computationally hard. Prior algorithmic works in MLR focus on the {\em realizable} setting, i.e., recovery of parameters when data is probabilistically generated by a mixed linear (noisy) model. In this paper we show that a version of the popular alternating minimization (AM) algorithm finds the best fit lines in a dataset even when a realizable model is not assumed, under some regularity conditions on the dataset and the initial points, and thereby provides a solution for the ERM. We further provide an algorithm that runs in polynomial time in the number of datapoints, and recovers a good approximation of the best fit lines. The two algorithms are experimentally compared.  ( 2 min )
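    A minimal numpy sketch of the alternating minimization idea -- assign each point to its best-fit line under the min-loss (list-decoding) criterion, then refit each line by least squares. This is our illustration; the paper's algorithm and regularity conditions are more refined:

        import numpy as np

        rng = np.random.default_rng(0)
        n, k = 400, 2
        X = np.c_[rng.uniform(-1, 1, n), np.ones(n)]        # feature plus intercept
        true_W = np.array([[3.0, 1.0], [-2.0, 0.5]])
        z = rng.integers(k, size=n)                         # hidden component labels
        y = (X * true_W[z]).sum(axis=1) + 0.05 * rng.normal(size=n)

        W = rng.normal(size=(k, 2))                         # random initial lines
        for _ in range(20):
            losses = (y[:, None] - X @ W.T) ** 2            # per-line loss for every point
            assign = losses.argmin(axis=1)                  # the min-loss (list-decoding) rule
            for j in range(k):                              # refit each line by least squares
                mask = assign == j
                if mask.sum() > 1:
                    W[j] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        print(W)                                            # ~true_W, up to line permutation
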
    Learning the spatio-temporal relationship between wind and significant wave height using deep learning. (arXiv:2205.13325v1 [stat.ML])
    Ocean wave climate has a significant impact on near-shore and off-shore human activities, and its characterisation can help in the design of ocean structures such as wave energy converters and sea dikes. Therefore, engineers need long time series of ocean wave parameters. Numerical models are a valuable source of ocean wave data; however, they are computationally expensive. Consequently, statistical and data-driven approaches have gained increasing interest in recent decades. This work investigates the spatio-temporal relationship between North Atlantic wind and significant wave height (Hs) at an off-shore location in the Bay of Biscay, using a two-stage deep learning model. The first step uses convolutional neural networks (CNNs) to extract the spatial features that contribute to Hs. Then, long short-term memory (LSTM) is used to learn the long-term temporal dependencies between wind and waves.  ( 2 min )
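    A hedged PyTorch sketch of what such a two-stage architecture could look like; the channel counts, field sizes, and class name are our placeholders, not the authors' configuration:

        import torch
        import torch.nn as nn

        class WindToHs(nn.Module):
            """Hypothetical two-stage model: CNN spatial encoder, then LSTM over time."""
            def __init__(self, feat=32, hidden=64):
                super().__init__()
                self.cnn = nn.Sequential(
                    nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),   # u, v wind fields
                    nn.Conv2d(16, feat, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                )
                self.lstm = nn.LSTM(feat, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)

            def forward(self, x):                  # x: (batch, time, 2, H, W)
                b, t = x.shape[:2]
                f = self.cnn(x.flatten(0, 1)).view(b, t, -1)   # per-step spatial features
                out, _ = self.lstm(f)                          # temporal dependencies
                return self.head(out[:, -1])                   # Hs at the last time step

        print(WindToHs()(torch.randn(4, 12, 2, 64, 64)).shape)   # torch.Size([4, 1])
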
    Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization. (arXiv:2205.13098v1 [cs.LG])
    The computation of Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. The approximation of the Wasserstein gradient with finite samples requires solving a variational problem. We study the variational problem in the family of two-layer networks with squared-ReLU activations, towards which we derive a semi-definite programming (SDP) relaxation. This SDP can be viewed as an approximation of the Wasserstein gradient in a broader function family including two-layer networks. By solving the convex SDP, we obtain the optimal approximation of the Wasserstein gradient direction in this class of functions. Numerical experiments including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling demonstrate the effectiveness of the proposed method.  ( 2 min )
    Efficient and Near-Optimal Smoothed Online Learning for Generalized Linear Functions. (arXiv:2205.13056v1 [stat.ML])
    Due to the drastic gap in complexity between sequential and batch statistical learning, recent work has studied a smoothed sequential learning setting, where Nature is constrained to select contexts with density bounded by $1/\sigma$ with respect to a known measure $\mu$. Unfortunately, for some function classes, there is an exponential gap between the statistically optimal regret and that which can be achieved efficiently. In this paper, we give a computationally efficient algorithm that is the first to enjoy the statistically optimal $\log(T/\sigma)$ regret for realizable K-wise linear classification. We extend our results to settings where the true classifier is linear in an over-parameterized polynomial featurization of the contexts, as well as to a realizable piecewise-regression setting assuming access to an appropriate ERM oracle. Somewhat surprisingly, standard disagreement-based analyses are insufficient to achieve regret logarithmic in $1/\sigma$. Instead, we develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers. Along the way, we develop numerous technical tools of independent interest, including a general anti-concentration bound for the determinant of certain matrix averages.  ( 2 min )
    Classification ensembles for multivariate functional data with application to mouse movements in web surveys. (arXiv:2205.13380v1 [stat.ME])
    We propose new ensemble models for multivariate functional data classification as combinations of semi-metric-based weak learners. Our models extend current semi-metric-type methods from the univariate to the multivariate case, propose new semi-metrics to compute distances between functions, and consider more flexible options for combining weak learners using stacked generalisation methods. We apply these ensemble models to identify respondents' difficulty with survey questions, with the aim to improve survey data quality. As predictors of difficulty, we use mouse movement trajectories from the respondents' interaction with a web survey, in which several questions were manipulated to create two scenarios with different levels of difficulty.  ( 2 min )
    Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification. (arXiv:2205.13094v1 [cs.LG])
    While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an $\textit{undersampled}$ dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or if the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal, while in the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.  ( 2 min )
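    The intervention itself is trivial to implement, which is part of the paper's point. A minimal sketch (the group labels and sizes below are hypothetical):

        import numpy as np

        def undersample(X, y, group, seed=0):
            """Subsample every group down to the size of the smallest one."""
            rng = np.random.default_rng(seed)
            groups = np.unique(group)
            m = min((group == g).sum() for g in groups)
            keep = np.concatenate([
                rng.choice(np.flatnonzero(group == g), size=m, replace=False)
                for g in groups
            ])
            return X[keep], y[keep]

        rng = np.random.default_rng(1)
        X = rng.normal(size=(1000, 3))
        y = (X[:, 0] > 0).astype(int)
        group = (np.arange(1000) < 900).astype(int)   # 900 majority, 100 minority samples
        Xb, yb = undersample(X, y, group)
        print(len(Xb))                                # 200: the majority excess is discarded
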
    Preference Dynamics Under Personalized Recommendations. (arXiv:2205.13026v1 [cs.LG])
    Many projects (both practical and academic) have designed algorithms to match users to content they will enjoy under the assumption that users' preferences and opinions do not change with the content they see. Evidence suggests that individuals' preferences are directly shaped by what content they see -- radicalization, rabbit holes, polarization, and boredom are all example phenomena of preferences affected by content. Polarization in particular can occur even in ecosystems with "mass media," where no personalization takes place, as recently explored in a natural model of preference dynamics by~\citet{hkazla2019geometric} and~\citet{gaitonde2021polarization}. If all users' preferences are drawn towards content they already like, or are repelled from content they already dislike, uniform consumption of media leads to a population of heterogeneous preferences converging towards only two poles. In this work, we explore whether some phenomenon akin to polarization occurs when users receive \emph{personalized} content recommendations. We use a similar model of preference dynamics, where an individual's preferences move towards content they consume and enjoy, and away from content they consume and dislike. We show that standard user reward maximization is an almost trivial goal in such an environment (a large class of simple algorithms will achieve only constant regret). A more interesting objective, then, is to understand under what conditions a recommendation algorithm can ensure stationarity of users' preferences. We show how to design content recommendations that achieve approximate stationarity, under mild conditions on the set of available content, when a user's preferences are known, and how one can learn enough about a user's preferences to implement such a strategy even when user preferences are initially unknown.  ( 2 min )
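    A toy simulation in the spirit of the cited geometric preference-dynamics model -- unit-norm preferences drawn toward content they agree with and repelled otherwise -- is easy to write down. This is our own simplification; the sign-based update here is an assumption, not the paper's exact model:

        import numpy as np

        rng = np.random.default_rng(0)
        eta = 0.1
        users = rng.normal(size=(50, 2))
        users /= np.linalg.norm(users, axis=1, keepdims=True)
        for t in range(2000):
            c = rng.normal(size=2); c /= np.linalg.norm(c)   # shared "mass media" item
            s = np.sign(users @ c)                           # agree -> drawn, disagree -> repelled
            users = users + eta * s[:, None] * c
            users /= np.linalg.norm(users, axis=1, keepdims=True)
        # After many rounds, preferences tend to cluster near two opposite poles.
        print(np.round(users[:5], 2))
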

  • Open

    [P] Looking for datasets containing slide decks / powerpoint presentations
    I'm currently working on a project that aims to generate summaries of slide decks / powerpoint presentations. I'm looking to augment the dataset that I have with some additional slides (hopefully but not necessarily paired with summaries), and wondered if anyone here has come across a dataset I could use. My initial thoughts were to use research paper abstracts paired with conference presentation slides. I was able to collect such a dataset from past ICML publications, however many of the slides are overly technical and lack natural language (e.g. presentations consisting of nothing but equations). Additionally these presentations are quite a bit out of distribution with respect to the domain I'd be applying it to (business). So, my question: does anyone know of any publicly available datasets that are composed of slides/presentations, or of any other resources like conferences (particularly in business oriented fields, e.g., finance, marketing, business administration, operations management) that have easily accessible pairs of slide presentations and summaries (either paper abstracts or other forms)? Thanks in advance! submitted by /u/mamossa [link] [comments]  ( 1 min )
    [D] Understanding the fundamental maths vs just use existing Python libraries?
    I'm interested in developing trading algorithms (not bots) using machine learning techniques. I'm a very experienced software developer, I've worked in trading for many years and I am mathematical (couple of masters degrees in engineering- did Fourier Analysis, Laplace, autocorrelation, statistics etc) but haven't used it for years. Whilst looking for machine learning books there appear to be two approaches: Practical/use existing Python libraries (Hands on Machine Learning with Scikit......) Theory, understand the fundamental statistics (The Elements of Statistical Learning) Do I need to/what is the advantage of understanding the fundamental statistics? Is it possible to simply use existing ML libraries, configure parameters and run simulations? If I need to teach myself it, I will. But if there's very small benefit it's better I invest my time elsewhere. Any additional advice would be greatly appreciated. I'm not sure what holding-period yet (HFT or statistical arbitrage) but please feel free to distinguish if the answer differs. submitted by /u/reddit_faa7777 [link] [comments]  ( 2 min )
    [R] An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems - Google 2022 - Jeff Dean
    Paper: https://arxiv.org/abs/2205.12755 Abstract: "Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. Though, state of the art ML models rely on high customization for each task and leverage size and data scale rather than scaling the number of tasks. Also, continual learning, that adds the temporal aspect to multitask, is often focused to the study of common pitfalls such as catastrophic forgetting instead of being studied at a large scale as a critical component to build the next generation artificial intelligence. We propose an evolutionary method that can generate a large scale multitask model, and can support the dynamic and continuous addition of new tasks. The generated multitask model is sparsely activated and integrates a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We empirically show that the proposed method can jointly solve and achieve competitive results on 69 image classification tasks, for example achieving the best test accuracy reported for a model trained only on public data for competitive tasks such as cifar10: 99.43%." https://www.youtube.com/watch?v=Pcin4hPGaOk submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    [D] How can we approach this problem?
    You have a combined dataset consisting of 10 component datasets collected from 10 different sources. Independent models trained separately on each component dataset perform well on hold-out examples from that dataset. However, the aggregated model trained by combining the examples from all component datasets behaves weirdly. On hold-out examples from some component datasets, the aggregated model performs better than the independent models. On others, it performs worse than the independent models. During deployment, you expect to see input examples from these 10 component sources but also from many other sources which the model has not been trained on. What approach will you take to develop a model that will generalize well to examples from the seen and also the yet-unseen sources? submitted by /u/corporatededmeat [link] [comments]  ( 1 min )
    [R] New datasets for StyleGAN
    Hi all, The Author is here. TL;DR: We show how StyleGAN can be adapted to raw unaligned images collected from the Internet. New datasets and models are available. How can we adapt StyleGAN to more complicated datasets? We have witnessed that a data-centric approach is the most effective. Raw image collections downloaded from the internet contain many outlier images and are characterized by a multi-modal distribution. Therefore, we perform automatic self-supervised filtering of the training data to remove the outliers. Our key idea is to use the generator itself for the filtering. In the second step, we employ a multi-modal variant of the StyleGAN truncation trick. This allows high quality generation while preserving the remarkable editing capabilities of StyleGAN. For more details and cool gifs, check our Project Page: https://self-distilled-stylegan.github.io/ Datasets and models: https://github.com/self-distilled-stylegan/self-distilled-internet-photos The datasets also can be directly downloaded: https://github.com/rmokady/SDIP_utils Demo for image generation: https://huggingface.co/spaces/hysts/Self-Distilled-StyleGAN Feel free to ask anything that comes to your mind! submitted by /u/RonMokady [link] [comments]  ( 1 min )
    [R] CNNs are Myopic
    submitted by /u/downtownslim [link] [comments]  ( 3 min )
    [R] Large Language Models are Zero-Shot Reasoners. My summary: Adding text such as "Let’s think step by step" to a prompt "elicits chain of thought from large language models across a variety of reasoning tasks".
    submitted by /u/Wiskkey [link] [comments]  ( 4 min )
    [D] Semantic Segmentation/Remote Sensing Challenges
    Does anyone know of any interesting Semantic Segmentation and/or Remote Sensing competition taking place this summer? Most of what I found ends in the next 1-2 weeks. submitted by /u/incognitoacnt [link] [comments]
    [News] New Anomaly Advisor tab on Netdata
    Hello everyone, I work on the development of a host monitoring system called Netdata, free and open-source software. I thought it would be interesting to present here the new Anomaly Advisor tab, which uses machine learning to surface high anomaly rates across all your nodes. We made a post on our blog about it in more detail. https://www.netdata.cloud/blog/introducing-anomaly-advisor-unsupervised-anomaly-detection-in-netdata/ submitted by /u/roeant [link] [comments]
  • Open

    All ML and AI E-books for $10 Sale!
    submitted by /u/alimhabidi [link] [comments]
    How do leading AIs do on human IQ tests in 2022?
    Could not find an answer to this question online. submitted by /u/ShallowStroker [link] [comments]  ( 1 min )
    In this article, we show how to label PDFs and scanned images using OCR in order to create a training dataset for your NLP application.
    submitted by /u/UBIAI [link] [comments]  ( 1 min )
    Corporate America bets on AI productivity
    submitted by /u/mr_j_b [link] [comments]
    Latest Artificial Intelligence Technologies — AI
    submitted by /u/mr_j_b [link] [comments]
    HPE is building a rapid AI supercomputer powered by the world’s largest CPU
    submitted by /u/mr_j_b [link] [comments]
    FUTURE TECH ASIA China and Europe are leading the push to regulate A.I. — one of them could set the global playbook
    submitted by /u/mr_j_b [link] [comments]  ( 1 min )
    Could AI Image-Generation technology like DALL-E and Imagen be the future of messaging?
    I feel like the application of technologies like DALL-E and Imagen would be perfect for everyday communication with friends and family, in the same way emojis and images and gifs enhance messaging today. Imagine an in-app messaging feature for iMessage or WhatsApp which allows you to type a prompt then scroll through various AI-generated images or gifs that fit the prompt to send to your friends and family rather than the emojis and gifs that most of us use now. Over recent years we have adapted to send emojis and images and gifs to express complex emotions and sentiments that would take lots of text to accomplish, but they’re still approximations because we are relying on an existing set of created images and videos. I imagine made-to-measure images and videos could be a type of casual communication in the future. submitted by /u/Psychadiculous [link] [comments]  ( 1 min )
    Microsoft AI Researchers Introduces (De)ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
    While there are several benefits to using Artificial Intelligence, there are also drawbacks to this cutting-edge technology. One example is the creation of inappropriate language by language models. Because these models are trained on massive amounts of data, inappropriate language may be learned due to its presence in the training data. Content moderation techniques can be used to flag or filter such language in only specific cases; moreover, the datasets used to train these programs frequently fail to capture the complexity of potentially unsuitable and toxic language, particularly hate speech. Furthermore, the neutral samples in these datasets seldom contain group references. As a result, tools may flag even neutral language that refers to a minority identity group as hate speech. Inspired by large language models' capacity to emulate the tone, style, and vocabulary of the prompts they receive, whether toxic or benign, a dataset needs to be created for training content moderation algorithms to better detect implicitly harmful material. Continue Reading | Check out the paper, Microsoft blog and Github codes submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI-art isn't art
    submitted by /u/estasfuera [link] [comments]  ( 14 min )
    New framework for firms to ensure AI services are reliable, safe
    submitted by /u/kuang89 [link] [comments]  ( 1 min )
    AI Dream 53 - Cosmic Birth | 300 SUBS CELEBRATION
    submitted by /u/LordPewPew777 [link] [comments]
  • Open

    Help needed
    Please, does anyone have experience in using Gaussian processes to update q values instead of a deep learning network in an RL setting? I have separate codes for both but I need some help in merging them for a project. I will be happy to pay for the consultation if needed and feel free to give a referral if you know a friend or have access to some helpful resources. This RL is in an off-policy setting and for a static dataset. submitted by /u/Thin-Ad9581 [link] [comments]  ( 1 min )
    Classic control from pixels
    Hey all, I'm trying to play around with off-policy learning from pixels/images. To start easy, I thought it would be a good idea to take a classic control environment like Pendulum-v1 and try to solve it from pixels directly. To do this, I decided to wrap the gym env in a custom wrapper which seems to be working for all intents and purposes:

        class ImgObservationWrapper(Wrapper):
            def __init__(self, env):
                super(ImgObservationWrapper, self).__init__(env)
                self.reset()
                dummy_obs = self.process_img(env.render(mode="rgb_array"))
                self.observation_space = spaces.Box(
                    low=0, high=255, shape=dummy_obs.shape, dtype=dummy_obs.dtype
                )

            def reset(self, **kwargs):
                obs = self.env.reset(**kwargs)
                obs = self.process_img(self.env.render(mode="rgb_array"))
                return obs

            def process_img(self, observation, crop…  ( 1 min )
    Help with MADDPG on Food Collector (Unity ML-Agents)
    Can somebody help me with MADDPG? I am trying to apply it to Food Collector (Unity ML-Agents), and my model is constantly converging to action extremes (either -1 or 1) after only a couple of steps. There are 5 agents, 40x40x5 observations, 3 continuous actions and 1 discrete action. Buffer size is 10k, and the agent takes random actions for the first 10k steps. [Plot: the logged actions of agent #1, showing the model quickly converging to -1 or 1 (in this case -1, -1, 1).] submitted by /u/TheGuy839 [link] [comments]  ( 1 min )
  • Open

    Breakthrough Google AI Text to Image Imagen Beats Dalle-2 With Unprecedented Photorealism and Deep Level of Language Understanding
    submitted by /u/getrich_or_diemining [link] [comments]
  • Open

    How AI Impact the future of Human Capital Management in 2022  ( 5 min )
  • Open

    A Devotion to Emotion: Hume AI’s Alan Cowen on the Intersection of AI and Empathy
    Can machines experience emotions? They might, according to Hume AI, an AI research lab and technology company that aims to “ensure artificial intelligence is built to serve human goals and emotional well-being.” So how can AI genuinely understand how we are feeling, and respond appropriately? On this episode of NVIDIA’s AI Podcast, host Noah Kravitz Read article > The post A Devotion to Emotion: Hume AI’s Alan Cowen on the Intersection of AI and Empathy appeared first on NVIDIA Blog.  ( 2 min )
    Ready, Set, Game: GFN Thursday Brings 10 New Titles to GeForce NOW
    It’s a beautiful day to play video games. And it’s GFN Thursday, which means we’ve got those games. Ten total titles join the GeForce NOW library of over 1,300 games, starting with the release of Roller Champions – a speedy, free-to-play roller skating title launching with competitive season 0. Rollin’ Into the Weekend Roll with Read article > The post Ready, Set, Game: GFN Thursday Brings 10 New Titles to GeForce NOW appeared first on NVIDIA Blog.  ( 2 min )
    Deciphering the Future: HPE Switches on AI Supercomputer in France
    Recalling the French linguist who deciphered the Rosetta Stone 150 years ago, Hewlett Packard Enterprise today switched on a tool to unravel its customers’ knottiest problems. The Champollion AI supercomputer takes its name from Jean-François Champollion (1790-1832), who decoded hieroglyphics that opened a door to study of ancient Egypt’s culture. Like Champollion, the mega-system resides Read article > The post Deciphering the Future: HPE Switches on AI Supercomputer in France appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Interactive Packaging: How to Make Packaging Smarter with AI and IoT
    Today, packaging is not just about covering a product for a better sale. Interactive packaging is a new trend in the packaging industry that focuses mainly on customer satisfaction and engagement. BLE codes, AI, and IoT are key technologies in interactive packaging, helping users with product attributes and usage instructions. Also, Incorporation… Read More »Interactive Packaging: How to Make Packaging Smarter with AI and IoT The post Interactive Packaging: How to Make Packaging Smarter with AI and IoT appeared first on Data Science Central.  ( 2 min )
    How to Choose the Best Salon CRM Software
    As you may know, everyone desires to look good, or better than everyone else. To look prettier, cooler, smarter, and bolder, people go to salons, because a salon is a place that takes care of your outer beauty and makes you look prettier than before. They use different products on your body and which… Read More »How to Choose the Best Salon CRM Software The post How to Choose the Best Salon CRM Software appeared first on Data Science Central.  ( 5 min )
  • Open

    Deep interpretable ensembles. (arXiv:2205.12729v1 [stat.ML])
    Ensembles improve prediction performance and allow uncertainty quantification by aggregating predictions from multiple models. In deep ensembling, the individual models are usually black box neural networks, or recently, partially interpretable semi-structured deep transformation models. However, interpretability of the ensemble members is generally lost upon aggregation. This is a crucial drawback of deep ensembles in high-stake decision fields, in which interpretable models are desired. We propose a novel transformation ensemble which aggregates probabilistic predictions with the guarantee to preserve interpretability and yield uniformly better predictions than the ensemble members on average. Transformation ensembles are tailored towards interpretable deep transformation models but are applicable to a wider range of probabilistic neural networks. In experiments on several publicly available data sets, we demonstrate that transformation ensembles perform on par with classical deep ensembles in terms of prediction performance, discrimination, and calibration. In addition, we demonstrate how transformation ensembles quantify both aleatoric and epistemic uncertainty, and produce minimax optimal predictions under certain conditions.  ( 2 min )
    MAVIPER: Learning Decision Tree Policies for Interpretable Multi-Agent Reinforcement Learning. (arXiv:2205.12449v1 [cs.LG])
    Many recent breakthroughs in multi-agent reinforcement learning (MARL) require the use of deep neural networks, which are challenging for human experts to interpret and understand. On the other hand, existing work on interpretable RL has shown promise in extracting more interpretable decision tree-based policies, but only in the single-agent setting. To fill this gap, we propose the first set of interpretable MARL algorithms that extract decision-tree policies from neural networks trained with MARL. The first algorithm, IVIPER, extends VIPER, a recent method for single-agent interpretable RL, to the multi-agent setting. We demonstrate that IVIPER can learn high-quality decision-tree policies for each agent. To better capture coordination between agents, we propose a novel centralized decision-tree training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees, and uses resampling to focus on states that are critical for its interactions with other agents. We show that both algorithms generally outperform the baselines and that MAVIPER-trained agents achieve better-coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments.  ( 2 min )
    ORCA: Interpreting Prompted Language Models via Locating Supporting Data Evidence in the Ocean of Pretraining Data. (arXiv:2205.12600v1 [cs.CL])
    Large pretrained language models have been performing increasingly well in a variety of downstream tasks via prompting. However, it remains unclear from where the model learns the task-specific knowledge, especially in a zero-shot setup. In this work, we want to find evidence of the model's task-specific competence from pretraining and are specifically interested in locating a very small subset of pretraining data that directly supports the model in the task. We call such a subset supporting data evidence and propose a novel method ORCA to effectively identify it, by iteratively using gradient information related to the downstream task. This supporting data evidence offers interesting insights about the prompted language models: in the tasks of sentiment analysis and textual entailment, BERT shows a substantial reliance on BookCorpus, the smaller corpus of BERT's two pretraining corpora, as well as on pretraining examples that mask out synonyms to the task verbalizers.  ( 2 min )
    Online Metro Origin-Destination Prediction via Heterogeneous Information Aggregation. (arXiv:2107.00946v5 [cs.LG] UPDATED)
    Metro origin-destination prediction is a crucial yet challenging time-series analysis task in intelligent transportation systems, which aims to accurately forecast two specific types of cross-station ridership, i.e., Origin-Destination (OD) one and Destination-Origin (DO) one. However, complete OD matrices of previous time intervals cannot be obtained immediately in online metro systems, and conventional methods only used limited information to forecast the future OD and DO ridership separately. In this work, we proposed a novel neural network module termed Heterogeneous Information Aggregation Machine (HIAM), which fully exploits heterogeneous information of historical data (e.g., incomplete OD matrices, unfinished order vectors, and DO matrices) to jointly learn the evolutionary patterns of OD and DO ridership. Specifically, an OD modeling branch estimates the potential destinations of unfinished orders explicitly to complement the information of incomplete OD matrices, while a DO modeling branch takes DO matrices as input to capture the spatial-temporal distribution of DO ridership. Moreover, a Dual Information Transformer is introduced to propagate the mutual information among OD features and DO features for modeling the OD-DO causality and correlation. Based on the proposed HIAM, we develop a unified Seq2Seq network to forecast the future OD and DO ridership simultaneously. Extensive experiments conducted on two large-scale benchmarks demonstrate the effectiveness of our method for online metro origin-destination prediction. Our code is released at https://github.com/HCPLab-SYSU/HIAM.  ( 2 min )
    Surprises in adversarially-trained linear regression. (arXiv:2205.12695v1 [stat.ML])
    State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is one of the most effective approaches to defend against such examples. We show that for linear regression problems, adversarial training can be formulated as a convex problem. This fact is then used to show that $\ell_\infty$-adversarial training produces sparse solutions and has many similarities to the lasso method. Similarly, $\ell_2$-adversarial training has similarities with ridge regression. We use a robust regression framework to analyze and understand these similarities and also point to some differences. Finally, we show how adversarial training behaves differently from other regularization methods when estimating overparameterized models (i.e., models with more parameters than datapoints). It minimizes a sum of three terms which regularizes the solution, but unlike lasso and ridge regression, it can sharply transition into an interpolation mode. We show that for sufficiently many features or sufficiently small regularization parameters, the learned model perfectly interpolates the training data while still exhibiting good out-of-sample performance.  ( 2 min )
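    Assuming the standard closed form for the inner maximization -- for an $\ell_\infty$ perturbation of radius $\delta$, $\max_{\|\Delta\|_\infty \le \delta} (y - (x+\Delta)^\top w)^2 = (|y - x^\top w| + \delta \|w\|_1)^2$ -- the adversarial training objective becomes an ordinary convex function of $w$ with a lasso-like $\ell_1$ term, as a quick numpy sketch shows:

        import numpy as np

        def adv_loss(w, X, y, delta):
            # max over ||dx||_inf <= delta of (y - (x+dx).w)^2 = (|y - x.w| + delta*||w||_1)^2
            return np.mean((np.abs(y - X @ w) + delta * np.abs(w).sum()) ** 2)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))
        w_true = np.array([1.0, -2.0, 0.0, 0.0, 0.0])
        y = X @ w_true + 0.1 * rng.normal(size=200)

        w_dense = w_true + 0.2            # similar model but with a much larger l1 norm
        for delta in (0.0, 0.2):
            print(delta, adv_loss(w_true, X, y, delta), adv_loss(w_dense, X, y, delta))
        # With delta > 0 the l1 term penalizes the dense model far more heavily, which is
        # why minimizers of the adversarial objective are pushed toward sparse solutions.
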
    Differentially Private AUC Computation in Vertical Federated Learning. (arXiv:2205.12412v1 [cs.LG])
    Federated learning has gained great attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple parties. As a sub-category, vertical federated learning (vFL) focuses on the scenario where features and labels are split into different parties. The prior work on vFL has mostly studied how to protect label privacy during model training. However, model evaluation in vFL might also lead to potential leakage of private label information. One mitigation strategy is to apply label differential privacy (DP) but it gives bad estimations of the true (non-private) metrics. In this work, we propose two evaluation algorithms that can more accurately compute the widely used AUC (area under curve) metric when using label DP in vFL. Through extensive experiments, we show our algorithms can achieve more accurate AUCs compared to the baselines.  ( 2 min )
    Wavelet Feature Maps Compression for Image-to-Image CNNs. (arXiv:2205.12268v1 [cs.CV])
    Convolutional Neural Networks (CNNs) are known for requiring extensive computational resources, and quantization is among the best and most common methods for compressing them. While aggressive quantization (i.e., less than 4-bits) performs well for classification, it may cause severe performance degradation in image-to-image tasks such as semantic segmentation and depth estimation. In this paper, we propose Wavelet Compressed Convolution (WCC) -- a novel approach for high-resolution activation maps compression integrated with point-wise convolutions, which are the main computational cost of modern architectures. To this end, we use an efficient and hardware-friendly Haar-wavelet transform, known for its effectiveness in image compression, and define the convolution on the compressed activation map. We experiment on various tasks, that benefit from high-resolution input, and by combining WCC with light quantization, we achieve compression rates equivalent to 1-4bit activation quantization with relatively small and much more graceful degradation in performance.  ( 2 min )
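    A single level of the 2D Haar transform, the building block the method relies on, takes only a few lines; this sketch illustrates the transform itself, not the compressed convolution:

        import numpy as np

        def haar2d(x):
            """One level of the orthonormal 2D Haar transform on an (H, W) map."""
            a = x[0::2, 0::2]; b = x[0::2, 1::2]
            c = x[1::2, 0::2]; d = x[1::2, 1::2]
            ll = (a + b + c + d) / 2          # low-low: a coarse, compressible summary
            lh = (a - b + c - d) / 2          # horizontal detail
            hl = (a + b - c - d) / 2          # vertical detail
            hh = (a - b - c + d) / 2          # diagonal detail
            return ll, lh, hl, hh

        x = np.random.default_rng(0).normal(size=(8, 8))
        ll, lh, hl, hh = haar2d(x)
        # The transform is orthogonal: it preserves energy and is exactly invertible.
        print(np.allclose((ll**2 + lh**2 + hl**2 + hh**2).sum(), (x**2).sum()))
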
    Giga-scale Kernel Matrix Vector Multiplication on GPU. (arXiv:2202.01085v2 [math.NA] UPDATED)
    Kernel matrix-vector multiplication (KMVM) is a foundational operation in machine learning and scientific computing. However, as KMVM tends to scale quadratically in both memory and time, applications are often limited by these computational constraints. In this paper, we propose a novel approximation procedure coined \textit{Faster-Fast and Free Memory Method} ($\text{F}^3$M) to address these scaling issues of KMVM for tall ($10^8\sim 10^9$) and skinny ($D\leq7$) data. Extensive experiments demonstrate that $\text{F}^3$M has empirical \emph{linear time and memory} complexity with a relative error of order $10^{-3}$ and can compute a full KMVM for a billion points \emph{in under a minute} on a high-end GPU, leading to a significant speed-up in comparison to existing CPU methods. We demonstrate the utility of our procedure by applying it as a drop-in for the state-of-the-art GPU-based linear solver FALKON, \emph{improving speed 1.5-5.5 times} at the cost of $<1\%$ drop in accuracy. We further demonstrate competitive results on \emph{Gaussian Process regression} coupled with significant speedups on a variety of real-world datasets.  ( 2 min )
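    For scale, the exact baseline such methods compete against is the quadratic-cost product below (a plain numpy sketch; the Gaussian kernel and lengthscale are illustrative choices):

        import numpy as np

        def kmvm(X, Y, v, lengthscale=1.0):
            """Exact Gaussian-kernel matrix-vector product: O(N*M) time and memory,
            which is precisely the cost that fast approximations try to beat."""
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            K = np.exp(-d2 / (2 * lengthscale ** 2))
            return K @ v

        rng = np.random.default_rng(0)
        X, Y = rng.normal(size=(500, 3)), rng.normal(size=(400, 3))
        v = rng.normal(size=400)
        print(kmvm(X, Y, v).shape)   # (500,)
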
    Sketch guided and progressive growing GAN for realistic and editable ultrasound image synthesis. (arXiv:2204.06929v3 [eess.IV] UPDATED)
    Ultrasound (US) imaging is widely used for anatomical structure inspection in clinical diagnosis. The training of new sonographers and deep learning based algorithms for US image analysis usually requires a large amount of data. However, obtaining and labeling large-scale US imaging data are not easy tasks, especially for diseases with low incidence. Realistic US image synthesis can alleviate this problem to a great extent. In this paper, we propose a generative adversarial network (GAN) based image synthesis framework. Our main contributions include: 1) we present the first work that can synthesize realistic B-mode US images with high-resolution and customized texture editing features; 2) to enhance structural details of generated images, we propose to introduce auxiliary sketch guidance into a conditional GAN. We superpose the edge sketch onto the object mask and use the composite mask as the network input; 3) to generate high-resolution US images, we adopt a progressive training strategy to gradually generate high-resolution images from low-resolution images. In addition, a feature loss is proposed to minimize the difference of high-level features between the generated and real images, which further improves the quality of generated images; 4) the proposed US image synthesis method is quite universal and can also be generalized to the US images of other anatomical structures besides the three ones tested in our study (lung, hip joint, and ovary); 5) extensive experiments on three large US image datasets are conducted to validate our method. Ablation studies, customized texture editing, user studies, and segmentation tests demonstrate promising results of our method in synthesizing realistic US images.  ( 3 min )
    TrustGNN: Graph Neural Network based Trust Evaluation via Learnable Propagative and Composable Nature. (arXiv:2205.12784v1 [cs.LG])
    Trust evaluation is critical for many applications such as cyber security, social communication and recommender systems. Users and trust relationships among them can be seen as a graph. Graph neural networks (GNNs) show their powerful ability for analyzing graph-structural data. Very recently, existing work attempted to introduce the attributes and asymmetry of edges into GNNs for trust evaluation, but failed to capture some essential properties (e.g., the propagative and composable nature) of trust graphs. In this work, we propose a new GNN based trust evaluation method named TrustGNN, which integrates smartly the propagative and composable nature of trust graphs into a GNN framework for better trust evaluation. Specifically, TrustGNN designs specific propagative patterns for different propagative processes of trust, and distinguishes the contribution of different propagative processes to create new trust. Thus, TrustGNN can learn comprehensive node embeddings and predict trust relationships based on these embeddings. Experiments on some widely-used real-world datasets indicate that TrustGNN significantly outperforms the state-of-the-art methods. We further perform analytical experiments to demonstrate the effectiveness of the key designs in TrustGNN.  ( 2 min )
    To Impute or not to Impute? Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v3 [stat.ML] UPDATED)
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population in distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.  ( 2 min )
    Cross Domain Few-Shot Learning via Meta Adversarial Training. (arXiv:2202.05713v3 [cs.LG] UPDATED)
    Few-shot relation classification (RC) is one of the critical problems in machine learning. Current research focuses mainly on set-ups where both training and testing data come from the same domain. However, in practice, this assumption does not always hold. In this study, we present a novel model that takes the aforementioned cross-domain situation into consideration. Unlike previous models, we use only the source domain data to train the prototypical networks and test the model on target domain data. A meta-based adversarial training framework (MBATF) is proposed to fine-tune the trained networks for adapting to data from the target domain. Empirical studies confirm the effectiveness of the proposed model.  ( 2 min )
    HEBO Pushing The Limits of Sample-Efficient Hyperparameter Optimisation. (arXiv:2012.03826v6 [cs.LG] UPDATED)
    In this work we rigorously analyse assumptions inherent to black-box optimisation hyper-parameter tuning tasks. Our results on the Bayesmark benchmark indicate that heteroscedasticity and non-stationarity pose significant challenges for black-box optimisers. Based on these findings, we propose a Heteroscedastic and Evolutionary Bayesian Optimisation solver (HEBO). HEBO performs non-linear input and output warping, admits exact marginal log-likelihood optimisation and is robust to the values of learned parameters. We demonstrate HEBO's empirical efficacy on the NeurIPS 2020 Black-Box Optimisation challenge, where HEBO placed first. Upon further analysis, we observe that HEBO significantly outperforms existing black-box optimisers on 108 machine learning hyperparameter tuning tasks comprising the Bayesmark benchmark. Our findings indicate that the majority of hyper-parameter tuning tasks exhibit heteroscedasticity and non-stationarity, multi-objective acquisition ensembles with Pareto front solutions improve queried configurations, and robust acquisition maximisers afford empirical advantages relative to their non-robust counterparts. We hope these findings may serve as guiding principles for practitioners of Bayesian optimisation. All code is made available at https://github.com/huawei-noah/HEBO.  ( 2 min )
    Differentially Private Data Generation Needs Better Features. (arXiv:2205.12900v1 [stat.ML])
    Training even moderately-sized generative models with differentially-private stochastic gradient descent (DP-SGD) is difficult: the required level of noise for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation on public data, then using private data only for "transfer learning." In particular, we minimize the maximum mean discrepancy (MMD) between private target data and the generated distribution, using a kernel based on perceptual features from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images faithfully with $\varepsilon \approx 2$, far surpassing the current state of the art, which only models MNIST and FashionMNIST at $\varepsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models.  ( 2 min )
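    A sketch of the "privatize once" idea for the simplest case of a linear kernel on fixed features: clip feature norms to bound sensitivity, add Gaussian noise to the mean embedding a single time, and train against the noisy target thereafter. The noise scale below is a placeholder rather than a calibrated $(\varepsilon, \delta)$ guarantee:

        import numpy as np

        rng = np.random.default_rng(0)
        d = 128                                     # perceptual feature dimension (assumed)
        phi_private = rng.normal(size=(5000, d))    # stand-in for features of private data
        phi_private /= np.linalg.norm(phi_private, axis=1, keepdims=True)  # bound sensitivity

        # Privatize the data-dependent term once: Gaussian mechanism on the mean embedding.
        sigma = 2.0 / len(phi_private)              # noise scale; set by (eps, delta) in practice
        mu_priv = phi_private.mean(axis=0) + rng.normal(scale=sigma, size=d)

        # Training then only ever touches mu_priv; e.g. the squared-MMD term for a batch
        # of generated features phi_gen is:
        phi_gen = rng.normal(size=(256, d))
        mmd_sq = ((phi_gen.mean(axis=0) - mu_priv) ** 2).sum()
        print(mmd_sq)
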
    Deep Reinforcement Learning Guided Graph Neural Networks for Brain Network Analysis. (arXiv:2203.10093v2 [cs.LG] UPDATED)
    Modern neuroimaging techniques, such as diffusion tensor imaging (DTI) and functional magnetic resonance imaging (fMRI), enable us to model the human brain as a brain network or connectome. Capturing brain networks' structural information and hierarchical patterns is essential for understanding brain functions and disease states. Recently, the promising network representation learning capability of graph neural networks (GNNs) has prompted many GNN-based methods for brain network analysis to be proposed. Specifically, these methods apply feature aggregation and global pooling to convert brain network instances into meaningful low-dimensional representations used for downstream brain network analysis tasks. However, existing GNN-based methods often neglect that brain networks of different subjects may require various aggregation iterations and use GNN with a fixed number of layers to learn all brain networks. Therefore, how to fully release the potential of GNNs to promote brain network analysis is still non-trivial. To solve this problem, we propose a novel brain network representation framework, namely BN-GNN, which searches for the optimal GNN architecture for each brain network. Concretely, BN-GNN employs deep reinforcement learning (DRL) to train a meta-policy to automatically determine the optimal number of feature aggregations (reflected in the number of GNN layers) required for a given brain network. Extensive experiments on eight real-world brain network datasets demonstrate that our proposed BN-GNN improves the performance of traditional GNNs on different brain network analysis tasks.  ( 2 min )
    Robust Reinforcement Learning on Graphs for Logistics optimization. (arXiv:2205.12888v1 [cs.LG])
    Logistics optimization is nowadays becoming one of the hottest areas in the AI community. In the past year, significant advancements in the domain were achieved by representing the problem in the form of a graph. Another promising area of research is applying reinforcement learning algorithms to this task. In our work, we take advantage of both approaches and apply reinforcement learning on a graph. To do so, we analyzed the most recent results in both fields and selected SOTA algorithms from both graph neural networks and reinforcement learning. We then combined the selected models on the problem of AMOD system optimization for the transportation network of New York City. Our team compared three algorithms -- GAT, Pro-CNN and PTDNet -- to bring the important nodes of the graph representation to the fore. Finally, we achieved SOTA results on the AMOD system optimization problem by employing PTDNet with a GNN and training them in a reinforcement learning fashion. Keywords: Graph Neural Network (GNN), Logistics optimization, Reinforcement Learning  ( 2 min )
    Residual-Concatenate Neural Network with Deep Regularization Layers for Binary Classification. (arXiv:2205.12775v1 [cs.LG])
    Many complex Deep Learning models with different variations are used for various prognostication tasks. A larger number of learnable parameters does not necessarily ensure better accuracy. This can be addressed by pairing very deep models with many regularization-based techniques. In this paper we train a deep neural network that uses many regularization layers, together with residual and concatenation connections, to best fit the task of Polycystic Ovary Syndrome diagnosis prognostication. The network was built by iterating on every failure to meet the needs of the data, and it achieves an accuracy of 99.3% seamlessly.  ( 2 min )
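    As a rough illustration of the architecture described above -- regularization layers combined with residual and concatenation connections -- the following PyTorch sketch builds one such block; the layer sizes, dropout rate, and feature count are hypothetical.
        import torch
        import torch.nn as nn

        class ResConcatBlock(nn.Module):
            """One block: dense -> ReLU -> batch norm -> dropout, with a residual
            add and a concatenation of the block input to its output."""
            def __init__(self, dim, p_drop=0.3):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(dim, dim), nn.ReLU(),
                    nn.BatchNorm1d(dim), nn.Dropout(p_drop),
                )
            def forward(self, x):
                h = self.net(x) + x          # residual connection
                return torch.cat([h, x], 1)  # concatenation doubles the width

        class Classifier(nn.Module):
            def __init__(self, in_dim):
                super().__init__()
                self.b1 = ResConcatBlock(in_dim)
                self.b2 = ResConcatBlock(2 * in_dim)
                self.head = nn.Linear(4 * in_dim, 1)  # binary output
            def forward(self, x):
                return self.head(self.b2(self.b1(x)))

        model = Classifier(in_dim=41)  # e.g., 41 clinical features (hypothetical)
        logits = model(torch.randn(8, 41))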
    Gradient-based explanations for Gaussian Process regression and classification models. (arXiv:2205.12797v1 [cs.LG])
    Gaussian Processes (GPs) have proven themselves a reliable and effective method in probabilistic Machine Learning. Thanks to recent and current advances, modeling complex data with GPs is becoming more and more feasible. Thus, these types of models are, nowadays, an interesting alternative to Neural and Deep Learning methods, which are arguably the current state-of-the-art in Machine Learning. For the latter, we see an increasing interest in so-called explainable approaches - in essence, methods that aim to make a Machine Learning model's decision process transparent to humans. Such methods are particularly needed when illogical or biased reasoning can lead to actual disadvantageous consequences for humans. Ideally, explainable Machine Learning should help detect such flaws in a model and aid a subsequent debugging process. One active line of research in Machine Learning explainability is gradient-based methods, which have been successfully applied to complex neural networks. Given that GPs are closed under differentiation, gradient-based explainability for GPs appears to be a promising field of research. This paper is primarily focused on explaining GP classifiers via gradients, where, contrary to GP regression, derivative GPs are not straightforward to obtain.  ( 2 min )
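    Because GPs are closed under differentiation, the gradient of a GP-regression posterior mean has a closed form; the NumPy sketch below computes it for an RBF kernel (hyperparameters fixed by assumption), which is the kind of quantity a gradient-based explanation attributes back to the inputs.
        import numpy as np

        def rbf(a, b, ell=1.0):
            d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
            return np.exp(-0.5 * d2 / ell**2)

        # Fit exact GP regression (hyperparameters assumed fixed for this sketch).
        rng = np.random.default_rng(0)
        X = rng.uniform(-3, 3, size=(40, 2)); y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
        ell, noise = 1.0, 1e-2
        alpha = np.linalg.solve(rbf(X, X, ell) + noise * np.eye(len(X)), y)

        def mean_gradient(x_star):
            # d m(x*) / d x* = sum_i alpha_i * k(x*, x_i) * (x_i - x*) / ell^2
            k = rbf(x_star[None, :], X, ell)[0]
            return ((X - x_star) * (k * alpha)[:, None]).sum(0) / ell**2

        g = mean_gradient(np.zeros(2))  # per-dimension saliency of the inputs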
    Augmentation-induced Consistency Regularization for Classification. (arXiv:2205.12461v1 [cs.LG])
    Deep neural networks have become popular in many supervised learning tasks, but they may suffer from overfitting when the training dataset is limited. To mitigate this, many researchers use data augmentation, which is a widely used and effective method for increasing the variety of datasets. However, the randomness introduced by data augmentation causes inevitable inconsistency between training and inference, which limits the achievable improvement. In this paper, we propose a consistency regularization framework based on data augmentation, called CR-Aug, which forces the output distributions of the different sub-models generated by data augmentation to be consistent with each other. Specifically, CR-Aug evaluates the discrepancy between the output distributions of two augmented versions of each sample, and it utilizes a stop-gradient operation to minimize the consistency loss. We apply CR-Aug to image and audio classification tasks and conduct extensive experiments to verify its effectiveness in improving the generalization ability of classifiers. Our CR-Aug framework is ready to use: it can be easily adapted to many state-of-the-art network architectures. Our empirical results show that CR-Aug outperforms baseline methods by a significant margin.  ( 2 min )
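    A minimal reading of the CR-Aug objective -- a symmetric discrepancy between the outputs on two augmented views, with a stop-gradient on the target branch -- might look like the following PyTorch sketch; the exact discrepancy measure used in the paper may differ.
        import torch
        import torch.nn.functional as F

        def cr_aug_loss(model, x_aug1, x_aug2):
            """Symmetric consistency loss between two augmented views of the same
            batch, with a stop-gradient on the 'target' branch (our reading of
            the abstract; the paper's exact discrepancy may differ)."""
            p1 = F.log_softmax(model(x_aug1), dim=1)
            p2 = F.log_softmax(model(x_aug2), dim=1)
            kl_12 = F.kl_div(p1, p2.detach().exp(), reduction="batchmean")
            kl_21 = F.kl_div(p2, p1.detach().exp(), reduction="batchmean")
            return 0.5 * (kl_12 + kl_21)

        # total loss = supervised cross-entropy + lambda * cr_aug_loss(...)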
    Exact Convergence Rates of the Neural Tangent Kernel in the Large Depth Limit. (arXiv:1905.13654v11 [stat.ML] UPDATED)
    Recent work by Jacot et al. (2018) has shown that training a neural network using gradient descent in parameter space is related to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result by establishing that the output of a neural network trained using gradient descent can be approximated by a linear model when the network width is large. Indeed, under regularity conditions, the NTK converges to a time-independent kernel in the infinite-width limit. This regime is often called the NTK regime. In parallel, recent works on signal propagation (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019a) studied the impact of the initialization and the activation function on signal propagation in deep neural networks. In this paper, we connect these two theories by quantifying the impact of the initialization and the activation function on the NTK when the network depth becomes large. In particular, we provide a comprehensive analysis of the convergence rates of the NTK regime to the infinite depth regime.  ( 2 min )
    FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning. (arXiv:2205.07246v2 [cs.LG] UPDATED)
    Pseudo labeling and consistency regularization approaches based on confidence thresholding have made great progress in semi-supervised learning (SSL). However, we argue that existing methods might fail to adopt suitable thresholds since they either use a pre-defined / fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to achieve some intuitions on the relationship between the desirable threshold and model's learning status. Based on the analysis, we hence propose FreeMatch to define and adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty that encourages the model to produce diverse predictions during the early stages of training. Extensive experimental results indicate the superiority of FreeMatch especially when the labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively.
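    In the spirit of the self-adaptive threshold, the sketch below tracks a single global threshold as an EMA of the model's mean confidence on unlabeled batches; FreeMatch itself additionally modulates the threshold per class, so this is a simplification with hypothetical constants.
        import torch

        class AdaptiveThreshold:
            """Tracks a global confidence threshold as an EMA of the model's mean
            max-probability on unlabeled batches (a simplified reading of
            FreeMatch, which also adjusts the threshold per class)."""
            def __init__(self, num_classes, momentum=0.999):
                self.tau = 1.0 / num_classes  # start permissive
                self.m = momentum
            def update(self, probs):          # probs: (B, C) softmax outputs
                conf = probs.max(dim=1).values.mean().item()
                self.tau = self.m * self.tau + (1 - self.m) * conf
            def mask(self, probs):            # which pseudo-labels to keep
                return probs.max(dim=1).values >= self.tau

        th = AdaptiveThreshold(num_classes=10)
        probs = torch.softmax(torch.randn(64, 10), dim=1)
        th.update(probs)
        keep = th.mask(probs)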
    A Neural Tangent Kernel Formula for Ensembles of Soft Trees with Arbitrary Architectures. (arXiv:2205.12904v1 [cs.LG])
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although it can have various tree architectures, the theoretical properties of their impact are not well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with infinitely many trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.
    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v1 [cs.LG])
    Learning causal structure poses a combinatorial search problem that typically involves evaluating structures using a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize the process of causal structure learning. Rather than searching over causal structures directly, we train a variational inference model to predict the causal structure from observational/interventional data. Our inference model acquires domain-specific inductive bias for causal discovery solely from data generated by a simulator. This allows us to bypass both the search over graphs and the hand-engineering of suitable score functions. Moreover, the architecture of our inference model is permutation invariant w.r.t. the data points and permutation equivariant w.r.t. the variables, facilitating generalization to significantly larger problem instances than seen during training. On synthetic data and semi-synthetic gene expression data, our models exhibit robust generalization capabilities under substantial distribution shift and significantly outperform existing algorithms, especially in the challenging genomics domain.
    Skill Machines: Temporal Logic Composition in Reinforcement Learning. (arXiv:2205.12532v1 [cs.LG])
    A major challenge in reinforcement learning is specifying tasks in a manner that is both interpretable and verifiable. One common approach is to specify tasks through reward machines -- finite state machines that encode the task to be solved. We introduce skill machines, a representation that can be learned directly from these reward machines and that encodes the solution to such tasks. We propose a framework where an agent first learns a set of base skills in a reward-free setting, and then combines these skills with the learned skill machine to produce composite behaviours specified by any regular language, such as linear temporal logic. This provides the agent with the ability to map from complex logical task specifications to near-optimal behaviours zero-shot. We demonstrate our approach in both a tabular and a high-dimensional video game environment, where an agent is faced with several of these complex, long-horizon tasks. Our results indicate that the agent is capable of satisfying extremely complex task specifications, producing near-optimal performance with no further learning. Finally, we demonstrate that the performance of skill machines can be improved with regular offline reinforcement learning algorithms when optimal behaviours are desired.
    Sharpness-Aware Minimization with Dynamic Reweighting. (arXiv:2112.08772v3 [cs.LG] UPDATED)
    Deep neural networks are often overparameterized and may not easily achieve model generalization. Adversarial training has shown effectiveness in improving generalization by regularizing the change of loss on top of adversarially chosen perturbations. The recently proposed sharpness-aware minimization (SAM) algorithm conducts adversarial weight perturbation, encouraging the model to converge to a flat minimum. SAM finds a common adversarial weight perturbation per batch. Although per-instance adversarial weight perturbations are stronger adversaries and can potentially lead to better generalization performance, their computational cost is very high, making it impractical to use per-instance perturbations efficiently in SAM. In this paper, we tackle this efficiency bottleneck and propose sharpness-aware minimization with dynamic reweighting ({\delta}-SAM). Our theoretical analysis shows that it is possible to approximate the stronger, per-instance adversarial weight perturbations using reweighted per-batch weight perturbations. {\delta}-SAM dynamically reweights the perturbation within each batch according to theoretically principled weighting factors, serving as a good approximation to per-instance perturbation. Experiments on various natural language understanding tasks demonstrate the effectiveness of {\delta}-SAM.
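    For reference, one vanilla SAM update with a shared per-batch perturbation can be sketched as below; {\delta}-SAM would additionally reweight the examples inside the batch before the ascent step, which this simplified sketch omits.
        import torch

        def sam_step(model, loss_fn, x, y, opt, rho=0.05):
            """One SAM update: ascend to a per-batch adversarial weight
            perturbation, then descend from the perturbed point."""
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            grads = [p.grad.detach().clone() for p in model.parameters()]
            norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
            eps = [rho * g / (norm + 1e-12) for g in grads]
            with torch.no_grad():
                for p, e in zip(model.parameters(), eps):
                    p.add_(e)                      # w + eps (ascent step)
            opt.zero_grad()
            loss_fn(model(x), y).backward()        # gradient at the perturbed point
            with torch.no_grad():
                for p, e in zip(model.parameters(), eps):
                    p.sub_(e)                      # restore w
            opt.step()                             # descend with the SAM gradient
            opt.zero_grad()

        model = torch.nn.Linear(10, 2)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
        sam_step(model, torch.nn.functional.cross_entropy, x, y, opt)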
    Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. (arXiv:2205.12685v1 [cs.CL])
    Despite the recent explosion of research interest, in-context learning and the precise impact of the quality of demonstrations remain elusive. While, based on the current literature, in-context learning is expected to share a similar mechanism with supervised learning, Min et al. (2022) recently reported that, surprisingly, input-label correspondence is less important than other aspects of prompt demonstrations. Inspired by this counter-intuitive observation, we re-examine the importance of ground truth labels in in-context learning from diverse and statistical points of view. With the aid of the newly introduced metrics, i.e., Ground-truth Label Effect Ratio (GLER), demo-gain, and label sensitivity, we find that the impact of correct input-label matching can vary across configurations. Expanding upon the previous key finding on the role of demonstrations, these complementary and contrastive results suggest that one might need to take more care when estimating the impact of each component of in-context learning demonstrations.
    NeuralPDE: Modelling Dynamical Systems from Data. (arXiv:2111.07671v2 [cs.LG] UPDATED)
    Many physical processes such as weather phenomena or fluid mechanics are governed by partial differential equations (PDEs). Modelling such dynamical systems using Neural Networks is an active research field. However, current methods are still very limited, as they do not exploit the knowledge about the dynamical nature of the system, require extensive prior knowledge about the governing equations or are limited to linear or first-order equations. In this work we make the observation that the Method of Lines used to solve PDEs can be represented using convolutions which makes convolutional neural networks (CNNs) the natural choice to parametrize arbitrary PDE dynamics. We combine this parametrization with differentiable ODE solvers to form the NeuralPDE Model, which explicitly takes into account the fact that the data is governed by differential equations. We show in several experiments on toy and real-world data that our model consistently outperforms state-of-the-art models used to learn dynamical systems.
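    The core observation -- that a finite-difference spatial operator in the Method of Lines is a convolution -- can be illustrated on the 1-D heat equation: below, the hand-coded stencil plays the role that a learned CNN kernel plays in NeuralPDE, and a simple explicit Euler loop stands in for the differentiable ODE solver.
        import torch
        import torch.nn.functional as F

        # Method of Lines: discretize space, leaving an ODE du/dt = L(u).
        # For the 1-D heat equation u_t = u_xx, the spatial operator is a
        # convolution with the stencil [1, -2, 1] / dx^2 -- exactly the
        # structure a CNN can learn instead of being hand-coded.
        dx, dt, nu = 0.1, 0.001, 1.0
        stencil = torch.tensor([[[1.0, -2.0, 1.0]]]) / dx**2  # (out, in, kernel)

        def rhs(u):                       # u: (batch, 1, n_points)
            return nu * F.conv1d(F.pad(u, (1, 1), mode="circular"), stencil)

        u = torch.exp(-torch.linspace(-3, 3, 64) ** 2).reshape(1, 1, -1)
        for _ in range(100):              # explicit Euler; NeuralPDE would use a
            u = u + dt * rhs(u)           # differentiable ODE solver here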
    Autoformalization with Large Language Models. (arXiv:2205.12615v1 [cs.LG])
    Autoformalization is the process of automatically translating from natural language mathematics to formal specifications and proofs. A successful autoformalization system could advance the fields of formal verification, program synthesis, and artificial intelligence. While the long-term goal of autoformalization seemed elusive for a long time, we show that large language models provide new prospects towards this goal. We make the surprising observation that LLMs can translate a significant portion ($25.3\%$) of mathematical competition problems perfectly into formal specifications in Isabelle/HOL. We demonstrate the usefulness of this process by improving a previously introduced neural theorem prover via training on these autoformalized theorems. Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from $29.6\%$ to $35.2\%$.
    Eliciting Transferability in Multi-task Learning with Task-level Mixture-of-Experts. (arXiv:2205.12701v1 [cs.CL])
    Recent work suggests that transformer models are capable of multi-task learning on diverse NLP tasks. However, the potential of these models may be limited, as they use the same set of parameters for all tasks. In contrast, humans tackle tasks in a more flexible way, by making proper presumptions about what skills and knowledge are relevant and executing only the necessary computations. Inspired by this, we propose to use task-level mixture-of-experts models, which have a collection of transformer layers (i.e., experts) and a router component that chooses among these experts dynamically and flexibly. We show that the learned routing decisions and experts partially rediscover the human categorization of NLP tasks -- certain experts are strongly associated with extractive tasks, some with classification tasks, and some with tasks requiring world knowledge.
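    A toy version of task-level routing might look like the following sketch, where a router picks one expert transformer layer per task id so that all examples of a task share one computation path; the hard argmax, the embedding-based router, and all sizes are illustrative assumptions rather than the paper's design.
        import torch
        import torch.nn as nn

        class TaskLevelMoE(nn.Module):
            """Routes each *task* (not each token) to one expert layer, so all
            examples of a task share the same computation path (a simplified
            sketch; a differentiable router would be needed for training)."""
            def __init__(self, num_tasks, num_experts, d_model=256):
                super().__init__()
                self.task_emb = nn.Embedding(num_tasks, d_model)
                self.router = nn.Linear(d_model, num_experts)
                self.experts = nn.ModuleList(
                    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                    for _ in range(num_experts)
                )
            def forward(self, x, task_id):          # x: (B, T, d_model)
                logits = self.router(self.task_emb(task_id))
                expert = self.experts[int(logits.argmax())]  # hard, task-level choice
                return expert(x)

        moe = TaskLevelMoE(num_tasks=8, num_experts=4)
        out = moe(torch.randn(2, 16, 256), torch.tensor(3))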
    muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems. (arXiv:2205.10937v2 [cs.LG] UPDATED)
    Most uses of machine learning today involve training a model from scratch for a particular task, or sometimes starting with a model pretrained on a related task and then fine-tuning on a downstream task. Both approaches offer limited knowledge transfer between different tasks, time-consuming human-driven customization to individual tasks and high computational costs, especially when starting from randomly initialized models. We propose a method that uses the layers of a pretrained deep neural network as building blocks to construct an ML system that can jointly solve an arbitrary number of tasks. The resulting system can leverage cross-task knowledge transfer, while being immune to common drawbacks of multitask approaches such as catastrophic forgetting, gradient interference and negative transfer. We define an evolutionary approach designed to jointly select the prior knowledge relevant for each task, choose the subset of the model parameters to train and dynamically auto-tune its hyperparameters. Furthermore, a novel scale control method is employed to achieve quality/size trade-offs that outperform common fine-tuning techniques. Compared with standard fine-tuning on a benchmark of 10 diverse image classification tasks, the proposed model improves the average accuracy by 2.39% while using 47% fewer parameters per task.
    Tell me why! Explanations support learning relational and causal structure. (arXiv:2112.03753v3 [cs.LG] UPDATED)
    Inferring the abstract relational and causal structure of the world is a major challenge for reinforcement-learning (RL) agents. For humans, language--particularly in the form of explanations--plays a considerable role in overcoming this challenge. Here, we show that language can play a similar role for deep RL agents in complex environments. While agents typically struggle to acquire relational and causal knowledge, augmenting their experience by training them to predict language descriptions and explanations can overcome these limitations. We show that language can help agents learn challenging relational tasks, and examine which aspects of language contribute to its benefits. We then show that explanations can help agents to infer not only relational but also causal structure. Language can shape the way that agents generalize out-of-distribution from ambiguous, causally-confounded training, and explanations even allow agents to learn to perform experimental interventions to identify causal relationships. Our results suggest that language description and explanation may be powerful tools for improving agent learning and generalization.
    Removing the fat from your posterior samples with margarine. (arXiv:2205.12841v1 [astro-ph.IM])
    Bayesian workflows often require the introduction of nuisance parameters, yet for core science modelling one needs access to a marginal posterior density. In this work we use masked autoregressive flows and kernel density estimators to encapsulate the marginal posterior, allowing us to compute marginal Kullback-Leibler divergences and marginal Bayesian model dimensionalities in addition to generating samples and computing marginal log probabilities. We demonstrate this in application to topical cosmological examples of the Dark Energy Survey, and global 21cm signal experiments. In addition to the computation of marginal Bayesian statistics, this work is important for further applications in Bayesian experimental design, complex prior modelling and likelihood emulation. This technique is made publicly available in the pip-installable code margarine.
    Recipe for a General, Powerful, Scalable Graph Transformer. (arXiv:2205.12454v1 [cs.LG])
    We propose a recipe for how to build a general, powerful, scalable (GPS) graph Transformer with linear complexity and state-of-the-art results on a diverse set of benchmarks. Graph Transformers (GTs) have gained popularity in the field of graph representation learning with a variety of recent publications, but they lack a common foundation about what constitutes a good positional or structural encoding, and what differentiates them. In this paper, we summarize the different types of encodings with a clearer definition and categorize them as being $\textit{local}$, $\textit{global}$ or $\textit{relative}$. Further, GTs remain constrained to small graphs with a few hundred nodes, and we propose the first architecture with a complexity linear in the number of nodes and edges, $O(N+E)$, by decoupling the local real-edge aggregation from the fully-connected Transformer. We argue that this decoupling does not negatively affect the expressivity, with our architecture being a universal function approximator for graphs. Our GPS recipe consists of choosing 3 main ingredients: (i) positional/structural encoding, (ii) local message-passing mechanism, and (iii) global attention mechanism. We build and open-source a modular framework $\textit{GraphGPS}$ that supports multiple types of encodings and that provides efficiency and scalability in both small and large graphs. We test our architecture on 11 benchmarks and show very competitive results on all of them, showcasing the empirical benefits gained by the modularity and the combination of different strategies.
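    The three-ingredient recipe can be caricatured in a few lines: one hybrid layer that sums a local message-passing update over real edges with a global attention update over all nodes. The sketch below uses dense softmax attention and omits the positional/structural encodings for brevity, so it does not achieve the $O(N+E)$ complexity of the actual GraphGPS design.
        import torch
        import torch.nn as nn

        class GPSLayer(nn.Module):
            """One hybrid layer in the GPS spirit: a local message-passing update
            over real edges plus a global attention update over all nodes, summed.
            The paper's linear attention is what yields O(N+E) complexity."""
            def __init__(self, dim):
                super().__init__()
                self.local = nn.Linear(dim, dim)
                self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
                self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                self.norm1 = nn.LayerNorm(dim)
                self.norm2 = nn.LayerNorm(dim)
            def forward(self, x, adj):  # x: (N, dim), adj: (N, N) in {0, 1}
                deg = adj.sum(1, keepdim=True).clamp(min=1)
                h_local = self.local(adj @ x / deg)           # mean over real neighbors
                h_global, _ = self.attn(x[None], x[None], x[None])
                h = self.norm1(x + h_local + h_global[0])     # combine local + global
                return self.norm2(h + self.mlp(h))

        layer = GPSLayer(64)
        out = layer(torch.randn(10, 64), (torch.rand(10, 10) > 0.7).float())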
    Training Heterogeneous Features in Sequence to Sequence Tasks: Latent Enhanced Multi-filter Seq2Seq Model. (arXiv:2105.08840v3 [cs.CL] UPDATED)
    In language processing, training data with extremely large variance may make it difficult for a language model to converge: it is hard for the network parameters to adapt to sentences with largely varied semantics or grammatical structures. To resolve this problem, we introduce a model that concentrates on each of the heterogeneous features in the input sentences. Building upon the encoder-decoder architecture, we design a latent-enhanced multi-filter seq2seq model (LEMS) that analyzes the input representations by introducing a latent space transformation and clustering. The representations are extracted from the final hidden state of the encoder and lie in the latent space. A latent space transformation is applied to enhance the quality of the representations, so that the clustering algorithm can easily separate samples based on the features of these representations. Multiple filters are trained on the features from their corresponding clusters, and the heterogeneity of the training data can be resolved accordingly. We conduct two sets of comparative experiments, on semantic parsing and machine translation, using the Geo-query dataset and Multi30k English-French respectively, to demonstrate the improvements our model achieves.
    RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning. (arXiv:2205.12598v1 [cs.CL])
    Transformers have been shown to be able to perform deductive reasoning on a logical rulebase containing rules and statements written in English natural language. While the progress is promising, it is currently unclear if these models indeed perform logical reasoning by understanding the underlying logical semantics in the language. To this end, we propose RobustLR, a suite of evaluation datasets that evaluate the robustness of these models to minimal logical edits in rulebases and some standard logical equivalence conditions. In our experiments with RoBERTa and T5, we find that the models trained in prior works do not perform consistently on the different perturbations in RobustLR, thus showing that the models are not robust to the proposed logical perturbations. Further, we find that the models find it especially hard to learn logical negation and disjunction operators. Overall, using our evaluation sets, we demonstrate some shortcomings of the deductive reasoning-based language models, which can eventually help towards designing better models for logical reasoning over natural language.
    Boosting Tail Neural Network for Realtime Custom Keyword Spotting. (arXiv:2205.12933v1 [eess.AS])
    In this paper, we propose a Boosting Tail Neural Network (BTNN) for improving the performance of Realtime Custom Keyword Spotting (RCKS), which remains an industrial challenge demanding powerful classification ability with limited computation resources. We are inspired by the neuroscience observation that a brain is only partly activated by a nerve stimulus, and by the many machine learning algorithms that use a batch of weak classifiers to resolve arduous problems, which have often proved effective. We show that this method is helpful for the RCKS problem. The proposed approach achieves better performance in terms of wakeup rate and false alarms; in our experiments, compared with traditional algorithms that use only one strong classifier, it obtains an 18\% relative improvement. We also point out that this approach may be promising for future ASR exploration.  ( 2 min )
    Learning Mean Field Games: A Survey. (arXiv:2205.12944v1 [cs.LG])
    Non-cooperative and cooperative games with a very large number of players have many applications but remain generally intractable when the number of players increases. Introduced by Lasry and Lions, and Huang, Caines and Malham\'e, Mean Field Games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with a full knowledge of the model. Recently, Reinforcement Learning (RL) has appeared promising to solve complex problems. By combining MFGs and RL, we hope to solve games at a very large scale both in terms of population size and environment complexity. In this survey, we review the quickly growing recent literature on RL methods to learn Nash equilibria in MFGs. We first identify the most common settings (static, stationary, and evolutive). We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) to solve MFGs in an exact way. Building on these algorithms and the connection with Markov Decision Processes, we explain how RL can be used to learn MFG solutions in a model-free way. Last, we present numerical illustrations on a benchmark problem, and conclude with some perspectives.
    Optimizing Warfarin Dosing using Deep Reinforcement Learning. (arXiv:2202.03486v2 [cs.LG] UPDATED)
    Warfarin is a widely used anticoagulant, and has a narrow therapeutic range. Dosing of warfarin should be individualized, since slight overdosing or underdosing can have catastrophic or even fatal consequences. Despite much research on warfarin dosing, current dosing protocols do not live up to expectations, especially for patients sensitive to warfarin. We propose a deep reinforcement learning-based dosing model for warfarin. To overcome the issue of relatively small sample sizes in dosing trials, we use a Pharmacokinetic/Pharmacodynamic (PK/PD) model of warfarin to simulate dose-responses of virtual patients. Applying the proposed algorithm on virtual test patients shows that this model outperforms a set of clinically accepted dosing protocols by a wide margin. We tested the robustness of our dosing protocol on a second PK/PD model and showed that its performance is comparable to the set of baseline protocols.  ( 2 min )
    Understanding Programmatic Weak Supervision via Source-aware Influence Function. (arXiv:2205.12879v1 [cs.LG])
    Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels, which are in turn used to train an end model. With its increasing popularity, it is critical to have some tool for users to understand the influence of each component (e.g., the source vote or training data) in the pipeline and interpret the end model behavior. To achieve this, we build on the Influence Function (IF) and propose source-aware IF, which leverages the generation process of the probabilistic labels to decompose the end model's training objective and then calculate the influence associated with each (data, source, class) tuple. These primitive influence scores can then be used to estimate the influence of individual components of PWS, such as the source vote, supervision source, and training data. On datasets from diverse domains, we demonstrate multiple use cases: (1) interpreting incorrect predictions from multiple angles, revealing insights for debugging the PWS pipeline, (2) identifying mislabeling of sources with a gain of 9%-37% over baselines, and (3) improving the end model's generalization performance by removing harmful components in the training objective (13%-24% better than ordinary IF).  ( 2 min )
    Learning Distributions by Generative Adversarial Networks: Approximation and Generalization. (arXiv:2205.12601v1 [cs.LG])
    We study how well generative adversarial networks (GANs) learn probability distributions from finite samples by analyzing the convergence rates of these models. Our analysis is based on a new oracle inequality that decomposes the estimation error of a GAN into the discriminator and generator approximation errors, the generalization error, and the optimization error. To estimate the discriminator approximation error, we establish error bounds on approximating H\"older functions by ReLU neural networks, with explicit upper bounds on the Lipschitz constant of the network or norm constraints on the weights. For the generator approximation error, we show that a neural network can approximately transform a low-dimensional source distribution into a high-dimensional target distribution and bound such approximation error by the width and depth of the neural network. Combining the approximation results with generalization bounds of neural networks from statistical learning theory, we establish the convergence rates of GANs in various settings, when the error is measured by a collection of integral probability metrics defined through H\"older classes, including the Wasserstein distance as a special case. In particular, for distributions concentrated around a low-dimensional set, we show that the convergence rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension.  ( 2 min )
    Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation. (arXiv:2109.06604v2 [cs.CL] UPDATED)
    Recently, $k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor ($k$NN) retrieval to achieve domain adaptation without retraining. Despite being conceptually attractive, it heavily relies on high-quality in-domain parallel corpora, limiting its capability on unsupervised domain adaptation, where in-domain parallel corpora are scarce or nonexistent. In this paper, we propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval. To this end, we first introduce an autoencoder task based on the target language, and then insert lightweight adapters into the original NMT model to map the token-level representation of this task to the ideal representation of translation task. Experiments on multi-domain datasets demonstrate that our proposed approach significantly improves the translation accuracy with target-side monolingual data, while achieving comparable performance with back-translation.
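    The underlying $k$NN-MT interpolation, for which this paper builds its monolingual datastore, can be sketched as follows: retrieve the nearest datastore keys for a decoder state, convert distances into a distribution over the stored target tokens, and mix it with the NMT distribution; the temperature and interpolation weight are illustrative.
        import numpy as np

        def knn_mt_prob(p_nmt, query, keys, values, k=8, temp=10.0, lam=0.5):
            """Interpolate the NMT distribution with a kNN distribution built
            from a (key, target-token) datastore. `keys` are decoder hidden
            states, `values` the token ids they produced (sketch; details
            vary across kNN-MT variants)."""
            d = ((keys - query) ** 2).sum(1)              # squared L2 distances
            idx = np.argsort(d)[:k]
            w = np.exp(-d[idx] / temp); w /= w.sum()      # softmax over -distance
            p_knn = np.zeros_like(p_nmt)
            np.add.at(p_knn, values[idx], w)              # aggregate weight by token id
            return lam * p_knn + (1 - lam) * p_nmt

        vocab, dim = 100, 16
        rng = np.random.default_rng(0)
        keys = rng.normal(size=(1000, dim)); values = rng.integers(0, vocab, 1000)
        p = knn_mt_prob(np.full(vocab, 1 / vocab), rng.normal(size=dim), keys, values)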
    From Noisy Prediction to True Label: Noisy Prediction Calibration via Generative Model. (arXiv:2205.00690v2 [cs.LG] UPDATED)
    Noisy labels are inevitable yet problematic in the machine learning community. They ruin the generalization power of a classifier by causing it to be overfitted to wrong labels. Existing methods for noisy labels have focused on modifying the classifier training procedure, which results in two possible problems. First, these methods are not applicable to a pre-trained classifier without further access to training. Second, it is not easy to train a classifier and remove all of the negative effects of noisy labels simultaneously. Motivated by these problems, we suggest a new branch of approach, Noisy Prediction Calibration (NPC), for learning with noisy labels. Through the introduction and estimation of a new type of transition matrix via a generative model, NPC corrects the noisy prediction from the pre-trained classifier to the true label as a post-processing scheme. We prove that NPC theoretically aligns with the transition-matrix-based methods. Yet, NPC provides a more accurate pathway to estimate the true label, even without involvement in classifier learning. Also, NPC is applicable to any classifier trained with noisy label methods, if the training instances and their predictions are available. Our method, NPC, boosts the classification performance of all baseline models on both synthetic and real-world datasets.  ( 2 min )
    Certified Robustness Against Natural Language Attacks by Causal Intervention. (arXiv:2205.12331v1 [cs.LG])
    Deep learning models have achieved great success in many fields, yet they are vulnerable to adversarial examples. This paper follows a causal perspective to look into the adversarial vulnerability and proposes Causal Intervention by Semantic Smoothing (CISS), a novel framework towards robustness against natural language attacks. Instead of merely fitting observational data, CISS learns causal effects p(y|do(x)) by smoothing in the latent semantic space to make robust predictions, which scales to deep architectures and avoids tedious construction of noise customized for specific attacks. CISS is provably robust against word substitution attacks, as well as empirically robust even when perturbations are strengthened by unknown attack algorithms. For example, on YELP, CISS surpasses the runner-up by 6.7% in terms of certified robustness against word substitutions, and achieves 79.4% empirical robustness when syntactic attacks are integrated.
    Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently. (arXiv:2205.12808v1 [cs.LG])
    Driven by the empirical success and wide use of deep neural networks, understanding the generalization performance of overparameterized models has become an increasingly popular question. To this end, there has been substantial effort to characterize the implicit bias of the optimization algorithms used, such as gradient descent (GD), and the structural properties of their preferred solutions. This paper answers an open question in this literature: For the classification setting, what solution does mirror descent (MD) converge to? Specifically, motivated by its efficient implementation, we consider the family of mirror descent algorithms with potential function chosen as the $p$-th power of the $\ell_p$-norm, which is an important generalization of GD. We call this algorithm $p$-$\textsf{GD}$. For this family, we characterize the solutions it obtains and show that it converges in direction to a generalized maximum-margin solution with respect to the $\ell_p$-norm for linearly separable classification. While the MD update rule is in general expensive to compute and perhaps not suitable for deep learning, $p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can be used to train deep neural networks with virtually no additional computational overhead. Using comprehensive experiments with both linear and deep neural network models, we demonstrate that $p$-$\textsf{GD}$ can noticeably affect the structure and the generalization performance of the learned models.
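    Since the potential $(1/p)\|w\|_p^p$ has an elementwise mirror map with an elementwise inverse, a $p$-$\textsf{GD}$ step costs essentially the same as an SGD step; a NumPy sketch on a toy separable logistic-regression problem (learning rate, $p$, and data are illustrative):
        import numpy as np

        def p_gd_step(w, grad, lr, p=3):
            """One p-GD step: mirror descent with potential (1/p) * ||w||_p^p.
            Both the mirror map and its inverse are elementwise, so the update
            is as parallelizable as plain SGD."""
            z = np.sign(w) * np.abs(w) ** (p - 1)             # mirror map grad-psi(w)
            z = z - lr * grad                                 # descend in mirror space
            return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))  # invert the mirror map

        # Tiny demo on linearly separable logistic regression (illustrative only).
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 10)); y = np.sign(X[:, 0] + 0.5 * X[:, 1])
        w = np.full(10, 1e-3)
        for _ in range(500):
            margins = y * (X @ w)
            grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(0)
            w = p_gd_step(w, grad, lr=0.1, p=3)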
    Graph Neural Networks Designed for Different Graph Types: A Survey. (arXiv:2204.03080v2 [cs.LG] UPDATED)
    Graphs are ubiquitous in nature and can therefore serve as models for many practical but also theoretical problems. Based on this, the young research field of Graph Neural Networks (GNNs) has emerged. Despite the youth of the field and the speed at which new models are developed, many good surveys have been published in recent years. Nevertheless, an overview of which graph types can be modeled by GNNs is missing. In this survey, we give a detailed overview of existing GNNs and, unlike previous surveys, categorize them according to their ability to handle different graph types and properties. We consider GNNs operating on static as well as dynamic graphs of different structural constitutions, with or without node or edge attributes. Moreover, in the dynamic case, we separate the models into discrete-time and continuous-time dynamic graph models based on their architecture. While ordering the existing GNN models, we find that there are still graph types that are not, or only rarely, covered by existing GNN models. We point out where models are missing and give potential reasons for their absence.  ( 2 min )
    RADNet: Ensemble Model for Robust Glaucoma Classification in Color Fundus Images. (arXiv:2205.12902v1 [eess.IV])
    Glaucoma is one of the most severe eye diseases, characterized by rapid progression and leading to irreversible blindness. It is often the case that pathology diagnostics is carried out when one's sight has already significantly degraded, due to the lack of noticeable symptoms at the early stage of the disease. Regular glaucoma screenings of the population would improve early-stage detection, but the desirable frequency of ophthalmological checkups is often not feasible due to the excessive load imposed by manual diagnostics on a limited number of specialists. Considering that the basic methodology to detect glaucoma is to analyze fundus images for the \textit{optic-disc-to-optic-cup ratio}, the Machine Learning domain can offer sophisticated tooling for image processing and classification. In our work, we propose an advanced image pre-processing technique combined with an ensemble of deep classification networks. Our \textit{Retinal Auto Detection (RADNet)} model has been successfully tested on the Rotterdam EyePACS AIROGS train dataset with an AUC of 0.92, and then additionally finetuned and tested on a fraction of the RIM-ONE DL dataset with an AUC of 0.91.  ( 2 min )
    Service Discovery in Social Internet of Things using Graph Neural Networks. (arXiv:2205.12711v1 [cs.LG])
    Internet-of-Things (IoT) networks intelligently connect thousands of physical entities to provide various services for the community. The IoT is witnessing an exponential expansion, which complicates the process of discovering the IoT devices existing in the network and requesting corresponding services from them. As the highly dynamic nature of the IoT environment hinders the use of traditional service discovery solutions, we aim, in this paper, to address this issue by proposing a scalable resource allocation neural model adequate for heterogeneous large-scale IoT networks. We devise a Graph Neural Network (GNN) approach that utilizes the social relationships formed between the devices in the IoT network to reduce the search space of any entity lookup and to acquire a service from another device in the network. This proposed resource allocation approach overcomes standardization issues and embeds the structure and characteristics of the social IoT graph, by means of GNNs, for a subsequent clustering analysis. Simulation results applied on a real-world dataset illustrate the performance of this solution and its significant efficiency for operating on large-scale IoT networks.  ( 2 min )
    Core Challenges in Embodied Vision-Language Planning. (arXiv:2106.13948v4 [cs.LG] UPDATED)
    Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
    Non-stationary Bandits with Knapsacks. (arXiv:2205.12427v1 [cs.LG])
    In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-arm bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, and s/he will receive a reward and consume certain amount of resource from each of the multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to some knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret due to the presence of the constraints, and then we propose a new notion of global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.
    Conditional Gradients for the Approximately Vanishing Ideal. (arXiv:2202.03349v8 [cs.LG] UPDATED)
    The vanishing ideal of a set of points $X\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite set of polynomials called generators. To accommodate the noise in the data set, we introduce the Conditional Gradients Approximately Vanishing Ideal algorithm (CGAVI) for the construction of the set of generators of the approximately vanishing ideal. The constructed set of generators captures polynomial structures in data and gives rise to a feature map that can, for example, be used in combination with a linear classifier for supervised learning. In CGAVI, we construct the set of generators by solving specific instances of (constrained) convex optimization problems with the Pairwise Frank-Wolfe algorithm (PFW). Among other things, the constructed generators inherit the LASSO generalization bound and vanish not only on the training data but also on out-of-sample data. Moreover, CGAVI admits a compact representation of the approximately vanishing ideal by constructing few generators with sparse coefficient vectors.  ( 2 min )
    Women, artificial intelligence, and key positions in collaboration networks: Towards a more equal scientific ecosystem. (arXiv:2205.12339v1 [cs.SI])
    Scientific collaboration in almost every discipline is mainly driven by the need to share knowledge, expertise, and pooled resources. Science is becoming more complex, which has encouraged scientists to engage more in collaborative research projects in order to better address the challenges. As a highly interdisciplinary field with a rapidly evolving scientific landscape, artificial intelligence calls for researchers with special profiles covering a diverse set of skills and expertise. Understanding the gender aspects of scientific collaboration is of paramount importance, especially in a field such as artificial intelligence that has been attracting large investments. Using social network analysis, natural language processing, and machine learning, and focusing on artificial intelligence publications for the period from 2000 to 2019, in this work we comprehensively investigated the effects of several driving factors on acquiring key positions in scientific collaboration networks through a gender lens. It was found that, regardless of gender, scientific performance in terms of quantity and impact plays a crucial role in attaining the "social researcher" position in the network. However, subtle differences were observed between female and male researchers in acquiring the "local influencer" role.  ( 2 min )
    A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement Learning. (arXiv:2201.01221v2 [cs.LG] UPDATED)
    Centralized Training for Decentralized Execution, where training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one which has had limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition. Finally, we show the effects of the theories in practice by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties are related to the effectiveness of different types of critics.
    Fast Stochastic Composite Minimization and an Accelerated Frank-Wolfe Algorithm under Parallelization. (arXiv:2205.12751v1 [math.OC])
    We consider the problem of minimizing the sum of two convex functions. One of those functions has Lipschitz-continuous gradients, and can be accessed via stochastic oracles, whereas the other is "simple". We provide a Bregman-type algorithm with accelerated convergence in function values to a ball containing the minimum. The radius of this ball depends on problem-dependent constants, including the variance of the stochastic oracle. We further show that this algorithmic setup naturally leads to a variant of Frank-Wolfe achieving acceleration under parallelization. More precisely, when minimizing a smooth convex function on a bounded domain, we show that one can achieve an $\epsilon$ primal-dual gap (in expectation) in $\tilde{O}(1/ \sqrt{\epsilon})$ iterations, by only accessing gradients of the original function and a linear maximization oracle with $O(1/\sqrt{\epsilon})$ computing units in parallel. We illustrate this fast convergence on synthetic numerical experiments.
    Is a Question Decomposition Unit All We Need?. (arXiv:2205.12538v1 [cs.CL])
    Large Language Models (LMs) have achieved state-of-the-art performance on many Natural Language Processing (NLP) benchmarks. With the growing number of new benchmarks, we build bigger and more complex LMs. However, building new LMs may not be an ideal option owing to the cost, time and environmental impact associated with it. We explore an alternative route: can we modify data by expressing it in terms of the model's strengths, so that a question becomes easier for models to answer? We investigate if humans can decompose a hard question into a set of simpler questions that are relatively easier for models to solve. We analyze a range of datasets involving various forms of reasoning and find that it is indeed possible to significantly improve model performance (24% for GPT3 and 29% for RoBERTa-SQuAD along with a symbolic calculator) via decomposition. Our approach provides a viable option to involve people in NLP research in a meaningful way. Our findings indicate that Human-in-the-loop Question Decomposition (HQD) can potentially provide an alternate path to building large LMs.
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v2 [stat.ML] UPDATED)
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent "causal bandit" algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
    Transportation-Inequalities, Lyapunov Stability and Sampling for Dynamical Systems on Continuous State Space. (arXiv:2205.12448v1 [stat.ML])
    We study the concentration phenomenon for discrete-time random dynamical systems with an unbounded state space. We develop a heuristic approach towards obtaining exponential concentration inequalities for dynamical systems using an entirely functional analytic framework. We also show that existence of exponential-type Lyapunov function, compared to the purely deterministic setting, not only implies stability but also exponential concentration inequalities for sampling from the stationary distribution, via \emph{transport-entropy inequality} (T-E). These results have significant impact in \emph{reinforcement learning} (RL) and \emph{controls}, leading to exponential concentration inequalities even for unbounded observables, while neither assuming reversibility nor exact knowledge of random dynamical system (assumptions at heart of concentration inequalities in statistical mechanics and Markov diffusion processes).
    FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizers by Exploiting Strong Convexity. (arXiv:2104.13790v3 [cs.LG] UPDATED)
    AdaBelief, one of the current best optimizers, demonstrates superior generalization ability over the popular Adam algorithm by treating the exponential moving average of observed gradients as a prediction of the next gradient and adapting the step size accordingly. AdaBelief is theoretically appealing in that it has a data-dependent $O(\sqrt{T})$ regret bound when objective functions are convex, where $T$ is a time horizon. It remains, however, an open problem whether the convergence rate can be further improved without sacrificing its generalization ability. To this end, we make a first attempt in this work and design a novel optimization algorithm called FastAdaBelief that exploits strong convexity in order to achieve an even faster convergence rate. In particular, by adjusting the step size in a way that better accounts for strong convexity and prevents fluctuation, our proposed FastAdaBelief demonstrates excellent generalization ability as well as superior convergence. As an important theoretical contribution, we prove that FastAdaBelief attains a data-dependent $O(\log T)$ regret bound, which is substantially lower than that of AdaBelief. On the empirical side, we validate our theoretical analysis with extensive experiments in both strongly and non-strongly convex scenarios on three popular baseline models. The experimental results are very encouraging: FastAdaBelief converges the quickest in comparison to all mainstream algorithms while maintaining excellent generalization ability, in cases of both strong and non-strong convexity. FastAdaBelief is thus posited as a new benchmark model for the research community.  ( 2 min )
    Detecting Multi-Sensor Fusion Errors in Advanced Driver-Assistance Systems. (arXiv:2109.06404v3 [cs.RO] UPDATED)
    Advanced Driver-Assistance Systems (ADAS) have been thriving and widely deployed in recent years. In general, these systems receive sensor data, compute driving decisions, and output control signals to the vehicles. To smooth out the uncertainties brought by sensor outputs, they usually leverage multi-sensor fusion (MSF) to fuse the sensor outputs and produce a more reliable understanding of the surroundings. However, MSF cannot completely eliminate the uncertainties since it lacks the knowledge about which sensor provides the most accurate data and how to optimally integrate the data provided by the sensors. As a result, critical consequences might happen unexpectedly. In this work, we observed that the popular MSF methods in an industry-grade ADAS can mislead the car control and result in serious safety hazards. We define the failures (e.g., car crashes) caused by the faulty MSF as fusion errors and develop a novel evolutionary-based domain-specific search framework, FusED, for the efficient detection of fusion errors. We further apply causality analysis to show that the found fusion errors are indeed caused by the MSF method. We evaluate our framework on two widely used MSF methods in two driving environments. Experimental results show that FusED identifies more than 150 fusion errors. Finally, we provide several suggestions to improve the MSF methods we study.
    Label Leakage and Protection from Forward Embedding in Vertical Federated Learning. (arXiv:2203.01451v3 [cs.LG] UPDATED)
    Vertical federated learning (vFL) has gained much attention and been deployed to solve machine learning problems with data privacy concerns in recent years. However, some recent work demonstrated that vFL is vulnerable to privacy leakage even though only the forward intermediate embedding (rather than raw features) and backpropagated gradients (rather than raw labels) are communicated between the involved participants. As the raw labels often contain highly sensitive information, some recent works have proposed methods to effectively prevent label leakage from the backpropagated gradients in vFL. However, these works only identified and defended against the threat of label leakage from the backpropagated gradients; none of them has paid attention to the problem of label leakage from the intermediate embedding. In this paper, we propose a practical label inference method which can steal private labels effectively from the shared intermediate embedding even when existing protection methods such as label differential privacy and gradient perturbation are applied. The effectiveness of the label attack is inseparable from the correlation between the intermediate embedding and the corresponding private labels. To mitigate the issue of label leakage from the forward embedding, we add an additional optimization goal at the label party that limits the label-stealing ability of the adversary by minimizing the distance correlation between the intermediate embedding and the corresponding private labels. We conducted extensive experiments to demonstrate the effectiveness of our proposed protection methods.  ( 2 min )
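    The protection objective hinges on a distance-correlation penalty between the forward embedding and the private labels; a self-contained PyTorch sketch of the standard sample estimator (our reading, not the paper's code; the squared version suffices as a penalty) is:
        import torch

        def distance_correlation_sq(x, y, eps=1e-12):
            """Squared sample distance correlation between embeddings x: (B, d)
            and labels y: (B, k); minimizing it strips label signal from the
            forward embedding."""
            def centered_dist(z):
                d = torch.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1) + eps)
                return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()
            a, b = centered_dist(x), centered_dist(y)
            dcov2 = (a * b).mean()
            dvar = torch.sqrt((a * a).mean() * (b * b).mean()).clamp(min=eps)
            return dcov2 / dvar

        # Label party: total_loss = task_loss + beta * distance_correlation_sq(emb, y_onehot)
        emb = torch.randn(32, 16, requires_grad=True)
        y = torch.nn.functional.one_hot(torch.randint(0, 2, (32,)), 2).float()
        distance_correlation_sq(emb, y).backward()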
    Interpretable Feature Engineering for Time Series Predictors using Attention Networks. (arXiv:2205.12723v1 [cs.LG])
    Regression problems with time-series predictors are common in banking and many other areas of application. In this paper, we use multi-head attention networks to develop interpretable features and use them to achieve good predictive performance. The customized attention layer explicitly uses multiplicative interactions and builds feature-engineering heads that capture temporal dynamics in a parsimonious manner. Convolutional layers are used to combine multivariate time series. We also discuss methods for handling static covariates in the modeling process. Visualization and explanation tools are used to interpret the results and explain the relationship between the inputs and the extracted features. Both simulated and real datasets are used to illustrate the usefulness of the methodology. Keywords: Attention heads, Deep neural networks, Interpretable feature engineering  ( 2 min )
    Learning Mixtures of Linear Dynamical Systems. (arXiv:2201.11211v2 [stat.ML] UPDATED)
    We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
    Structured Uncertainty in the Observation Space of Variational Autoencoders. (arXiv:2205.12533v1 [cs.LG])
    Variational autoencoders (VAEs) are a popular class of deep generative models with many variants and a wide range of applications. Improvements upon the standard VAE mostly focus on the modelling of the posterior distribution over the latent space and the properties of the neural network decoder. In contrast, improving the model for the observational distribution is rarely considered and typically defaults to a pixel-wise independent categorical or normal distribution. In image synthesis, sampling from such distributions produces spatially-incoherent results with uncorrelated pixel noise, resulting in only the sample mean being somewhat useful as an output prediction. In this paper, we aim to stay true to VAE theory by improving the samples from the observational distribution. We propose an alternative model for the observation space, encoding spatial dependencies via a low-rank parameterisation. We demonstrate that this new observational distribution has the ability to capture relevant covariance between pixels, resulting in spatially-coherent samples. In contrast to pixel-wise independent distributions, our samples seem to contain semantically meaningful variations from the mean, allowing the prediction of multiple plausible outputs with a single forward pass.
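    In PyTorch terms, such a low-rank-plus-diagonal observation model is directly expressible with LowRankMultivariateNormal. A minimal sketch with illustrative sizes, assuming the decoder emits a mean, a rank-r covariance factor and a per-pixel variance:

        import torch
        from torch.distributions import LowRankMultivariateNormal

        # For one image of P pixels the covariance is W @ W.T + diag(d):
        # low-rank plus diagonal instead of the usual independent-pixel diagonal.
        P, r = 32 * 32, 10                 # hypothetical image size and rank
        mean = torch.zeros(P)              # stand-in for the decoder's mean head
        W = 0.1 * torch.randn(P, r)        # stand-in covariance-factor head
        d = 0.01 * torch.ones(P)           # stand-in per-pixel variance head

        obs_dist = LowRankMultivariateNormal(mean, cov_factor=W, cov_diag=d)
        sample = obs_dist.sample()                  # spatially correlated sample
        log_px = obs_dist.log_prob(torch.zeros(P))  # likelihood term for the ELBO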
    Global geomagnetic perturbation forecasting using Deep Learning. (arXiv:2205.12734v1 [physics.space-ph])
    Geomagnetically Induced Currents (GICs) arise from spatio-temporal changes to Earth's magnetic field driven by the interaction of the solar wind with Earth's magnetosphere, and can cause catastrophic damage to our technologically dependent society. Hence, computational models that forecast GICs globally with a large forecast horizon, high spatial resolution and high temporal cadence are increasingly important for prompt mitigation. Since GIC data is proprietary, the time variability of the horizontal component of the magnetic field perturbation (dB/dt) is used as a proxy for GICs. In this work, we develop a fast, global dB/dt forecasting model that forecasts 30 minutes into the future using only solar wind measurements as input. The model summarizes 2 hours of solar wind measurements using a Gated Recurrent Unit and generates forecasts of coefficients that are folded with a spherical harmonic basis to enable global forecasts. When deployed, our model produces results in under a second and generates global forecasts for horizontal magnetic perturbation components at 1-minute cadence. We evaluate our model against models from the literature for two specific storms, on 5 August 2011 and 17 March 2015, alongside a self-consistent benchmark model set. Our model outperforms, or performs consistently with, state-of-the-practice high-cadence local and low-cadence global models, while also outperforming or matching the benchmark models. Such quick inference at high temporal cadence and arbitrary spatial resolution may ultimately enable accurate forewarning of dB/dt for any place on Earth, allowing precautionary measures to be taken in an informed manner.
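    A minimal sketch of this architecture, with all sizes and the basis matrix as stand-ins (a real implementation would evaluate spherical harmonics Y_lm on the output grid):

        import torch
        import torch.nn as nn

        class GlobalDbdtForecaster(nn.Module):
            # GRU summarizes a solar-wind window, a linear head emits
            # spherical-harmonic coefficients, and a fixed basis matrix
            # maps the coefficients to a global grid.
            def __init__(self, n_inputs=5, hidden=64, n_coeffs=36, sh_basis=None):
                super().__init__()
                self.gru = nn.GRU(n_inputs, hidden, batch_first=True)
                self.head = nn.Linear(hidden, n_coeffs)
                # (n_grid_points, n_coeffs) precomputed spherical-harmonic basis
                self.register_buffer("basis", sh_basis)

            def forward(self, solar_wind):        # (batch, 120, n_inputs): 2 h at 1 min
                _, h = self.gru(solar_wind)
                coeffs = self.head(h[-1])          # (batch, n_coeffs)
                return coeffs @ self.basis.T       # (batch, n_grid_points)

        basis = torch.randn(1800, 36)              # stand-in for a real Y_lm grid
        model = GlobalDbdtForecaster(sh_basis=basis)
        forecast = model(torch.randn(8, 120, 5))   # global dB/dt proxy forecast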
    Label Leakage and Protection in Two-party Split Learning. (arXiv:2102.08504v3 [cs.LG] UPDATED)
    Two-party split learning is a popular technique for learning a model across feature-partitioned data. In this work, we explore whether it is possible for one party to steal the private label information from the other party during split training, and whether there are methods that can protect against such attacks. Specifically, we first formulate a realistic threat model and propose a privacy loss metric to quantify label leakage in split learning. We then show that there exist two simple yet effective methods within the threat model that can allow one party to accurately recover private ground-truth labels owned by the other party. To combat these attacks, we propose several random perturbation techniques, including $\texttt{Marvell}$, an approach that strategically finds the structure of the noise perturbation by minimizing the amount of label leakage (measured through our quantification metric) of a worst-case adversary. We empirically demonstrate the effectiveness of our protection techniques against the identified attacks, and show that $\texttt{Marvell}$ in particular has improved privacy-utility tradeoffs relative to baseline approaches.
    Domain Adaptation via Maximizing Surrogate Mutual Information. (arXiv:2110.12184v2 [cs.LG] UPDATED)
    Unsupervised domain adaptation (UDA) aims to predict unlabeled data from the target domain with access to labeled data from the source domain. In this work, we propose a novel framework called SIDA (Surrogate Mutual Information Maximization Domain Adaptation) with strong theoretical guarantees. To be specific, SIDA implements adaptation by maximizing mutual information (MI) between features. In the framework, a surrogate joint distribution models the underlying joint distribution of the unlabeled target domain. Our theoretical analysis validates SIDA by bounding the expected risk on the target domain with MI and the surrogate distribution bias. Experiments show that our approach is comparable with state-of-the-art unsupervised adaptation methods on standard UDA tasks.
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v3 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework targets the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where exact inference of latent variables is typically intractable. A recent line of work leverages variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space often lacks physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modeling framework for unsupervised learning and identification of nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential for enhancing and generalizing the predictive capabilities of deep learning-based models.
    Reward Uncertainty for Exploration in Preference-based Reinforcement Learning. (arXiv:2205.12401v1 [cs.LG])
    Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL methods are able to learn a more flexible reward model based on human preferences by actively incorporating human feedback, i.e., a teacher's preferences between two clips of behavior. However, poor feedback efficiency remains a problem in current preference-based RL algorithms, as tailored human feedback is very expensive. To handle this issue, previous methods have mainly focused on improving query selection and policy initialization. At the same time, recent exploration methods have proven to be a recipe for improving sample efficiency in RL. We present an exploration method specifically for preference-based RL algorithms. Our main idea is to design an intrinsic reward that measures novelty based on the learned reward. Specifically, we utilize disagreement across an ensemble of learned reward models. Our intuition is that disagreement among the learned reward models reflects uncertainty in the tailored human feedback and can be useful for exploration. Our experiments show that the exploration bonus from uncertainty in the learned reward improves both the feedback and sample efficiency of preference-based RL algorithms on complex robot manipulation tasks from the MetaWorld benchmarks, compared with other existing exploration methods that measure the novelty of state visitation.
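    The exploration bonus itself is simple to sketch: score each transition by how much an ensemble of learned reward models disagrees on it. The mixing coefficient beta below is an assumed hyperparameter, not a value from the paper.

        import torch

        def intrinsic_reward(reward_ensemble, state_action):
            # Exploration bonus: standard deviation of predictions across an
            # ensemble of learned reward models (high where models disagree).
            with torch.no_grad():
                preds = torch.stack([r(state_action) for r in reward_ensemble])
            return preds.std(dim=0)

        def total_reward(reward_ensemble, state_action, beta=0.1):
            # Mix the mean learned reward with the disagreement bonus.
            with torch.no_grad():
                mean_r = torch.stack([r(state_action)
                                      for r in reward_ensemble]).mean(0)
            return mean_r + beta * intrinsic_reward(reward_ensemble, state_action)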
    Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT. (arXiv:2205.12399v1 [cs.LG])
    We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. The Sparse Mixer slightly outperforms (<1%) BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster. We also present a faster variant, prosaically named Fast Sparse Mixer, that marginally underperforms (<0.2%) BERT on SuperGLUE, but trains and runs nearly twice as fast: 89% faster training and 98% faster inference. We justify the design of these two models by carefully ablating through various mixing mechanisms, MoE configurations and model hyperparameters. The Sparse Mixer overcomes many of the latency and stability concerns of MoE models and offers the prospect of serving sparse student models, without resorting to distilling them to dense variants.
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v1 [cs.LG])
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs.~gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.
    GuardNN: Secure Accelerator Architecture for Privacy-Preserving Deep Learning. (arXiv:2008.11632v2 [cs.CR] UPDATED)
    This paper proposes GuardNN, a secure DNN accelerator that provides hardware-based protection for user data and model parameters even in an untrusted environment. GuardNN shows that the architecture and protection can be customized for a specific application to provide strong confidentiality and integrity guarantees with negligible overhead. The design of the GuardNN instruction set reduces the TCB to just the accelerator and allows confidentiality protection even when the instructions from a host cannot be trusted. GuardNN minimizes the overhead of memory encryption and integrity verification by customizing the off-chip memory protection for the known memory access patterns of a DNN accelerator. GuardNN is prototyped on an FPGA, demonstrating effective confidentiality protection with ~3% performance overhead for inference.
    On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity. (arXiv:2205.12642v1 [stat.ML])
    Most complex machine learning and modelling techniques are prone to over-fitting and may subsequently generalise poorly to future data. Artificial neural networks are no different in this regard and, despite having a level of implicit regularisation when trained with gradient descent, often require the aid of explicit regularisers. We introduce a new framework, Model Gradient Similarity (MGS), that (1) serves as a metric of regularisation, which can be used to monitor neural network training, (2) adds insight into how explicit regularisers, while derived from widely different principles, operate via the same mechanism underneath by increasing MGS, and (3) provides the basis for a new regularisation scheme which exhibits excellent performance, especially in challenging settings such as high levels of label noise or limited sample sizes.
    Convolutional Neural Processes for Inpainting Satellite Images. (arXiv:2205.12407v1 [cs.CV])
    The widespread availability of satellite images has allowed researchers to model complex systems such as disease dynamics. However, many satellite images have missing values due to measurement defects, which render them unusable without data imputation. For example, the scanline corrector for the LANDSAT 7 satellite broke down in 2003, resulting in a loss of around 20\% of its data. Inpainting involves predicting what is missing based on the known pixels and is an old problem in image processing, classically based on PDEs or interpolation methods, though recent deep learning approaches have shown promise. However, many of these methods do not explicitly take into account the inherent spatiotemporal structure of satellite images. In this work, we cast satellite image inpainting as a natural meta-learning problem, and propose using convolutional neural processes (ConvNPs), where we frame each satellite image as its own task or 2D regression problem. We show ConvNPs can outperform classical methods and state-of-the-art deep learning inpainting models on a scanline inpainting problem for LANDSAT 7 satellite images, assessed on a variety of in- and out-of-distribution images.
    Learning from time-dependent streaming data with online stochastic algorithms. (arXiv:2205.12549v1 [cs.LG])
    We study stochastic algorithms in a streaming framework, trained on samples coming from a dependent data source. In this streaming framework, we analyze the convergence of Stochastic Gradient (SG) methods in a non-asymptotic manner; this includes various SG methods such as the well-known stochastic gradient descent (i.e., the Robbins-Monro algorithm) and mini-batch SG methods, together with their Polyak-Ruppert averaged estimates. Our results form a heuristic that links the level of dependence and convexity to the remaining model parameters. This heuristic provides new insights into choosing the optimal learning rate, which can help increase the stability of SG-based methods; these investigations suggest large streaming batches with slowly decaying learning rates for highly dependent data sources.
    Stochastic Second-Order Methods Provably Beat SGD For Gradient-Dominated Functions. (arXiv:2205.12856v1 [cs.LG])
    We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying gradient dominance property which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that SCRN improves the best-known sample complexity of stochastic gradient descent in achieving $\epsilon$-global optimum by a factor of $\mathcal{O}(\epsilon^{-1/2})$. Even under a weak version of gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the sample complexity of SCRN can be improved by a factor of ${\mathcal{O}}(\epsilon^{-1/2})$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.
    Learning the Travelling Salesperson Problem Requires Rethinking Generalization. (arXiv:2006.07054v6 [cs.LG] UPDATED)
    End-to-end training of neural network solvers for graph combinatorial optimization problems such as the Travelling Salesperson Problem (TSP) has seen a surge of interest recently, but remains intractable and inefficient beyond graphs with a few hundred nodes. While state-of-the-art learning-driven approaches for TSP perform closely to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances at practical scales. This work presents an end-to-end neural combinatorial optimization pipeline that unifies several recent papers in order to identify the inductive biases, model architectures and learning algorithms that promote generalization to instances larger than those seen in training. Our controlled experiments provide the first principled investigation into such zero-shot generalization, revealing that extrapolating beyond training data requires rethinking the neural combinatorial optimization pipeline, from network layers and learning paradigms to evaluation protocols. Additionally, we analyze recent advances in deep learning for routing problems through the lens of our pipeline and provide new directions to stimulate future research.
    Hardness of Maximum Likelihood Learning of DPPs. (arXiv:2205.12377v1 [cs.CC])
    Determinantal Point Processes (DPPs) are a widely used probabilistic model for negatively correlated sets. DPPs have been successfully employed in Machine Learning applications to select a diverse, yet representative subset of data. In seminal work on DPPs in Machine Learning, Kulesza conjectured in his PhD thesis (2011) that maximum-likelihood learning of DPPs is NP-complete. The lack of a formal proof prompted Brunel, Moitra, Rigollet and Urschel (COLT 2017) to conjecture that, in opposition to Kulesza's conjecture, there exists a polynomial-time algorithm for computing a maximum-likelihood DPP. They also presented some preliminary evidence supporting their conjecture. In this work we prove Kulesza's conjecture. In fact, we prove the following stronger hardness of approximation result: even computing a $\left(1-O(\frac{1}{\log^9{N}})\right)$-approximation to the maximum log-likelihood of a DPP on a ground set of $N$ elements is NP-complete. At the same time, we also obtain the first polynomial-time algorithm that achieves a nontrivial worst-case approximation to the optimal log-likelihood: the approximation factor is $\frac{1}{(1+o(1))\log{m}}$ unconditionally (for data sets that consist of $m$ subsets), and can be improved to $1-\frac{1+o(1)}{\log N}$ if all $N$ elements appear in a $O(1/N)$-fraction of the subsets. In terms of techniques, we reduce approximating the maximum log-likelihood of DPPs on a data set to solving a gap instance of a "vector coloring" problem on a hypergraph. Such a hypergraph is built on a bounded-degree graph construction of Bogdanov, Obata and Trevisan (FOCS 2002), and is further enhanced by the strong expanders of Alon and Capalbo (FOCS 2007) to serve our purposes.
    Action Recognition for American Sign Language. (arXiv:2205.12261v1 [cs.CV])
    In this research, we present our findings on recognizing American Sign Language from series of hand gestures. While most research in the literature focuses only on static handshapes, our work targets dynamic hand gestures. Since datasets of dynamic signs are scarce, we collected an initial dataset of 150 videos for 10 signs and an extension of 225 videos for 15 signs. We apply transfer learning models in combination with deep neural networks and background subtraction for videos in different temporal settings. Our preliminary results show that we can reach accuracies of $0.86$ and $0.71$ using DenseNet201 and LSTM, respectively, with video sequences of 12 frames.
    Learning JPEG Compression Artifacts for Image Manipulation Detection and Localization. (arXiv:2108.12947v2 [eess.IV] UPDATED)
    Detecting and localizing image manipulation are necessary to counter malicious use of image editing techniques. Accordingly, it is essential to distinguish between authentic and tampered regions by analyzing intrinsic statistics in an image. We focus on JPEG compression artifacts left during image acquisition and editing. We propose a convolutional neural network (CNN) that uses discrete cosine transform (DCT) coefficients, where compression artifacts remain, to localize image manipulation. Standard CNNs cannot learn the distribution of DCT coefficients because the convolution throws away the spatial coordinates, which are essential for DCT coefficients. We illustrate how to design and train a neural network that can learn the distribution of DCT coefficients. Furthermore, we introduce Compression Artifact Tracing Network (CAT-Net) that jointly uses image acquisition artifacts and compression artifacts. It significantly outperforms traditional and deep neural network-based methods in detecting and localizing tampered regions.
    Learning to Model Editing Processes. (arXiv:2205.12374v1 [cs.CL])
    Most existing sequence generation models produce outputs in one pass, usually left-to-right. However, this is in contrast with a more natural approach that humans use in generating content: iterative refinement and editing. Recent work has introduced edit-based models for various tasks (such as neural machine translation and text style transfer), but these generally model a single edit step. In this work, we propose modeling editing processes, modeling the whole process of iteratively generating sequences. We form a conceptual framework to describe the likelihood of multi-step edits, and describe neural models that can learn a generative model of sequences based on these multi-step edits. We introduce baseline results and metrics on this task, finding that modeling editing processes improves performance on a variety of axes on both our proposed task and related downstream tasks compared to previous single-step models of edits.
    AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models. (arXiv:2205.12410v1 [cs.CL])
    Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters. This not only increases the serving cost, since a large copy of the model weights must be stored for every task, but also causes instability during few-shot task adaptation. Parameter-efficient techniques have been developed that tune small trainable components (e.g., adapters) injected into the large model while keeping most of the model weights frozen. The prevalent mechanism for increasing adapter capacity is to increase the bottleneck dimension, which increases the adapter parameters. In this work, we introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost, using two key techniques. (i) We introduce multiple shared adapter components in each layer of the Transformer architecture. We leverage sparse learning via random routing to update the adapter parameters (the encoder is kept frozen), resulting in the same computational cost (FLOPs) as training a single adapter. (ii) We propose a simple merging mechanism that averages the weights of multiple adapter components to collapse them into a single adapter in each Transformer layer, thereby keeping the overall parameter count the same while providing significant performance improvements. We demonstrate that these techniques work well across multiple task settings, including fully supervised and few-shot Natural Language Understanding tasks. By tuning only 0.23% of a pre-trained language model's parameters, our model outperforms full model fine-tuning and several competing methods.
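    A minimal sketch of the two mechanisms with illustrative sizes (the real method operates inside each Transformer layer): random routing among bottleneck adapters during training, then weight averaging to collapse them into one adapter.

        import copy
        import random
        import torch
        import torch.nn as nn

        class AdapterMixture(nn.Module):
            def __init__(self, dim=768, bottleneck=16, n_adapters=4):
                super().__init__()
                make = lambda: nn.Sequential(nn.Linear(dim, bottleneck),
                                             nn.ReLU(),
                                             nn.Linear(bottleneck, dim))
                self.adapters = nn.ModuleList(make() for _ in range(n_adapters))

            def forward(self, x):
                if self.training:                  # stochastic routing per call
                    return x + random.choice(self.adapters)(x)
                return x + self.merged(x)          # collapsed single adapter

            @torch.no_grad()
            def merge(self):
                # Average the weights of all adapters into one; call once
                # after training, before switching to eval mode.
                self.merged = copy.deepcopy(self.adapters[0])
                for p_avg, *ps in zip(self.merged.parameters(),
                                      *(a.parameters() for a in self.adapters)):
                    p_avg.copy_(torch.stack([p.detach() for p in ps]).mean(0))

        layer = AdapterMixture()
        layer.train(); _ = layer(torch.randn(2, 10, 768))  # routed adapter
        layer.merge(); layer.eval()
        _ = layer(torch.randn(2, 10, 768))                 # averaged adapter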
    Over-the-Air Design of GAN Training for mmWave MIMO Channel Estimation. (arXiv:2205.12445v1 [eess.SP])
    Future wireless systems are trending towards higher carrier frequencies that offer larger communication bandwidth but necessitate the use of large antenna arrays. Existing signal processing techniques for channel estimation do not scale well to this "high-dimensional" regime in terms of performance and pilot overhead. Meanwhile, training deep learning based approaches for channel estimation requires large labeled datasets mapping pilot measurements to clean channel realizations, which can only be generated offline using simulated channels. In this paper, we develop a novel unsupervised over-the-air (OTA) algorithm that utilizes noisy received pilot measurements to train a deep generative model to output beamspace MIMO channel realizations. Our approach leverages Generative Adversarial Networks (GAN), while using a conditional input to distinguish between Line-of-Sight (LOS) and Non-Line-of-Sight (NLOS) channel realizations. We also present a federated implementation of the OTA algorithm that distributes the GAN training over multiple users and greatly reduces the user side computation. We then formulate channel estimation from a limited number of pilot measurements as an inverse problem and reconstruct the channel by optimizing the input vector of the trained generative model. Our proposed approach significantly outperforms Orthogonal Matching Pursuit on both LOS and NLOS channel models, and EM-GM-AMP -- an Approximate Message Passing algorithm -- on LOS channel models, while achieving comparable performance on NLOS channel models in terms of the normalized channel reconstruction error. More importantly, our proposed framework has the potential to be trained online using real noisy pilot measurements, is not restricted to a specific channel model and can even be utilized for a federated OTA design of a dataset generator from noisy data.
    VeriFi: Towards Verifiable Federated Unlearning. (arXiv:2205.12709v1 [cs.CR])
    Federated learning (FL) is a collaborative learning paradigm where participants jointly train a powerful model without sharing their private data. One desirable property for FL is the implementation of the right to be forgotten (RTBF), i.e., a leaving participant has the right to request to delete its private data from the global model. However, unlearning itself may not be enough to implement RTBF unless the unlearning effect can be independently verified, an important aspect that has been overlooked in the current literature. In this paper, we put forward the concept of verifiable federated unlearning and propose VeriFi, a unified framework integrating federated unlearning and verification that allows systematic analysis of the unlearning and quantification of its effect, with different combinations of multiple unlearning and verification methods. In VeriFi, the leaving participant is granted the right to verify (RTV), that is, the participant notifies the server before leaving, then actively verifies the unlearning effect in the next few communication rounds. The unlearning is done at the server side immediately after receiving the leaving notification, while the verification is done locally by the leaving participant via two steps: marking (injecting carefully-designed markers to fingerprint the leaver) and checking (examining the change of the global model's performance on the markers). Based on VeriFi, we conduct the first systematic and large-scale study for verifiable federated unlearning, considering 7 unlearning methods and 5 verification methods. In particular, we propose a more efficient and FL-friendly unlearning method, and two more effective and robust non-invasive verification methods. We extensively evaluate VeriFi on 7 datasets and 4 types of deep learning models. Our analysis establishes important empirical understandings for more trustworthy federated unlearning.
    A Kernel Stein Test for Comparing Latent Variable Models. (arXiv:1907.00586v4 [stat.ML] UPDATED)
    We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure.
    MAPLE: Microprocessor A Priori for Latency Estimation. (arXiv:2111.15106v2 [cs.LG] UPDATED)
    Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption. As such, neural architecture search (NAS) algorithms take these two constraints into account when generating a new architecture. However, efficiency metrics such as latency are typically hardware dependent, requiring the NAS algorithm to either measure or predict the architecture's latency. Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process. Here we propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation but instead generalizes to new hardware by incorporating prior hardware characteristics during training. MAPLE takes advantage of a novel quantitative strategy to characterize the underlying microprocessor by measuring relevant hardware performance metrics, yielding a fine-grained and expressive hardware descriptor. Moreover, MAPLE exploits the tightly coupled I/O between the CPU and GPU by measuring microprocessor performance counters on the CPU that feeds the GPU in order to predict DNN latency on the GPU. Through this quantitative strategy as the hardware descriptor, MAPLE can generalize to new hardware via a few-shot adaptation strategy: with as few as 3 samples, it exhibits a 6% improvement over state-of-the-art methods that require as many as 10 samples. Increasing the number of few-shot adaptation samples to 10 improves accuracy over the state-of-the-art methods by 12%. Furthermore, MAPLE exhibits 8-10% better accuracy, on average, than relevant baselines at any number of adaptation samples.
    When Is Partially Observable Reinforcement Learning Not Scary?. (arXiv:2204.08967v2 [cs.LG] UPDATED)
    Applications of Reinforcement Learning (RL) in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system (that is, they act under partial observability of the states) are ubiquitous. Partially observable RL can be notoriously difficult -- well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the existence of large subclasses of POMDPs over which learning is tractable. In this paper we identify such a subclass, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning from interactions in overcomplete POMDPs, where the number of latent states can be larger than the number of observations.
    ColdGuess: A General and Effective Relational Graph Convolutional Network to Tackle Cold Start Cases. (arXiv:2205.12318v1 [cs.LG])
    Low-quality listings and bad actor behavior on online retail websites threaten e-commerce business, as they result in a sub-optimal buying experience and erode customer trust. When a new listing is created, how can we tell whether it is of good quality, using a method that is effective, fast, and scalable? Previous approaches often have three limitations: (1) they are unable to handle cold-start problems, where new sellers/listings lack sufficient selling history; (2) they cannot score hundreds of millions of listings at scale, or they compromise performance for scalability; and (3) they face space challenges from the large-scale graphs implied by the size of an e-commerce business. To overcome these limitations, we propose ColdGuess, an inductive graph-based risk predictor built upon a heterogeneous seller-product graph, which effectively identifies risky sellers/products/listings at scale. ColdGuess tackles the large-scale graph with consolidated nodes and addresses the cold-start problem using homogeneous influence. Evaluation on real data demonstrates that ColdGuess has stable performance as the number of unknown features increases. It outperforms LightGBM by up to 34 percentage points of ROC-AUC in a cold-start case where a new seller sells a new product. The resulting system, ColdGuess, is effective, adaptable to changing risky-seller behavior, and is already in production.
    Principal Components Bias in Over-parameterized Linear Models, and its Manifestation in Deep Neural Networks. (arXiv:2105.05553v7 [cs.LG] UPDATED)
    Recent work suggests that convolutional neural networks of different architectures learn to classify images in the same order. To understand this phenomenon, we revisit the over-parametrized deep linear network model. Our analysis reveals that, when the hidden layers are wide enough, the convergence rate of this model's parameters is exponentially faster along the directions of the larger principal components of the data, at a rate governed by the corresponding singular values. We term this convergence pattern the Principal Components bias (PC-bias). Empirically, we show how the PC-bias streamlines the order of learning of both linear and non-linear networks, more prominently at earlier stages of learning. We then compare our results to the simplicity bias, showing that both biases can be seen independently, and affect the order of learning in different ways. Finally, we discuss how the PC-bias may explain some benefits of early stopping and its connection to PCA, and why deep networks converge more slowly with random labels.
    Fast & Furious: Modelling Malware Detection as Evolving Data Streams. (arXiv:2205.12311v1 [cs.CR])
    Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware developers unceasingly change their samples' features to bypass detection. This constant evolution of malware samples causes changes to the data distribution (i.e., concept drift) that directly affect ML model detection rates. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (~130K apps) and AndroZoo (~350K apps). Android is a ubiquitous operating system for smartphones, which stimulates attackers to regularly create and update malware for the platform. We conducted a longitudinal evaluation by (i) classifying malware samples collected over nine years (2009-2018), (ii) reviewing concept drift detection algorithms to attest to its pervasiveness, (iii) comparing distinct ML approaches to mitigate the issue, and (iv) proposing an ML data stream pipeline that outperformed literature approaches. As a result, we observed that updating every component of the pipeline in response to concept drift allows the classification model to achieve increasing detection rates as the data representation (extracted features) is updated. Furthermore, we discuss the impact of the changes on the classification models by comparing the variations in the extracted features.
    TorchNTK: A Library for Calculation of Neural Tangent Kernels of PyTorch Models. (arXiv:2205.12372v1 [cs.LG])
    We introduce torchNTK, a Python library to calculate the empirical neural tangent kernel (NTK) of neural network models in the PyTorch framework. We provide an efficient method to calculate the NTK of multilayer perceptrons. We compare the explicit differentiation implementation against autodifferentiation implementations, which have the benefit of extending the utility of the library to any architecture supported by PyTorch, such as convolutional networks. A feature of the library is that it exposes layerwise NTK components to the user, and we show that in some regimes a layerwise calculation is more memory efficient. We conduct preliminary experiments to demonstrate use cases for the software and probe the NTK.
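    Independent of torchNTK's actual interface, the empirical NTK it computes can be sketched directly with autograd, assuming a scalar-output model: entry K[i, j] is the inner product of parameter gradients at inputs i and j.

        import torch

        def empirical_ntk(model, x1, x2):
            # K[i, j] = <grad_theta f(x1_i), grad_theta f(x2_j)>, computed by
            # stacking flattened per-sample parameter gradients (a simple
            # autodiff sketch; quadratic in memory, fine for small models).
            params = [p for p in model.parameters() if p.requires_grad]

            def grad_vec(x):
                g = torch.autograd.grad(model(x.unsqueeze(0)).squeeze(), params)
                return torch.cat([gi.reshape(-1) for gi in g])

            J1 = torch.stack([grad_vec(x) for x in x1])  # (n1, n_params)
            J2 = torch.stack([grad_vec(x) for x in x2])  # (n2, n_params)
            return J1 @ J2.T                             # (n1, n2) kernel

        net = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.Tanh(),
                                  torch.nn.Linear(64, 1))
        K = empirical_ntk(net, torch.randn(4, 10), torch.randn(5, 10))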
    Multi-Agent Low-Dimensional Linear Bandits. (arXiv:2007.01442v4 [cs.LG] UPDATED)
    We study a multi-agent stochastic linear bandit with side information, parameterized by an unknown vector $\theta^* \in \mathbb{R}^d$. The side information consists of a finite collection of low-dimensional subspaces, one of which contains $\theta^*$. In our setting, agents can collaborate to reduce regret by sending recommendations across a communication graph connecting them. We present a novel decentralized algorithm, where agents communicate subspace indices with each other and each agent plays a projected variant of LinUCB on the corresponding (low-dimensional) subspace. By distributing the search for the optimal subspace across users and learning of the unknown vector by each agent in the corresponding low-dimensional subspace, we show that the per-agent finite-time regret is much smaller than the case when agents do not communicate. We finally complement these results through simulations.
    EGR: Equivariant Graph Refinement and Assessment of 3D Protein Complex Structures. (arXiv:2205.10390v2 [cs.LG] UPDATED)
    Protein complexes are macromolecules essential to the functioning and well-being of all living organisms. As the structure of a protein complex, in particular its region of interaction between multiple protein subunits (i.e., chains), has a notable influence on the biological function of the complex, computational methods that can quickly and effectively be used to refine and assess the quality of a protein complex's 3D structure can directly be used within a drug discovery pipeline to accelerate the development of new therapeutics and improve the efficacy of future vaccines. In this work, we introduce the Equivariant Graph Refiner (EGR), a novel E(3)-equivariant graph neural network (GNN) for multi-task structure refinement and assessment of protein complexes. Our experiments on new, diverse protein complex datasets, all of which we make publicly available in this work, demonstrate the state-of-the-art effectiveness of EGR for atomistic refinement and assessment of protein complexes and outline directions for future work in the field. In doing so, we establish a baseline for future studies in macromolecular refinement and structure analysis.
    K-12BERT: BERT for K-12 education. (arXiv:2205.12335v1 [cs.CL])
    Online education platforms are powered by various NLP pipelines, which utilize models like BERT to aid in content curation. Since the inception of the pre-trained language models like BERT, there have also been many efforts toward adapting these pre-trained models to specific domains. However, there has not been a model specifically adapted for the education domain (particularly K-12) across subjects to the best of our knowledge. In this work, we propose to train a language model on a corpus of data curated by us across multiple subjects from various sources for K-12 education. We also evaluate our model, K12-BERT, on downstream tasks like hierarchical taxonomy tagging.
    A comparative study of non-deep learning, deep learning, and ensemble learning methods for sunspot number prediction. (arXiv:2203.05757v2 [astro-ph.SR] UPDATED)
    Solar activity has significant impacts on human activities and health. One of the most commonly used measures of solar activity is the sunspot number. This paper compares three important non-deep learning models, four popular deep learning models, and their five ensemble models in forecasting sunspot numbers. In particular, we propose an ensemble model called XGBoost-DL, which uses XGBoost as a two-level nonlinear ensemble method to combine the deep learning models. Our XGBoost-DL achieves the best forecasting performance (RMSE = 25.70 and MAE = 19.82) in the comparison, outperforming the best non-deep learning model SARIMA (RMSE = 54.11 and MAE = 45.51), the best deep learning model Informer (RMSE = 29.90 and MAE = 22.35) and NASA's forecast (RMSE = 48.38 and MAE = 38.45). Our XGBoost-DL forecasts a peak sunspot number of 133.47 in May 2025 for Solar Cycle 25 and 164.62 in November 2035 for Solar Cycle 26, similar to but later than NASA's forecasts of 137.7 in October 2024 and 161.2 in December 2034. An open-source Python package of our XGBoost-DL for sunspot number prediction is available at https://github.com/yd1008/ts_ensemble_sunspot.
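    The two-level idea can be sketched generically: base forecasters produce out-of-sample predictions, and XGBoost learns a nonlinear combination of them. Everything below (data, noise levels, hyperparameters) is a placeholder, not the paper's configuration.

        import numpy as np
        from xgboost import XGBRegressor

        rng = np.random.default_rng(0)
        y = rng.normal(size=500)                    # monthly sunspot-like target
        base_forecasts = np.column_stack([          # stand-ins for deep models'
            y + rng.normal(scale=s, size=500)       # out-of-sample predictions
            for s in (0.3, 0.5, 0.8, 1.0)])         # (e.g., Informer, LSTM, ...)

        train, test = slice(0, 400), slice(400, 500)
        ensemble = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
        ensemble.fit(base_forecasts[train], y[train])   # level-2 combiner
        pred = ensemble.predict(base_forecasts[test])
        rmse = np.sqrt(np.mean((pred - y[test]) ** 2))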
    Should You Mask 15% in Masked Language Modeling?. (arXiv:2202.08005v2 [cs.CL] UPDATED)
    Masked language models conventionally use a masking rate of 15% due to the belief that more masking would provide insufficient context to learn good representations, and less masking would make training too expensive. Surprisingly, we find that masking up to 40% of input tokens can outperform the 15% baseline, and even masking 80% can preserve most of the performance, as measured by finetuning on downstream tasks. Increasing the masking rates has two distinct effects, which we investigate through careful ablations: (1) A larger proportion of input tokens are corrupted, reducing the context size and creating a harder task, and (2) models perform more predictions, which benefits training. We observe that larger models with more capacity to tackle harder tasks in particular favor higher masking rates. We also find that even more sophisticated masking schemes such as span masking or PMI masking can benefit from higher masking rates, albeit to a smaller extent. Our results contribute to a better understanding of masked language modeling and shed light on more efficient language pre-training.
    Federated Self-supervised Learning for Heterogeneous Clients. (arXiv:2205.12493v1 [cs.LG])
    Federated Learning has become an important learning paradigm due to its privacy and computational benefits. As the field advances, two key challenges that still remain to be addressed are: (1) system heterogeneity - variability in the compute and/or data resources present on each client, and (2) lack of labeled data in certain federated settings. Several recent developments have tried to overcome these challenges independently. In this work, we propose a unified and systematic framework, \emph{Heterogeneous Self-supervised Federated Learning} (Hetero-SSFL) for enabling self-supervised learning with federation on heterogeneous clients. The proposed framework allows collaborative representation learning across all the clients without imposing architectural constraints or requiring presence of labeled data. The key idea in Hetero-SSFL is to let each client train its unique self-supervised model and enable the joint learning across clients by aligning the lower dimensional representations on a common dataset. The entire training procedure could be viewed as self and peer-supervised as both the local training and the alignment procedures do not require presence of any labeled data. As in conventional self-supervised learning, the obtained client models are task independent and can be used for varied end-tasks. We provide a convergence guarantee of the proposed framework for non-convex objectives in heterogeneous settings and also empirically demonstrate that our proposed approach outperforms the state of the art methods by a significant margin.
    Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models. (arXiv:2205.12694v1 [cs.CL])
    Model compression by way of parameter pruning, quantization, or distillation has recently gained popularity as an approach for reducing the computational requirements of modern deep neural network models for NLP. Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models that is compatible with a wide variety of contemporary off-the-shelf hardware (unlike quantization), and that requires little additional training (unlike distillation). Pruning approaches typically take a large, accurate model as input, then attempt to discover a smaller subnetwork of that model capable of achieving end-task accuracy comparable to the full model. Inspired by previous work suggesting a connection between simpler, more generalizable models and those that lie within flat basins in the loss landscape, we propose to directly optimize for flat minima while performing task-specific pruning, which we hypothesize should lead to simpler parameterizations and thus more compressible models. In experiments combining sharpness-aware minimization with both iterative magnitude pruning and structured pruning approaches, we show that optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization when fine-tuning BERT models, leading to higher rates of compression with little to no loss in accuracy on the GLUE classification benchmark.
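    The pruning pipelines are specific to the paper, but the sharpness-aware step at their core is the standard two-pass SAM update. A minimal sketch, assuming a closure that recomputes the training loss and that every parameter receives a gradient; rho is the assumed neighborhood radius.

        import torch

        def sam_step(params, loss_fn, optimizer, rho=0.05):
            loss_fn().backward()                 # 1) gradient at current weights
            grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
            eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
            with torch.no_grad():                # 2) ascend to a nearby point
                for p, e in zip(params, eps):
                    p.add_(e)
            optimizer.zero_grad()
            loss_fn().backward()                 # 3) gradient at perturbed weights
            with torch.no_grad():                # 4) undo perturbation, descend
                for p, e in zip(params, eps):
                    p.sub_(e)
            optimizer.step()
            optimizer.zero_grad()

        # Hypothetical usage with a closure that recomputes the training loss:
        model = torch.nn.Linear(10, 2)
        opt = torch.optim.Adam(model.parameters())
        x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
        loss_fn = lambda: torch.nn.functional.cross_entropy(model(x), y)
        sam_step(list(model.parameters()), loss_fn, opt)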
    Misleading Deep-Fake Detection with GAN Fingerprints. (arXiv:2205.12543v1 [cs.CV])
    Generative adversarial networks (GANs) have made remarkable progress in synthesizing realistic-looking images that effectively outsmart even humans. Although several detection methods can recognize these deep fakes by checking for image artifacts from the generation process, multiple counterattacks have demonstrated their limitations. These attacks, however, still require certain conditions to hold, such as interacting with the detection method or adjusting the GAN directly. In this paper, we introduce a novel class of simple counterattacks that overcomes these limitations. In particular, we show that an adversary can remove indicative artifacts, the GAN fingerprint, directly from the frequency spectrum of a generated image. We explore different realizations of this removal, ranging from filtering high frequencies to more nuanced frequency-peak cleansing. We evaluate the performance of our attack with different detection methods, GAN architectures, and datasets. Our results show that an adversary can often remove GAN fingerprints and thus evade the detection of generated images.
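    The simplest variant of the attack can be sketched in a few lines of NumPy: low-pass filter the spectrum of a generated image so high-frequency fingerprint artifacts are suppressed. The cutoff parameter below is an assumption, not a value from the paper.

        import numpy as np

        def remove_high_freq_fingerprint(img, keep_radius=0.35):
            # Low-pass filter each channel in the frequency domain;
            # keep_radius is the fraction of the spectrum retained.
            out = np.empty_like(img, dtype=np.float64)
            h, w = img.shape[:2]
            yy, xx = np.ogrid[:h, :w]
            mask = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2
                    <= (keep_radius * min(h, w)) ** 2)   # circular low-pass mask
            for c in range(img.shape[2]):
                spec = np.fft.fftshift(np.fft.fft2(img[..., c]))
                out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
            return np.clip(out, 0, 255)

        cleaned = remove_high_freq_fingerprint(np.random.rand(256, 256, 3) * 255)

    The paper's more nuanced variants target specific frequency peaks rather than the whole high band, trading off artifact removal against visible blurring.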
    Inception Transformer. (arXiv:2205.12956v1 [cs.CV])
    Recent studies show that Transformer has strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing the high-frequency information to Transformers. Different from recent hybrid frameworks, the Inception mixer brings greater efficiency through a channel splitting mechanism to adopt parallel convolution/max-pooling paths and a self-attention path as high- and low-frequency mixers, while having the flexibility to model discriminative information scattered within a wide frequency range. Considering that bottom layers play more roles in capturing high-frequency details while top layers more in modeling low-frequency global information, we further introduce a frequency ramp structure, i.e., gradually decreasing the dimensions fed to the high-frequency mixer and increasing those to the low-frequency mixer, which can effectively trade off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation. For example, our iFormer-S reaches a top-1 accuracy of 83.4% on ImageNet-1K, 3.6 points higher than DeiT-S, and even slightly better than the much bigger Swin-B (83.3%) with only 1/4 of the parameters and 1/3 of the FLOPs. Code and models will be released at https://github.com/sail-sg/iFormer.
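    A much-simplified sketch of the channel-splitting idea (illustrative sizes; the real Inception mixer includes additional up/down-sampling and a fusion step): high-frequency channels go through convolution and max-pooling paths, low-frequency channels through self-attention.

        import torch
        import torch.nn as nn

        class InceptionMixer(nn.Module):
            def __init__(self, dim=64, high_dim=32, heads=4):
                super().__init__()
                h = high_dim // 2
                self.h, self.low = h, dim - high_dim
                self.conv_path = nn.Conv2d(h, h, 3, padding=1, groups=h)
                self.pool_path = nn.Sequential(nn.MaxPool2d(3, 1, 1),
                                               nn.Conv2d(h, h, 1))
                self.attn = nn.MultiheadAttention(self.low, heads,
                                                  batch_first=True)

            def forward(self, x):                 # x: (B, C, H, W)
                hc, hp, lo = torch.split(x, [self.h, self.h, self.low], dim=1)
                hc = self.conv_path(hc)           # high-frequency, local
                hp = self.pool_path(hp)           # high-frequency, local
                B, C, H, W = lo.shape
                t = lo.flatten(2).transpose(1, 2) # (B, H*W, C) tokens
                lo = self.attn(t, t, t)[0].transpose(1, 2).reshape(B, C, H, W)
                return torch.cat([hc, hp, lo], dim=1)

        mixer = InceptionMixer()
        y = mixer(torch.randn(2, 64, 14, 14))     # (2, 64, 14, 14)

    The frequency ramp then amounts to choosing a larger high_dim in early layers and a smaller one in later layers.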
    Physics Guided Machine Learning for Variational Multiscale Reduced Order Modeling. (arXiv:2205.12419v1 [physics.flu-dyn])
    We propose a new physics guided machine learning (PGML) paradigm that leverages the variational multiscale (VMS) framework and available data to dramatically increase the accuracy of reduced order models (ROMs) at a modest computational cost. The hierarchical structure of the ROM basis and the VMS framework enable a natural separation of the resolved and unresolved ROM spatial scales. Modern PGML algorithms are used to construct novel models for the interaction among the resolved and unresolved ROM scales. Specifically, the new framework builds ROM operators that are closest to the true interaction terms in the VMS framework. Finally, machine learning is used to reduce the projection error and further increase the ROM accuracy. Our numerical experiments for a two-dimensional vorticity transport problem show that the novel PGML-VMS-ROM paradigm maintains the low computational cost of current ROMs, while significantly increasing the ROM accuracy.
    Deep Aesthetic Assessment and Retrieval of Breast Cancer Treatment Outcomes. (arXiv:2205.12611v1 [cs.CV])
    Treatments for breast cancer have continued to evolve and improve in recent years, resulting in a substantial increase in survival rates, with approximately 80\% of patients having a 10-year survival period. Given the serious impact that breast cancer treatments can have on a patient's body image, consequently affecting her self-confidence and sexual and intimate relationships, it is paramount to ensure that women receive the treatment that optimizes both survival and aesthetic outcomes. Currently, there is no gold standard for evaluating the aesthetic outcome of breast cancer treatment. In addition, there is no standard way to show patients the potential outcome of surgery. The presentation of similar cases from the past would be extremely important to manage women's expectations of the possible outcome. In this work, we propose a deep neural network to perform the aesthetic evaluation. As a proof-of-concept, we focus on a binary aesthetic evaluation. Besides its use for classification, this deep neural network can also be used to find the most similar past cases by searching for nearest neighbours in the highly semantic space before classification. We performed the experiments on a dataset consisting of 143 photos of women after conservative treatment for breast cancer. The results for accuracy and balanced accuracy showed the superior performance of our proposed model compared to the state of the art in aesthetic evaluation of breast cancer treatments. In addition, the model showed a good ability to retrieve similar previous cases, with the retrieved cases having the same or adjacent class (in the 4-class setting) and having similar types of asymmetry. Finally, a qualitative interpretability assessment was also performed to analyse the robustness and trustworthiness of the model.
    Low-rank Optimal Transport: Approximation, Statistics and Debiasing. (arXiv:2205.12365v1 [stat.ML])
    The matching principles behind optimal transport (OT) play an increasingly important role in machine learning, a trend which can be observed when OT is used to disambiguate datasets in applications (e.g. single-cell genomics) or used to improve more complex methods (e.g. balanced attention in transformers or self-supervised learning). To scale to more challenging problems, there is a growing consensus that OT requires solvers that can operate on millions, not thousands, of points. The low-rank optimal transport (LOT) approach advocated in \cite{scetbon2021lowrank} holds several promises in that regard, and was shown to complement more established entropic regularization approaches, being able to insert itself in more complex pipelines, such as quadratic OT. LOT restricts the search for low-cost couplings to those that have a low-nonnegative rank, yielding linear time algorithms in cases of interest. However, these promises can only be fulfilled if the LOT approach is seen as a legitimate contender to entropic regularization when compared on properties of interest, where the scorecard typically includes theoretical properties (statistical bounds, relation to other methods) or practical aspects (debiasing, hyperparameter tuning, initialization). We target each of these areas in this paper in order to cement the impact of low-rank approaches in computational OT.
    Ultra-compact Binary Neural Networks for Human Activity Recognition on RISC-V Processors. (arXiv:2205.12781v1 [cs.LG])
    Human Activity Recognition (HAR) is a relevant inference task in many mobile applications. State-of-the-art HAR at the edge is typically achieved with lightweight machine learning models such as decision trees and Random Forests (RFs), whereas deep learning is less common due to its high computational complexity. In this work, we propose a novel implementation of HAR based on deep neural networks, and precisely on Binary Neural Networks (BNNs), targeting low-power general purpose processors with a RISC-V instruction set. BNNs yield very small memory footprints and low inference complexity, thanks to the replacement of arithmetic operations with bit-wise ones. However, existing BNN implementations on general purpose processors impose constraints tailored to complex computer vision tasks, which result in over-parametrized models for simpler problems like HAR. Therefore, we also introduce a new BNN inference library, which targets ultra-compact models explicitly. With experiments on a single-core RISC-V processor, we show that BNNs trained on two HAR datasets obtain higher classification accuracy compared to a state-of-the-art baseline based on RFs. Furthermore, our BNN reaches the same accuracy of a RF with either less memory (up to 91%) or more energy-efficiency (up to 70%), depending on the complexity of the features extracted by the RF.
    VAEL: Bridging Variational Autoencoders and Probabilistic Logic Programming. (arXiv:2202.04178v2 [cs.PL] UPDATED)
    We present VAEL, a neuro-symbolic generative model integrating variational autoencoders (VAE) with the reasoning capabilities of probabilistic logic (L) programming. Besides standard latent subsymbolic variables, our model exploits a probabilistic logic program to define a further structured representation, which is used for logical reasoning. The entire process is end-to-end differentiable. Once trained, VAEL can solve new unseen generation tasks by (i) leveraging the previously acquired knowledge encoded in the neural component and (ii) exploiting new logical programs on the structured latent space. Our experiments provide support on the benefits of this neuro-symbolic integration both in terms of task generalization and data efficiency. To the best of our knowledge, this work is the first to propose a general-purpose end-to-end framework integrating probabilistic logic programming into a deep generative model.
    Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors. (arXiv:2205.12569v1 [cs.CR])
    As in other cybersecurity areas, machine learning (ML) techniques have emerged as a promising solution to detect Android malware. In this sense, many proposals employing a variety of algorithms and feature sets have been presented to date, often reporting impressive detection performance. However, the lack of reproducibility and the absence of a standard evaluation framework make these proposals difficult to compare. In this paper, we perform an analysis of 10 influential research works on Android malware detection using a common evaluation framework. We have identified five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models and their performances. In particular, we analyze the effect of (1) the presence of duplicated samples, (2) label (goodware/greyware/malware) attribution, (3) class imbalance, (4) the presence of apps that use evasion techniques and (5) the evolution of apps. Based on this extensive experimentation, we conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results. Our findings also highlight that it is imperative to generate realistic datasets, taking into account the factors mentioned above, to enable the design and evaluation of better solutions for Android malware detection.
    Black box tests for algorithmic stability. (arXiv:2111.15546v2 [cs.LG] UPDATED)
    Algorithmic stability is a concept from learning theory that expresses the degree to which changes to the input data (e.g., removal of a single data point) may affect the outputs of a regression algorithm. Knowing an algorithm's stability properties is often useful for many downstream applications -- for example, stability is known to lead to desirable generalization properties and predictive inference guarantees. However, many modern algorithms currently used in practice are too complex for a theoretical analysis of their stability properties, and thus we can only attempt to establish these properties through an empirical exploration of the algorithm's behavior on various data sets. In this work, we lay out a formal statistical framework for this kind of "black box testing" without any assumptions on the algorithm or the data distribution, and establish fundamental bounds on the ability of any black box test to identify algorithmic stability.
    Deletion and Insertion Tests in Regression Models. (arXiv:2205.12423v1 [cs.LG])
    A basic task in explainable AI (XAI) is to identify the most important features behind a prediction made by a black box function $f$. The insertion and deletion tests of Petsiuk et al. (2018) are used to judge the quality of algorithms that rank pixels from most to least important for a classification. Motivated by regression problems, we establish a formula for their area under the curve (AUC) criteria in terms of certain main effects and interactions in an anchored decomposition of $f$. We find an expression for the expected value of the AUC under a random ordering of inputs to $f$ and propose an alternative area above a straight line for the regression setting. We use this criterion to compare feature importances computed by integrated gradients (IG) to those computed by Kernel SHAP (KS). Exact computation of KS grows exponentially with dimension, while that of IG grows linearly with dimension. In two data sets including binary variables we find that KS is superior to IG in insertion and deletion tests, but only by a very small amount. Our comparison problems include some binary inputs that pose a challenge to IG because it must use values between the possible variable levels. We show that IG will match KS when $f$ is an additive function plus a multilinear function of the variables. This includes a multilinear interpolation over the binary variables that would cause IG to have exponential cost in a naive implementation.
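    To make the deletion test concrete in the regression setting the abstract discusses, here is a minimal sketch (illustrative, not the paper's code): features of an input are replaced by baseline values in ranked order, and the area under the resulting curve of $f$'s outputs summarizes the quality of the ranking.

        import numpy as np

        def deletion_curve(f, x, baseline, order):
            # Replace features of x by baseline values, most important first,
            # recording f's output after each deletion.
            z = x.copy()
            curve = [f(z)]
            for j in order:
                z[j] = baseline[j]
                curve.append(f(z))
            return np.array(curve)

        def deletion_auc(curve):
            # Trapezoidal area under the deletion curve on a [0, 1] grid.
            dx = 1.0 / (len(curve) - 1)
            return float(np.sum((curve[1:] + curve[:-1]) * dx / 2.0))

        f = lambda z: 3 * z[0] + z[1] - 2 * z[2]   # toy regression function
        x, baseline = np.array([1.0, 1.0, 1.0]), np.zeros(3)
        print(deletion_auc(deletion_curve(f, x, baseline, order=[0, 2, 1])))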
    VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection. (arXiv:2205.12424v1 [cs.CR])
    This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity and limited cost in terms of training data size and number of model parameters.
    Generating Natural Language Proofs with Verifier-Guided Search. (arXiv:2205.12443v1 [cs.CL])
    Deductive reasoning (drawing conclusions from assumptions) is a challenging problem in NLP. In this work, we focus on proof generation: given a hypothesis and a set of supporting facts in natural language, the model generates a proof tree indicating how to deduce the hypothesis from supporting facts. Instead of generating the entire proof in one shot, prior work has demonstrated the promise of stepwise generation but achieved limited success on real-world data. Existing stepwise methods struggle to generate proof steps that are both valid and relevant. In this paper, we present a novel stepwise method, NLProofS (Natural Language Proof Search), which learns to generate relevant steps conditioned on the hypothesis. At the core of our approach, we train an independent verifier to check the validity of proof steps. Instead of generating steps greedily, we search for proofs maximizing a global proof score judged by the verifier. NLProofS achieves state-of-the-art performance on EntailmentBank and RuleTaker. For example, it improves the percentage of correctly predicted proofs from 20.9% to 33.3% in the distractor setting of EntailmentBank. This is the first time stepwise methods have led to better generation of challenging human-authored proofs.
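    The search loop described above can be sketched as follows (a simplification under assumed interfaces: generate_steps proposes candidate steps with a .conclusion attribute and verifier_score returns a validity score; neither is specified in the abstract at this level of detail):

        import heapq, itertools

        def prove(hypothesis, facts, generate_steps, verifier_score,
                  beam=5, max_depth=4):
            # Best-first search over partial proofs; a proof's score is the
            # product of its steps' verifier scores.
            tie = itertools.count()  # tiebreaker so the heap never compares states
            frontier = [(-1.0, next(tie), [], frozenset(facts))]
            for _ in range(max_depth):
                candidates = []
                for neg_score, _, steps, known in frontier:
                    for step in generate_steps(known, hypothesis):
                        score = -neg_score * verifier_score(step)
                        if step.conclusion == hypothesis:
                            return steps + [step], score
                        candidates.append((-score, next(tie), steps + [step],
                                           known | {step.conclusion}))
                frontier = heapq.nsmallest(beam, candidates)
            return None, 0.0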
    Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech. (arXiv:2011.01174v5 [eess.AS] UPDATED)
    Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of a perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied regardless of the TTS model architecture or the cause of speech quality degradation, and it is efficient in that it increases neither the inference time nor the model complexity. The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves on previous models in terms of both naturalness and intelligibility.
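    The perceptual loss itself is simple to state; a minimal PyTorch-style sketch (assuming a frozen, pre-trained MOS predictor mos_net, which the paper trains separately) is:

        import torch

        MAX_MOS = 5.0  # the maximum possible opinion score

        def perceptual_loss(mos_net, synthesized_mel):
            # Distance between the best achievable MOS and the MOS predicted
            # for the synthesized speech; mos_net stays frozen.
            return (MAX_MOS - mos_net(synthesized_mel)).mean()

        # Sketch of a training step: the usual TTS loss plus the perceptual
        # term (tts, text, target_mel, recon_loss and lam are placeholders).
        #   mel = tts(text)
        #   loss = recon_loss(mel, target_mel) + lam * perceptual_loss(mos_net, mel)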
    Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks. (arXiv:2204.02892v2 [cs.CL] UPDATED)
    The field of Natural Language Processing has experienced a dramatic leap in capabilities with the recent introduction of huge Language Models. Despite this success, natural language problems that involve several compounded steps are still practically unlearnable, even by the largest LMs. This is consistent with experimental failures of end-to-end learning of composite problems that have been demonstrated in a variety of domains. An effective mitigation is to introduce intermediate supervision for solving sub-tasks of the compounded problem. Recently, several works have demonstrated high gains by taking a straightforward approach for incorporating intermediate supervision in compounded natural language problems: the sequence-to-sequence LM is fed with an augmented input, in which the decomposed tasks' labels are simply concatenated to the original input. In this paper, we prove a positive learning result that motivates these recent efforts. We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. We show that this is true for any family of tasks which, on the one hand, are unlearnable, and on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on O(1) previous sub-task results. Beyond motivating contemporary empirical efforts for incorporating intermediate supervision in sequence-to-sequence language models, our positive theoretical result is the first of its kind in the landscape of results on the benefits of intermediate supervision for neural-network learning: until now, all theoretical results on the subject have been negative, i.e., they show cases where learning is impossible without intermediate supervision, while our result is positive, showing that learning is facilitated in the presence of intermediate supervision.
    Beyond Impossibility: Balancing Sufficiency, Separation and Accuracy. (arXiv:2205.12327v1 [cs.LG])
    Among the various aspects of algorithmic fairness studied in recent years, the tension between satisfying both sufficiency and separation -- e.g. the ratios of positive or negative predictive values, and false positive or false negative rates across groups -- has received much attention. Following a debate sparked by COMPAS, a criminal justice predictive system, the academic community has responded by laying out important theoretical understanding, showing that one cannot achieve both with an imperfect predictor when there is no equal distribution of labels across the groups. In this paper, we shed more light on what might still be possible beyond the impossibility -- the existence of a trade-off means we should aim to find a good balance within it. After refining the existing theoretical result, we propose an objective that aims to balance sufficiency and separation measures, while maintaining similar accuracy levels. We show the use of such an objective in two empirical case studies, one involving a multi-objective framework, and the other fine-tuning of a model pre-trained for accuracy. We show promising results, where better trade-offs are achieved compared to existing alternatives.
    sat2pc: Estimating Point Cloud of Building Roofs from 2D Satellite Images. (arXiv:2205.12464v1 [cs.CV])
    Three-dimensional (3D) urban models have gained interest because of their applications in many use-cases such as urban planning and virtual reality. However, generating these 3D representations requires LiDAR data, which are not always readily available. Thus, the applicability of automated 3D model generation algorithms is limited to a few locations. In this paper, we propose sat2pc, a deep learning architecture that predicts the point cloud of a building roof from a single 2D satellite image. Our architecture combines Chamfer distance and EMD loss, resulting in better 2D to 3D performance. We extensively evaluate our model and perform ablation studies on a building roof dataset. Our results show that sat2pc was able to outperform existing baselines by at least 18.6%. Further, we show that the predicted point cloud captures more detail and geometric characteristics than other baselines.
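    The combined point-cloud loss mentioned in the abstract can be sketched as follows (illustrative only; the weighting and the exact EMD variant used by sat2pc are assumptions, and the Hungarian matching below is the exact but slow form of EMD for equal-size clouds):

        import torch
        from scipy.optimize import linear_sum_assignment

        def chamfer(a, b):
            # a: (N, 3), b: (M, 3); symmetric nearest-neighbor distance.
            d = torch.cdist(a, b)
            return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

        def emd(a, b):
            # Exact one-to-one matching cost for equal-size clouds.
            d = torch.cdist(a, b)
            rows, cols = linear_sum_assignment(d.detach().cpu().numpy())
            return d[torch.as_tensor(rows), torch.as_tensor(cols)].mean()

        def combined_loss(pred, target, alpha=0.5):
            # Weighted combination of the two terms (weighting is an assumption).
            return alpha * chamfer(pred, target) + (1 - alpha) * emd(pred, target)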
    NECA: Network-Embedded Deep Representation Learning for Categorical Data. (arXiv:2205.12752v1 [cs.LG])
    We propose NECA, a deep representation learning method for categorical data. Built upon the foundations of network embedding and deep unsupervised representation learning, NECA deeply embeds the intrinsic relationship among attribute values and explicitly expresses data objects with numeric vector representations. Designed specifically for categorical data, NECA can support important downstream data mining tasks, such as clustering. Extensive experimental analysis demonstrated the effectiveness of NECA.
    Toward Discovering Options that Achieve Faster Planning. (arXiv:2205.12515v1 [cs.LG])
    We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. For a given set of episodic tasks and a given number of options, the objective prefers options with which a high return can be achieved by composing only a few of them; composing few options enables fast planning. When faced with new tasks similar to the given ones, the discovered options are also expected to accelerate planning. Our objective extends the objective proposed by Harb et al. (2018) for the single-task setting to the multi-task setting. A closer look at Harb et al.'s objective shows that the best options discovered for one task are not likely to be useful for future unseen tasks and that the multi-task setting is indeed necessary for this purpose. In the same paper, Harb et al. also proposed an algorithm to optimize their objective, and the algorithm can be naturally extended to the multi-task setting. We empirically show that in the four-room domain the extension does not achieve a high objective value and propose a new algorithm that better optimizes the proposed objective. In the same four-room domain, we show that 1) a higher objective value is typically associated with options with which fewer planning iterations are needed to achieve near-optimal performance, 2) our new algorithm achieves a high objective value, which is close to the value achieved by a set of human-designed options, 3) the best number of planning iterations given the discovered options is much smaller and matches that obtained with human-designed options, and 4) the options produced by our algorithm also make intuitive sense because they move to and terminate at cells near hallways connecting two neighboring rooms.
    Mathematical Models of Human Drivers Using Artificial Risk Fields. (arXiv:2205.12722v1 [cs.LG])
    In this paper, we use the concept of artificial risk fields to predict how human operators control a vehicle in response to upcoming road situations. A risk field assigns a non-negative risk measure to the state of the system in order to model how close that state is to violating a safety property, such as hitting an obstacle or exiting the road. Using risk fields, we construct a stochastic model of the operator that maps from states to likely actions. We demonstrate our approach on a driving task wherein human subjects are asked to drive a car inside a realistic driving simulator while avoiding obstacles placed on the road. We show that the most likely risk field given the driving data is obtained by solving a convex optimization problem. Next, we apply the inferred risk fields to generate distinct driving behaviors while comparing predicted trajectories against ground truth measurements. We observe that the risk fields are excellent at predicting future trajectory distributions, with high prediction accuracy for prediction horizons of up to twenty seconds. At the same time, we observe some challenges, such as the inability to account for how drivers choose to accelerate or decelerate based on road conditions.
    Image Colorization using U-Net with Skip Connections and Fusion Layer on Landscape Images. (arXiv:2205.12867v1 [cs.CV])
    We present a novel technique to automatically colorize grayscale images that combines a U-Net model with Fusion Layer features. This approach allows the model to learn the colorization of images from a pre-trained U-Net. Moreover, the Fusion Layer is applied to merge local information, which depends on small image patches, with the global priors of the entire image, forming visually more compelling colorization results. Finally, we validate our approach with a user study evaluation and compare it against the state of the art, showing improvements.
    Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization. (arXiv:2205.12648v1 [cs.LG])
    We tackle real-world problems with complex structures beyond pixel-based games or simulators. We formulate them as few-shot reinforcement learning problems where a task is characterized by a subtask graph that defines a set of subtasks and their dependencies, which are unknown to the agent. Different from previous meta-RL methods that try to directly infer an unstructured task embedding, our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks, and uses it as a prior to improve task inference at test time. Our experimental results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to unseen tasks than various existing algorithms such as meta reinforcement learning, hierarchical reinforcement learning, and other heuristic agents.
    Mitigating multiple descents: A model-agnostic framework for risk monotonization. (arXiv:2205.12937v1 [math.ST])
    Recent empirical and theoretical analyses of several commonly used prediction procedures reveal a peculiar risk behavior in high dimensions, referred to as double/multiple descent, in which the asymptotic risk is a non-monotonic function of the limiting aspect ratio of the number of features or parameters to the sample size. To mitigate this undesirable behavior, we develop a general framework for risk monotonization based on cross-validation that takes as input a generic prediction procedure and returns a modified procedure whose out-of-sample prediction risk is, asymptotically, monotonic in the limiting aspect ratio. As part of our framework, we propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting, respectively, and show that, under very mild assumptions, they provably achieve monotonic asymptotic risk behavior. Our results are applicable to a broad variety of prediction procedures and loss functions, and do not require a well-specified (parametric) model. We exemplify our framework with concrete analyses of the minimum $\ell_2$- and $\ell_1$-norm least squares prediction procedures. As one of the ingredients in our analysis, we also derive novel additive and multiplicative forms of oracle risk inequalities for split cross-validation that are of independent interest.
    Lyapunov function approach for approximation algorithm design and analysis: with applications in submodular maximization. (arXiv:2205.12442v1 [math.OC])
    We propose a systematic two-phase framework for approximation algorithm design and analysis via Lyapunov functions. The first phase consists of using a Lyapunov function as a guideline to design a continuous-time algorithm with a provable approximation ratio. The second phase then converts the continuous-time algorithm to a discrete-time algorithm with the same approximation ratio and a provable time complexity. Some immediate benefits of the Lyapunov function approach include: (i) unifying many existing algorithms; (ii) providing a guideline to design and analyze new algorithms; and (iii) offering new perspectives to potentially improve existing algorithms. We use various submodular maximization problems as running examples to illustrate our framework.
    Conformal Prediction Intervals with Temporal Dependence. (arXiv:2205.12940v1 [stat.ML])
    Cross-sectional prediction is common in many domains such as healthcare, including forecasting tasks using electronic health records, where different patients form a cross-section. We focus on the task of constructing valid prediction intervals (PIs) in time-series regression with a cross-section. A prediction interval is considered valid if it covers the true response with (a pre-specified) high probability. We first distinguish between two notions of validity in such a setting: cross-sectional and longitudinal. Cross-sectional validity is concerned with validity across the cross-section of the time series data, while longitudinal validity accounts for the temporal dimension. Coverage guarantees along both these dimensions are ideally desirable; however, we show that distribution-free longitudinal validity is theoretically impossible. Despite this limitation, we propose Conformal Prediction with Temporal Dependence (CPTD), a procedure which is able to maintain strict cross-sectional validity while improving longitudinal coverage. CPTD is post-hoc and light-weight, and can easily be used in conjunction with any prediction model as long as a calibration set is available. We focus on neural networks due to their ability to model complicated data such as diagnosis codes for time-series regression, and perform extensive experimental validation to verify the efficacy of our approach. We find that CPTD outperforms baselines on a variety of datasets by improving longitudinal coverage and often providing more efficient (narrower) PIs.
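    CPTD itself is not specified in the abstract, but the cross-sectional baseline it strengthens is standard split conformal prediction, sketched here for regression with a calibration set (function and variable names are illustrative):

        import numpy as np

        def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
            # Calibrate absolute residuals, take a finite-sample-corrected
            # quantile, and widen the point prediction by that amount.
            residuals = np.abs(y_cal - model.predict(X_cal))
            n = len(residuals)
            level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
            q = np.quantile(residuals, level)
            pred = model.predict(x_new[None, :])[0]
            return pred - q, pred + q  # covers with probability >= 1 - alpha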
    Imposing Gaussian Pre-Activations in a Neural Network. (arXiv:2205.12379v1 [cs.LG])
    The goal of the present work is to propose a way to modify both the initialization distribution of the weights of a neural network and its activation function, such that all pre-activations are Gaussian. We propose a family of pairs initialization/activation, where the activation functions span a continuum from bounded functions (such as Heaviside or tanh) to the identity function. This work is motivated by the contradiction between existing works dealing with Gaussian pre-activations: on one side, the works in the line of the Neural Tangent Kernels and the Edge of Chaos are assuming it, while on the other side, theoretical and experimental results challenge this hypothesis. The family of pairs initialization/activation we are proposing will help us to answer this hot question: is it desirable to have Gaussian pre-activations in a neural network?
    xFraud: Explainable Fraud Transaction Detection. (arXiv:2011.12193v3 [cs.LG] UPDATED)
    At online retail platforms, it is crucial to actively detect the risks of transactions to improve customer experience and minimize financial loss. In this work, we propose xFraud, an explainable fraud transaction prediction framework which is mainly composed of a detector and an explainer. The xFraud detector can effectively and efficiently predict the legitimacy of incoming transactions. Specifically, it utilizes a heterogeneous graph neural network to learn expressive representations from the informative heterogeneously typed entities in the transaction logs. The explainer in xFraud can generate meaningful and human-understandable explanations from graphs to facilitate further processes in the business unit. In our experiments with xFraud on real transaction networks with up to 1.1 billion nodes and 3.7 billion edges, xFraud is able to outperform various baseline models in many evaluation metrics while remaining scalable in distributed settings. In addition, we show that xFraud explainer can generate reasonable explanations to significantly assist the business analysis via both quantitative and qualitative evaluations.
    Bayesian Physics-Informed Neural Networks for real-world nonlinear dynamical systems. (arXiv:2205.08304v2 [cs.LG] UPDATED)
    Understanding real-world dynamical phenomena remains a challenging task. Across various scientific disciplines, machine learning has advanced as the go-to technology to analyze nonlinear dynamical systems, identify patterns in big data, and make decisions based on them. Neural networks are now consistently used as universal function approximators for data with underlying mechanisms that are incompletely understood or exceedingly complex. However, neural networks alone ignore the fundamental laws of physics and often fail to make plausible predictions. Here we integrate data, physics, and uncertainties by combining neural networks, physics-informed modeling, and Bayesian inference to improve the predictive potential of traditional neural network models. We embed the physical model of a damped harmonic oscillator into a fully-connected feed-forward neural network to explore a simple and illustrative model system, the outbreak dynamics of COVID-19. Our Physics-Informed Neural Networks can seamlessly integrate data and physics, robustly solve forward and inverse problems, and perform well for both interpolation and extrapolation, even for a small amount of noisy and incomplete data. At only minor additional cost, they can self-adaptively learn the weighting between data and physics. Combined with Bayesian Neural Networks, they can serve as priors in a Bayesian Inference, and provide credible intervals for uncertainty quantification. Our study reveals the inherent advantages and disadvantages of Neural Networks, Bayesian Inference, and a combination of both, and provides valuable guidelines for model selection. While we have only demonstrated these approaches for the simple model problem of a seasonal endemic infectious disease, we anticipate that the underlying concepts and trends generalize to more complex disease conditions and, more broadly, to a wide variety of nonlinear dynamical systems.
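    For the damped harmonic oscillator mentioned above, the physics term of a PINN loss can be sketched with automatic differentiation (an illustrative sketch, assuming the standard ODE form x'' + 2*zeta*omega*x' + omega^2*x = 0 rather than the paper's exact parameterization):

        import torch

        def oscillator_residual(net, t, omega=2.0, zeta=0.1):
            # Residual of x'' + 2*zeta*omega*x' + omega^2*x = 0 at times t;
            # t has shape (N, 1) and net maps t to displacement x(t).
            t = t.clone().requires_grad_(True)
            x = net(t)
            dx = torch.autograd.grad(x, t, torch.ones_like(x), create_graph=True)[0]
            ddx = torch.autograd.grad(dx, t, torch.ones_like(dx), create_graph=True)[0]
            return ddx + 2 * zeta * omega * dx + omega ** 2 * x

        # Total loss (sketch): data misfit plus the squared physics residual.
        #   loss = mse(net(t_data), x_data) + lam * oscillator_residual(net, t_col).pow(2).mean()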
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v2 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other.
    Learning dynamics from partial observations with structured neural ODEs. (arXiv:2205.12550v1 [eess.SY])
    Identifying dynamical systems from experimental data is a notably difficult task. Prior knowledge generally helps, but the extent of this knowledge varies with the application, and customized models are often needed. We propose a flexible framework to incorporate a broad spectrum of physical insight into neural ODE-based system identification, giving physical interpretability to the resulting latent space. This insight is either enforced through hard constraints in the optimization problem or added in its cost function. In order to link the partial and possibly noisy observations to the latent state, we rely on tools from nonlinear observer theory to build a recognition model. We demonstrate the performance of the proposed approach on numerical simulations and on an experimental dataset from a robotic exoskeleton.
    Heterogeneous Reservoir Computing Models for Persian Speech Recognition. (arXiv:2205.12594v1 [cs.SD])
    Over the last decade, deep-learning methods have been gradually incorporated into conventional automatic speech recognition (ASR) frameworks to create acoustic, pronunciation, and language models. Although this has led to significant improvements in ASRs' recognition accuracy, the hard constraints on hardware requirements (e.g., computing power and memory usage) make it unclear whether such approaches are the most computationally and energy-efficient options for embedded ASR applications. Reservoir computing (RC) models (e.g., echo state networks (ESNs) and liquid state machines (LSMs)), on the other hand, have been proven inexpensive to train, have vastly fewer parameters, and are compatible with emergent hardware technologies. However, their performance in speech processing tasks is relatively inferior to that of the deep-learning-based models. To enhance the accuracy of RC in ASR applications, we propose heterogeneous single- and multi-layer ESNs to create non-linear transformations of the inputs that capture temporal context at different scales. To test our models, we performed a speech recognition task on the Farsdat Persian dataset. Since, to the best of our knowledge, standard RC has not yet been employed for any Persian ASR task, we also trained conventional single-layer and deep ESNs to provide baselines for comparison. In addition, we compared the RC performance with a standard long short-term memory (LSTM) model. Heterogeneous RC models (1) show improved performance over the standard RC models; (2) perform on par with the LSTM in terms of recognition accuracy; and (3) reduce the training time considerably.
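    The RC building block is cheap because only the readout is trained; a minimal leaky ESN sketch in Python (illustrative, with fixed random reservoir weights assumed to be pre-scaled for the echo state property):

        import numpy as np

        def esn_states(u, w_in, w_res, leak=0.3):
            # u: (T, d_in) input sequence; reservoir weights stay fixed.
            T, n = u.shape[0], w_res.shape[0]
            x = np.zeros(n)
            states = np.zeros((T, n))
            for t in range(T):
                x = (1 - leak) * x + leak * np.tanh(w_in @ u[t] + w_res @ x)
                states[t] = x
            return states

        def train_readout(states, targets, ridge=1e-6):
            # Ridge regression, the only trained component of an ESN.
            n = states.shape[1]
            return np.linalg.solve(states.T @ states + ridge * np.eye(n),
                                   states.T @ targets)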
    FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. (arXiv:2205.12446v1 [cs.CL])
    We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.
    MEKER: Memory Efficient Knowledge Embedding Representation for Link Prediction and Question Answering. (arXiv:2204.10629v2 [cs.CL] UPDATED)
    Knowledge Graphs (KGs) are symbolically structured storages of facts. KG embeddings contain concise data used in NLP tasks requiring implicit information about the real world. However, the size of KGs that may be useful in actual NLP tasks is enormous, and creating embeddings over them raises memory cost issues. We represent a KG as a 3rd-order binary tensor and move beyond the standard CP decomposition by using a data-specific generalized version of it. The generalization of the standard CP-ALS algorithm allows obtaining optimization gradients without a backpropagation mechanism, which reduces the memory needed in training while providing computational benefits. We propose MEKER, a memory-efficient KG embedding model, which yields SOTA-comparable performance on link prediction tasks and KG-based Question Answering.
    Learn2Agree: Fitting with Multiple Annotators without Objective Ground Truth. (arXiv:2109.03596v2 [cs.LG] UPDATED)
    The annotation of domain experts is important for some medical applications where the objective ground truth is ambiguous to define, e.g., rehabilitation for some chronic diseases, and the prescreening of some musculoskeletal abnormalities without further medical examinations. However, improper use of the annotations may hinder the development of reliable models. On the one hand, forcing the use of a single ground truth generated from multiple annotations is less informative for the modeling. On the other hand, feeding the model with all the annotations without proper regularization is noisy given existing disagreements. To address these issues, we propose a novel Learning to Agree (Learn2Agree) framework to tackle the challenge of learning from multiple annotators without objective ground truth. The framework has two streams, with one stream fitting the multiple annotators and the other stream learning agreement information between annotators. In particular, the agreement learning stream produces regularization information for the classifier stream, tuning its decision to be better in line with the agreement between annotators. The proposed method can be easily added to existing backbones, with experiments on two medical datasets showing better agreement levels with annotators.
    Uncertainty Quantification for Transport in Porous media using Parameterized Physics Informed neural Networks. (arXiv:2205.12730v1 [cs.CE])
    We present a Parameterized Physics-Informed Neural Network (P-PINN) approach to tackle the problem of uncertainty quantification in reservoir engineering problems. We demonstrate the approach with immiscible two-phase flow displacement (the Buckley-Leverett problem) in a heterogeneous porous medium. The reservoir properties (porosity, permeability) are treated as random variables. The distribution of these properties can affect dynamic properties such as the fluid saturation, front propagation speed, or breakthrough time. We explore and use to our advantage the ability of networks to interpolate complex high-dimensional functions. We observe that the additional dimensions resulting from a stochastic treatment of the partial differential equations tend to produce smoother solutions on quantities of interest (distribution parameters), which is shown to improve the performance of PINNs. We show that, provided a proper parameterization of the uncertainty space, PINNs can produce solutions that match closely both the ensemble realizations and the stochastic moments. We demonstrate applications for both homogeneous and heterogeneous fields of properties. We are able to solve problems that can be challenging for classical methods. This approach gives rise to trained models that are both more robust to variations in the input space and able to compete in performance with traditional stochastic sampling methods.
    An Experimental Comparison Between Temporal Difference and Residual Gradient with Neural Network Approximation. (arXiv:2205.12770v1 [cs.LG])
    Gradient descent or its variants are popular in training neural networks. However, in deep Q-learning with neural network approximation, a type of reinforcement learning, gradient descent (also known as Residual Gradient (RG)) is barely used to solve the Bellman residual minimization problem. On the contrary, Temporal Difference (TD), an incomplete gradient descent method, prevails. In this work, we perform extensive experiments to show that TD outperforms RG: when the training leads to a small Bellman residual error, the solution found by TD has a better policy and is more robust against perturbations of the neural network parameters. We further use experiments to reveal a key difference between reinforcement learning and supervised learning: a small Bellman residual error can correspond to a bad policy in reinforcement learning, while the test loss in supervised learning is a standard index of performance. We also empirically show that the missing term in TD is a key reason why RG performs badly. Our work shows that the performance of a deep Q-learning solution is closely related to the training dynamics, and how an incomplete gradient descent method can find a good policy is an interesting question for future study.
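    The difference between the two updates is easiest to see in code; a sketch for deep Q-learning (illustrative; q_net maps states to per-action values, a is an int64 action tensor, and the batch layout is assumed):

        import torch

        def td_loss(q_net, s, a, r, s_next, gamma=0.99):
            # Semi-gradient TD: the bootstrapped target is held constant.
            target = (r + gamma * q_net(s_next).max(dim=1).values).detach()
            q = q_net(s).gather(1, a[:, None]).squeeze(1)
            return (target - q).pow(2).mean()

        def rg_loss(q_net, s, a, r, s_next, gamma=0.99):
            # Residual gradient: the true gradient of the squared Bellman
            # residual, differentiating through the next-state value too.
            target = r + gamma * q_net(s_next).max(dim=1).values
            q = q_net(s).gather(1, a[:, None]).squeeze(1)
            return (target - q).pow(2).mean()

    The only difference is the .detach() on the target, which is exactly the "missing term" the abstract refers to.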
    FBNETGEN: Task-aware GNN-based fMRI Analysis via Functional Brain Network Generation. (arXiv:2205.12465v1 [cs.LG])
    Functional magnetic resonance imaging (fMRI) is one of the most common imaging modalities to investigate brain functions. Recent studies in neuroscience stress the great potential of functional brain networks constructed from fMRI data for clinical predictions. Traditional functional brain networks, however, are noisy and unaware of downstream prediction tasks, while also being incompatible with deep graph neural network (GNN) models. In order to fully unleash the power of GNNs in network-based fMRI analysis, we develop FBNETGEN, a task-aware and interpretable fMRI analysis framework via deep brain network generation. In particular, we formulate (1) prominent region of interest (ROI) feature extraction, (2) brain network generation, and (3) clinical predictions with GNNs, in an end-to-end trainable model under the guidance of particular prediction tasks. Along the way, the key novel component is the graph generator, which learns to transform raw time-series features into task-oriented brain networks. Our learnable graphs also provide unique interpretations by highlighting prediction-related brain regions. Comprehensive experiments on two datasets, i.e., the recently released and currently largest publicly available fMRI dataset Adolescent Brain Cognitive Development (ABCD) and the widely-used fMRI dataset PNC, prove the superior effectiveness and interpretability of FBNETGEN. The implementation is available at https://github.com/Wayfear/FBNETGEN.
    Multi-Head Online Learning for Delayed Feedback Modeling. (arXiv:2205.12406v1 [cs.LG])
    In online advertising, it is highly important to predict the probability and the value of a conversion (e.g., a purchase). It not only impacts user experience by showing relevant ads, but also affects the ROI of advertisers and the revenue of marketplaces. Unlike clicks, which often occur within minutes after impressions, conversions are expected to happen over a long period of time (e.g., 30 days for online shopping). This creates a challenge, as the true labels are only available after long delays. Either inaccurate labels (partial conversions) are used, or models are trained on stale data (e.g., from 30 days ago). The problem is more prominent in online learning, which focuses on the live performance on the latest data. In this paper, a novel solution is presented to address this challenge using multi-head modeling. Unlike traditional methods, it directly quantizes conversions into multiple windows, such as day 1, day 2, day 3-7, and day 8-30. A sub-model is trained specifically on conversions within each window. Label freshness is maximally preserved in early models (e.g., day 1 and day 2), while late conversions are accurately utilized in models with longer delays (e.g., day 8-30). It is shown to greatly exceed the performance of known methods in online learning experiments for both conversion rate (CVR) and value per click (VPC) predictions. Lastly, as a general method for delayed feedback modeling, it can be combined with any advanced ML techniques to further improve the performance.
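    The window quantization is straightforward; a sketch with the windows named in the abstract (the label layout and aggregation are illustrative assumptions):

        import numpy as np

        # Conversion windows from the abstract: day 1, day 2, days 3-7, days 8-30.
        WINDOWS = [(0, 1), (1, 2), (2, 7), (7, 30)]

        def window_labels(delay_days):
            # One binary label per window; NaN means no conversion observed.
            if np.isnan(delay_days):
                return [0] * len(WINDOWS)
            return [int(lo <= delay_days < hi) for lo, hi in WINDOWS]

        def predicted_cvr(head_probs):
            # The windows are disjoint, so the overall 30-day conversion
            # probability is the sum of the per-window head predictions.
            return float(sum(head_probs))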
    DPSNN: A Differentially Private Spiking Neural Network. (arXiv:2205.12718v1 [cs.NE])
    Privacy preservation is a key problem for machine learning algorithms. Spiking neural networks (SNNs) play an important role in many domains, such as image classification, object detection, and speech recognition, but the privacy protection of SNNs urgently needs study. This study combines differential privacy (DP) with SNNs and proposes the differentially private spiking neural network (DPSNN). DP injects noise into the gradient, and SNNs transmit information in discrete spike trains, so that our differentially private SNN can maintain strong privacy protection while still ensuring high accuracy. We conducted experiments on MNIST, Fashion-MNIST, and the face recognition dataset Extended YaleB. When the privacy protection is strengthened, the accuracy of an artificial neural network (ANN) drops significantly, but our algorithm shows little change in performance. Meanwhile, we analyzed different factors that affect the privacy protection of SNNs. Firstly, the less precise the surrogate gradient is, the better the privacy protection of the SNN. Secondly, Integrate-And-Fire (IF) neurons perform better than Leaky Integrate-And-Fire (LIF) neurons. Thirdly, a large time window contributes more to both privacy protection and performance.
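    The DP ingredient described here is gradient noise injection; a DP-SGD style sketch (the standard clip-and-noise recipe, not the paper's SNN-specific pipeline):

        import torch

        def dp_gradient(per_example_grads, clip_norm=1.0, noise_mult=1.0):
            # Clip each example's gradient, sum, then add Gaussian noise
            # scaled to the clipping bound.
            clipped = []
            for g in per_example_grads:  # list of flattened gradient tensors
                scale = min(1.0, clip_norm / (float(g.norm()) + 1e-12))
                clipped.append(g * scale)
            total = torch.stack(clipped).sum(dim=0)
            noise = torch.randn_like(total) * noise_mult * clip_norm
            return (total + noise) / len(per_example_grads)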
    Analytics of Business Time Series Using Machine Learning and Bayesian Inference. (arXiv:2205.12905v1 [cs.LG])
    In this survey, we consider case studies on sales time series forecasting, a deep learning approach to forecasting non-stationary time series using time-trend correction, dynamic price and supply optimization using Q-learning, Bitcoin price modeling, the impact of COVID-19 spread on the stock market, and the use of social network signals in analytics. Throughout, we analyze the use of machine learning and Bayesian inference in predictive analytics.
    A Convergence Theory for Over-parameterized Variational Quantum Eigensolvers. (arXiv:2205.12481v1 [quant-ph])
    The Variational Quantum Eigensolver (VQE) is a promising candidate for quantum applications on near-term Noisy Intermediate-Scale Quantum (NISQ) computers. Despite many empirical studies and recent progress in the theoretical understanding of VQE's optimization landscape, the convergence of VQE optimization is far less understood. We provide the first rigorous analysis of the convergence of VQEs in the over-parameterization regime. By connecting the training dynamics with the Riemannian Gradient Flow on the unit sphere, we establish a threshold on the sufficient number of parameters for efficient convergence, which depends polynomially on the system dimension and the spectral ratio, a property of the problem Hamiltonian, and could be resilient to gradient noise to some extent. We further illustrate that this over-parameterization threshold could be vastly reduced for specific VQE instances by establishing an ansatz-dependent threshold paralleling our main result. We showcase that our ansatz-dependent threshold could serve as a proxy for the trainability of different VQE ansatzes without performing empirical experiments, which hence leads to a principled way of evaluating ansatz design. Finally, we conclude with a comprehensive empirical study that supports our theoretical findings.
    Recipe2Vec: Multi-modal Recipe Representation Learning with Graph Neural Networks. (arXiv:2205.12396v1 [cs.LG])
    Learning effective recipe representations is essential in food studies. Unlike what has been developed for image-based recipe retrieval or learning structural text embeddings, the combined effect of multi-modal information (i.e., recipe images, text, and relation data) receives less attention. In this paper, we formalize the problem of multi-modal recipe representation learning to integrate the visual, textual, and relational information into recipe embeddings. In particular, we first present Large-RG, a new recipe graph data with over half a million nodes, making it the largest recipe graph to date. We then propose Recipe2Vec, a novel graph neural network based recipe embedding model to capture multi-modal information. Additionally, we introduce an adversarial attack strategy to ensure stable learning and improve performance. Finally, we design a joint objective function of node classification and adversarial learning to optimize the model. Extensive experiments demonstrate that Recipe2Vec outperforms state-of-the-art baselines on two classic food study tasks, i.e., cuisine category classification and region prediction. Dataset and codes are available at https://github.com/meettyj/Recipe2Vec.
    Impartial Games: A Challenge for Reinforcement Learning. (arXiv:2205.12787v1 [cs.LG])
    The AlphaZero algorithm and its successor MuZero have revolutionised several competitive strategy games, including chess, Go, and shogi, and video games like Atari, by learning to play these games better than any human and any specialised computer program. Aside from knowing the rules, AlphaZero had no prior knowledge of each game. This dramatically advanced progress on a long-standing AI challenge to create programs that can learn for themselves from first principles. Theoretically, there are well-known limits to the power of deep learning for strategy games like chess, Go, and shogi, as they are known to be NEXPTIME-hard. Some papers have argued that the AlphaZero methodology has limitations and is unsuitable for general AI. However, none of these works has suggested any specific limits for any particular game. In this paper, we provide more powerful bottlenecks than previously suggested. We present the first concrete example of a game - namely the (children's) game of nim - and other impartial games that seem to be a stumbling block for AlphaZero and similar reinforcement learning algorithms. We show experimentally that the bottlenecks apply to both the policy and value networks. Since solving nim can be done in linear time using logarithmic space, i.e., it has very low complexity, our experimental results supersede known theoretical limits based on many games' PSPACE (and NEXPTIME) completeness. We show that nim can be learned on small boards, but when the board size increases, AlphaZero-style algorithms rapidly fail to improve. We quantify the difficulties for various setups, parameter settings and computational resources. Our results might help expand the AlphaZero self-play paradigm by allowing it to use meta-actions during training and/or actual game play, like applying abstract transformations, or reading and writing to an external memory.
    The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training. (arXiv:2205.12502v1 [cs.CV])
    Visual dialog (VisDial) is a task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained the dialog agents solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), to leverage unlabeled images on the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and generates synthetic dialogs regarding the images via multimodal conditional text generation. GST then trains a dialog agent on the synthetic and the original VisDial data. As a result, GST scales the amount of training data up to an order of magnitude larger than VisDial (from 1.2M to 12.9M QA pairs). For robust training on the generated dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on the VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both datasets. We further observe strong performance gains in the low-data regime (up to 9.35 absolute points on NDCG).
    Investigating Information Inconsistency in Multilingual Open-Domain Question Answering. (arXiv:2205.12456v1 [cs.CL])
    Retrieval-based open-domain QA systems use retrieved documents and answer-span selection over those documents to find the best answer candidates. We hypothesize that multilingual Question Answering (QA) systems are prone to information inconsistency when it comes to documents written in different languages, because these documents tend to provide a model with varying information about the same topic. To understand the effects of the biased availability of information and cultural influence, we analyze the behavior of multilingual open-domain question answering models with a focus on retrieval bias. We analyze whether different retriever models present different passages given the same question in different languages on TyDi QA and XOR-TyDi QA, two multilingual QA datasets. We speculate that the content differences in documents across languages might reflect cultural divergences and/or social biases.
    Graph-Based Similarity of Neural Network Representations. (arXiv:2111.11165v2 [cs.LG] UPDATED)
    Understanding the black-box representations in Deep Neural Networks (DNN) is an essential problem in deep learning. In this work, we propose Graph-Based Similarity (GBS) to measure the similarity of layer features. Contrary to previous works that compute the similarity directly on the feature maps, GBS measures the correlation based on the graph constructed with hidden layer outputs. By treating each input sample as a node and the corresponding layer output similarity as edges, we construct the graph of DNN representations for each layer. The similarity between graphs of layers identifies the correspondences between representations of models trained in different datasets and initializations. We demonstrate and prove the invariance property of GBS, including invariance to orthogonal transformation and invariance to isotropic scaling, and compare GBS with CKA. GBS shows state-of-the-art performance in reflecting the similarity and provides insights on explaining the adversarial sample behavior on the hidden layer space.
    The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning. (arXiv:2203.06498v7 [cs.LG] UPDATED)
    Recent arguments that machine learning (ML) is facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences. They also inspire calls for the integration of statistical approaches to causal inference and predictive modeling. A deeper understanding of what reproducibility concerns in supervised ML research have in common with the replication crisis in experimental science puts the new concerns in perspective, and helps researchers avoid "the worst of both worlds," where ML researchers begin borrowing methodologies from explanatory modeling without understanding their limitations and vice versa. We contribute a comparative analysis of concerns about inductive learning that arise in causal attribution as exemplified in psychology versus predictive modeling as exemplified in ML. We identify themes that re-occur in reform discussions, like overreliance on asymptotic theory and non-credible beliefs about real-world data generating processes. We argue that in both fields, claims from learning are implied to generalize outside the specific environment studied (e.g., the input dataset or subject sample, modeling implementation, etc.) but are often impossible to refute due to undisclosed sources of variance in the learning pipeline. In particular, errors being acknowledged in ML expose cracks in long-held beliefs that optimizing predictive accuracy using huge datasets absolves one from having to consider a true data generating process or formally represent uncertainty in performance claims. We conclude by discussing risks that arise when sources of errors are misdiagnosed and the need to acknowledge the role of human inductive biases in learning and reform.
    Trust-based Consensus in Multi-Agent Reinforcement Learning Systems. (arXiv:2205.12880v1 [cs.MA])
    An often neglected issue in multi-agent reinforcement learning (MARL) is the potential presence of unreliable agents in the environment whose deviations from expected behavior can prevent a system from accomplishing its intended tasks. In particular, consensus is a fundamental underpinning problem of cooperative distributed multi-agent systems. Consensus requires different agents, situated in a decentralized communication network, to reach an agreement out of a set of initial proposals that they put forward. Learning-based agents should adopt a protocol that allows them to reach consensus despite having one or more unreliable agents in the system. This paper investigates the problem of unreliable agents in MARL, considering consensus as case study. Echoing established results in the distributed systems literature, our experiments show that even a moderate fraction of such agents can greatly impact the ability of reaching consensus in a networked environment. We propose Reinforcement Learning-based Trusted Consensus (RLTC), a decentralized trust mechanism, in which agents can independently decide which neighbors to communicate with. We empirically demonstrate that our trust mechanism is able to deal with unreliable agents effectively, as evidenced by higher consensus success rates.
    MGX: Near-Zero Overhead Memory Protection for Data-Intensive Accelerators. (arXiv:2004.09679v2 [cs.CR] UPDATED)
    This paper introduces MGX, a near-zero overhead memory protection scheme for hardware accelerators. MGX minimizes the performance overhead of off-chip memory encryption and integrity verification by exploiting the application-specific properties of the accelerator execution. In particular, accelerators tend to explicitly manage data movement between on-chip and off-chip memories. Therefore, the general memory access pattern of an accelerator can largely be determined for a given application. Exploiting these characteristics, MGX generates version numbers used in memory encryption and integrity verification using on-chip accelerator state rather than storing them in the off-chip memory; it also customizes the granularity of the memory protection to match the granularity used by the accelerator. To demonstrate the efficacy of MGX, we present an in-depth study of MGX for DNN and graph algorithms. Experimental results show that on average, MGX lowers the performance overhead of memory protection from 28% and 33% to 4% and 5% for DNN and graph processing accelerators in a wide range of benchmarks, respectively.
    RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning. (arXiv:2205.12548v1 [cs.CL])
    Prompting has shown impressive success in enabling large pretrained language models (LMs) to perform diverse NLP tasks, especially when only few downstream examples are available. Automatically finding the optimal prompt for each task, however, is challenging. Most existing work resorts to tuning soft prompts (e.g., embeddings), which fall short of interpretability, reusability across LMs, and applicability when gradients are not accessible. Discrete prompts, on the other hand, are difficult to optimize and are often created by "enumeration (e.g., paraphrasing)-then-selection" heuristics that do not explore the prompt space systematically. This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt formulates a parameter-efficient policy network that generates the desired discrete prompt after training with reward. To overcome the complexity and stochasticity of reward signals from the large LM environment, we incorporate effective reward stabilization that substantially enhances training efficiency. RLPrompt is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods. Interestingly, the resulting optimized prompts are often ungrammatical gibberish text; and surprisingly, those gibberish prompts are transferable between different LMs and retain significant performance, indicating that LM prompting may not follow human language patterns.
    Towards Understanding Label Regularization for Fine-tuning Pre-trained Language Models. (arXiv:2205.12428v1 [cs.LG])
    Knowledge Distillation (KD) is a prominent neural model compression technique that heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network is put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue has not been investigated in NLP. Therefore, this work studies different label regularization techniques and whether we actually need teacher labels to fine-tune smaller PLM student networks on downstream tasks. In this regard, we conducted a comprehensive set of experiments on different PLMs such as BERT, RoBERTa, and GPT with more than 600 distinct trials, running each configuration five times. This investigation led to a surprising observation that KD and other label regularization techniques do not play any meaningful role over regular fine-tuning when the student model is pre-trained. We further explore this phenomenon in different settings of NLP and computer vision tasks and demonstrate that pre-training itself acts as a kind of regularization, making additional label regularization unnecessary.
    Multimodal active speaker detection and virtual cinematography for video conferencing. (arXiv:2002.03977v3 [eess.AS] UPDATED)
    Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting and zooming a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC system that performs within 0.3 MOS of an expert cinematographer based on subjective ratings on a 1-5 scale. The system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real-time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and to reduce switching latency, the system has no moving parts -- the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques and evaluated on a dataset of N=100 meetings, each 2-5 minutes in length.
    A Tree-Structured Multi-Task Model Recommender. (arXiv:2203.05092v2 [cs.LG] UPDATED)
    Tree-structured multi-task architectures have been employed to jointly tackle multiple vision tasks in the context of multi-task learning (MTL). The major challenge is to determine where to branch out for each task given a backbone model to optimize for both task accuracy and computation efficiency. To address the challenge, this paper proposes a recommender that, given a set of tasks and a convolutional neural network-based backbone model, automatically suggests tree-structured multi-task architectures that could achieve a high task performance while meeting a user-specified computation budget without performing model training. Extensive evaluations on popular MTL benchmarks show that the recommended architectures could achieve competitive task accuracy and computation efficiency compared with state-of-the-art MTL methods. Our tree-structured multi-task model recommender is open-sourced and available at https://github.com/zhanglijun95/TreeMTL.
    Exact Phase Transitions in Deep Learning. (arXiv:2205.12510v1 [cs.LG])
    This work reports deep-learning-unique first-order and second-order phase transitions, whose phenomenology closely follows that in statistical physics. In particular, we prove that the competition between prediction error and model complexity in the training loss leads to the second-order phase transition for nets with one hidden layer and the first-order phase transition for nets with more than one hidden layer. The proposed theory is directly relevant to the optimization of neural networks and points to an origin of the posterior collapse problem in Bayesian deep learning.
    Meta-Learning-Based Robust Adaptive Flight Control Under Uncertain Wind Conditions. (arXiv:2103.01932v3 [cs.RO] UPDATED)
    Realtime model learning proves challenging for complex dynamical systems, such as drones flying in variable wind conditions. Machine learning techniques such as deep neural networks have high representational power but are often too slow to update onboard. On the other hand, adaptive control relies on simple linear parameter models that can update as fast as the feedback control loop. We propose an online composite adaptation method that treats outputs from a deep neural network as a set of basis functions capable of representing different wind conditions. To aid training, meta-learning techniques are used to optimize the network outputs for adaptation. We validate our approach by flying a drone in an open-air wind tunnel under varying wind conditions and along challenging trajectories. We compare the results with other adaptive controllers using different basis function sets and show improvements in tracking and prediction errors.
    Linear Algorithms for Nonparametric Multiclass Probability Estimation. (arXiv:2205.12460v1 [stat.ME])
    Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for $K$-class problems (Wang, Shen and Liu, 2008; Wang, Zhang and Wu, 2019), where $K$ is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in $K$. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in $K$. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate finite sample performance.
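    The One-vs-All structure can be sketched generically as below; the paper's wSVM-based estimators are more involved, and here logistic regression merely stands in for the binary learner, with synthetic data:

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                                   random_state=0)
        K = 3
        scores = np.zeros((len(X), K))
        for k in range(K):                   # one binary problem per class: O(K)
            clf = LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int))
            scores[:, k] = clf.predict_proba(X)[:, 1]
        probs = scores / scores.sum(axis=1, keepdims=True)  # normalize to a simplex
        print(probs[:3].round(3))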
    Memorization in NLP Fine-tuning Methods. (arXiv:2205.12506v1 [cs.CL])
    Large language models are shown to present privacy risks through memorization of training data, and several recent works have studied such risks for the pre-training phase. Little attention, however, has been given to the fine-tuning phase, and it is not well understood how different fine-tuning methods (such as fine-tuning the full model, only the model head, or adapters) compare in terms of memorization risk. This presents increasing concern as the "pre-train and fine-tune" paradigm proliferates. In this paper, we empirically study memorization of fine-tuning methods using membership inference and extraction attacks, and show that their susceptibility to attacks is very different. We observe that fine-tuning the head of the model has the highest susceptibility to attacks, whereas fine-tuning smaller adapters appears to be less vulnerable to known extraction attacks.
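    A simple loss-threshold membership inference attack, a cruder cousin of the attacks used in the paper, can be sketched as follows on synthetic per-example losses (members of the fine-tuning set tend to have lower loss):

        import numpy as np

        rng = np.random.default_rng(0)
        member_loss = rng.gamma(shape=2.0, scale=0.2, size=1000)     # trained-on
        nonmember_loss = rng.gamma(shape=2.0, scale=0.5, size=1000)  # held-out

        losses = np.concatenate([member_loss, nonmember_loss])
        truth = np.concatenate([np.ones(1000), np.zeros(1000)]).astype(bool)
        guess = losses < np.median(losses)   # below-threshold => guess "member"
        print("attack accuracy:", (guess == truth).mean())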
    The Web Is Your Oyster -- Knowledge-Intensive NLP against a Very Large Web Corpus. (arXiv:2112.09924v2 [cs.CL] UPDATED)
    In order to address increasing demands of real-world applications, research on knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge-intensive tasks in which we generalize the background corpus to a universal web snapshot. We investigate a slate of NLP tasks which rely on knowledge - either factual or common sense - and ask systems to use a subset of CCNet - the Sphere corpus - as a knowledge source. In contrast to Wikipedia, otherwise a common background corpus in KI-NLP, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the web. Despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, we find that retrieval from Sphere enables a state-of-the-art system to match and even outperform Wikipedia-based models on several tasks. We also observe that while a dense index can outperform a sparse BM25 baseline on Wikipedia, on Sphere this is not yet possible. To facilitate further research and minimise the community's reliance on proprietary, black-box search engines, we share our indices, evaluation metrics and infrastructure.
    Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs. (arXiv:2202.04579v2 [cs.LG] UPDATED)
    Cellular sheaves equip graphs with a "geometrical" structure by assigning vector spaces and linear maps to nodes and edges. Graph Neural Networks (GNNs) implicitly assume a graph with a trivial underlying sheaf. This choice is reflected in the structure of the graph Laplacian operator, the properties of the associated diffusion equation, and the characteristics of the convolutional models that discretise this equation. In this paper, we use cellular sheaf theory to show that the underlying geometry of the graph is deeply linked with the performance of GNNs in heterophilic settings and their oversmoothing behaviour. By considering a hierarchy of increasingly general sheaves, we study how the ability of the sheaf diffusion process to achieve linear separation of the classes in the infinite time limit expands. At the same time, we prove that when the sheaf is non-trivial, discretised parametric diffusion processes have greater control than GNNs over their asymptotic behaviour. On the practical side, we study how sheaves can be learned from data. The resulting sheaf diffusion models have many desirable properties that address the limitations of classical graph diffusion equations (and corresponding GNN models) and obtain state-of-the-art results in heterophilic settings. Overall, our work provides new connections between GNNs and algebraic topology and would be of interest to both fields.
    Training Language Models with Memory Augmentation. (arXiv:2205.12674v1 [cs.CL])
    Recent work has improved language models remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time, or represent them using a separately trained encoder -- resulting in sub-optimal training of the language model. In this work, we present TRIME, a novel yet simple approach for training language models with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories -- local, long-term, and external memory -- at testing time. We evaluate our approach on multiple language modeling and machine translation benchmarks. We find that simply replacing the vanilla language modeling objective with ours greatly reduces the perplexity, without modifying the model architecture or incorporating extra context (e.g., 18.70 $\to$ 17.76 on WikiText-103). We further augment language models with long-range contexts and external knowledge and demonstrate significant gains over previous memory-augmented approaches.
    lpSpikeCon: Enabling Low-Precision Spiking Neural Network Processing for Efficient Unsupervised Continual Learning on Autonomous Agents. (arXiv:2205.12295v1 [cs.NE])
    Recent advances have shown that SNN-based systems can efficiently perform unsupervised continual learning due to their bio-plausible learning rule, e.g., Spike-Timing-Dependent Plasticity (STDP). Such learning capabilities are especially beneficial for use cases like autonomous agents (e.g., robots and UAVs) that need to continuously adapt to dynamically changing scenarios/environments, where new data gathered directly from the environment may have novel features that should be learned online. Current state-of-the-art works employ high-precision weights (i.e., 32 bit) for both training and inference phases, which pose high memory and energy costs, thereby hindering efficient embedded implementations of such systems for battery-driven mobile autonomous systems. On the other hand, precision reduction may jeopardize the quality of unsupervised continual learning due to information loss. To address this, we propose lpSpikeCon, a novel methodology to enable low-precision SNN processing for efficient unsupervised continual learning on resource-constrained autonomous agents/systems. Our lpSpikeCon methodology employs the following key steps: (1) analyzing the impact of training the SNN model under unsupervised continual learning settings with reduced weight precision on the inference accuracy; (2) leveraging this study to identify SNN parameters that have a significant impact on the inference accuracy; and (3) developing an algorithm for searching for the SNN parameter values that improve the quality of unsupervised continual learning. The experimental results show that lpSpikeCon can reduce the weight memory of the SNN model by 8x (i.e., by judiciously employing 4-bit weights) for online training with unsupervised continual learning, and achieves no accuracy loss in the inference phase compared to the baseline model with 32-bit weights, across different network sizes.
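    The precision reduction at the heart of the method can be illustrated with a uniform 4-bit weight quantizer; the paper's parameter-search algorithm is not reproduced here.

        import numpy as np

        def quantize(w, bits=4):
            levels = 2 ** bits
            lo, hi = w.min(), w.max()
            step = (hi - lo) / (levels - 1)
            q = np.round((w - lo) / step)    # integer codes in [0, levels - 1]
            return q * step + lo             # dequantized weights

        w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
        w4 = quantize(w)
        # 4-bit codes vs. 32-bit floats is the 8x memory reduction cited above.
        print("codes used:", len(np.unique(w4)), "| max error:", np.abs(w - w4).max())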
    Linear Connectivity Reveals Generalization Strategies. (arXiv:2205.12411v1 [cs.LG])
    It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster -- models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.
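    Probing a linear path for a barrier can be sketched in a few lines; here a toy one-dimensional loss with two basins stands in for the finetuned text classifiers studied in the paper:

        import numpy as np

        def loss(theta):
            # Two basins near -2 and +2 separated by a bump: a stand-in test loss.
            return np.minimum((theta + 2) ** 2, (theta - 2) ** 2) + 0.5 * np.exp(-theta ** 2)

        theta_a, theta_b = -2.0, 2.0         # two "trained" models
        alphas = np.linspace(0, 1, 21)
        path = [loss((1 - a) * theta_a + a * theta_b) for a in alphas]
        barrier = max(path) - max(loss(theta_a), loss(theta_b))
        print("loss barrier along the linear path:", round(barrier, 3))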
    FreDo: Frequency Domain-based Long-Term Time Series Forecasting. (arXiv:2205.12301v1 [cs.LG])
    The ability to forecast far into the future is highly beneficial to many applications, including but not limited to climatology, energy consumption, and logistics. However, due to noise or measurement error, it is questionable how far into the future one can reasonably predict. In this paper, we first show mathematically that, due to error accumulation, sophisticated models might not outperform baseline models for long-term forecasting. To demonstrate this, we show that a non-parametric baseline model based on periodicity can achieve performance comparable to a state-of-the-art Transformer-based model on various datasets. We further propose FreDo, a frequency domain-based neural network model that is built on top of the baseline model to enhance its performance and which greatly outperforms the state-of-the-art model. Finally, we validate that the frequency domain is indeed better by comparing univariate models trained in the frequency domain versus the time domain.
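    A periodicity baseline of the kind the paper describes can be as simple as tiling the average cycle of the history; the sketch below assumes the period is known, which is a simplification.

        import numpy as np

        def periodic_baseline(history, period, horizon):
            n_cycles = len(history) // period
            cycles = history[-n_cycles * period:].reshape(n_cycles, period)
            template = cycles.mean(axis=0)   # the average cycle
            reps = int(np.ceil(horizon / period))
            return np.tile(template, reps)[:horizon]

        t = np.arange(500)
        rng = np.random.default_rng(0)
        series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=500)
        print(periodic_baseline(series, period=24, horizon=96)[:5].round(3))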
    Machine learning method for return direction forecasting of Exchange Traded Funds using classification and regression models. (arXiv:2205.12746v1 [q-fin.CP])
    This article proposes and applies a machine learning method to analyze the direction of returns from Exchange Traded Funds (ETFs) using the historical return data of their components, helping to make investment strategy decisions through a trading algorithm. In methodological terms, regression and classification models were applied, using standard datasets from the Brazilian and American markets, together with algorithmic error metrics. The results were analyzed and compared to those of the Na\"ive forecast and to the returns obtained by the buy & hold technique over the same period. In terms of risk and return, the models mostly performed better than the control metrics, with emphasis on the linear regression model and the classification models by logistic regression, support vector machine (using the LinearSVC model), Gaussian Naive Bayes, and K-Nearest Neighbors; on certain datasets, the returns were up to two times, and the Sharpe ratio up to four times, those of the buy & hold control model.
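    The classification side of such a pipeline can be sketched as below; the data is synthetic (component returns, in percent, carrying a weak signal about the next-day ETF direction), not the Brazilian or American market datasets used in the article:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        comp = rng.normal(0, 1.0, size=(1000, 5))             # component returns (%)
        etf_next = 0.5 * comp.mean(axis=1) + rng.normal(0, 0.3, 1000)

        X, y = comp, (etf_next > 0).astype(int)               # predict direction
        split = 800
        clf = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
        print("directional accuracy:", clf.score(X[split:], y[split:]))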
    Rethinking Fano's Inequality in Ensemble Learning. (arXiv:2205.12683v1 [cs.LG])
    We propose a fundamental theory on ensemble learning that evaluates a given ensemble system by a well-grounded set of metrics. Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the accuracy and diversity of models. We revisit the original Fano's inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.
    CGX: Adaptive System Support for Communication-Efficient Deep Learning. (arXiv:2111.08617v4 [cs.DC] UPDATED)
    The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular via hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order-of-magnitude price difference between "cloud-grade" servers with such support and their popular "consumer-grade" counterparts, even though individual server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU single-node training, as well as larger-scale multi-node training. CGX is based on two technical advances: \emph{At the system level}, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication. \emph{At the application level}, it provides \emph{seamless, parameter-free} integration with popular frameworks, so that end-users do not have to modify training recipes or any significant training code. This is complemented by a \emph{layer-wise adaptive compression} technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
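    One representative compression primitive that such a stack can carry is top-k gradient sparsification; the sketch below shows only the idea and is not CGX's adaptive layer-wise scheme:

        import numpy as np

        def topk_compress(grad, ratio=0.01):
            k = max(1, int(ratio * grad.size))
            idx = np.argpartition(np.abs(grad.ravel()), -k)[-k:]  # k largest entries
            return idx, grad.ravel()[idx]    # what a node would transmit

        def topk_decompress(idx, values, shape):
            out = np.zeros(np.prod(shape))
            out[idx] = values
            return out.reshape(shape)

        g = np.random.default_rng(0).normal(size=(256, 256))
        idx, vals = topk_compress(g)
        g_hat = topk_decompress(idx, vals, g.shape)
        print("kept", len(vals), "of", g.size, "entries")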
    Machine learning methods for Schlieren imaging of a plasma channel in tenuous atomic vapor. (arXiv:2205.12731v1 [physics.plasm-ph])
    We investigate the use of a Schlieren imaging setup to measure the geometrical dimensions of a plasma channel in atomic vapor. Near-resonant probe light is used to image the plasma channel in a tenuous vapor, and machine learning techniques are tested for extracting quantitative information from the images. By building a database of simulated signals with a range of plasma parameters for training deep neural networks, we demonstrate that they can reliably and accurately extract from the Schlieren images the location, radius, and maximum ionization fraction of the plasma channel, as well as the width of the transition region between the core of the plasma channel and the unionized vapor. We test several different neural network architectures with supervised learning and show that the parameter estimates supplied by the networks are resilient to slight changes of the experimental parameters that may occur in the course of a measurement.
    Scalable Online Change Detection for High-dimensional Data Streams. (arXiv:2205.12706v1 [cs.LG])
    Detecting changes in data streams is a core objective in their analysis and has applications in, say, predictive maintenance, fraud detection, and medicine. A principled approach to detect changes is to compare distributions observed within the stream to each other. However, data streams often are high-dimensional, and changes can be complex, e.g., only manifest themselves in higher moments. The streaming setting also imposes heavy memory and computation restrictions. We propose an algorithm, Maximum Mean Discrepancy Adaptive Windowing (MMDAW), which leverages the well-known Maximum Mean Discrepancy (MMD) two-sample test, and facilitates its efficient online computation on windows whose size it flexibly adapts. As MMD is sensitive to any change in the underlying distribution, our algorithm is a general-purpose non-parametric change detector that fulfills the requirements imposed by the streaming setting. Our experiments show that MMDAW achieves better detection quality than state-of-the-art competitors.
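    The test statistic at the core of the algorithm is the (unbiased) squared MMD; a plain numpy sketch with an RBF kernel follows, while the adaptive windowing itself is not shown:

        import numpy as np

        def mmd2_unbiased(X, Y, gamma=1.0):
            def k(A, B):
                d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                return np.exp(-gamma * d2)
            m, n = len(X), len(Y)
            Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
            np.fill_diagonal(Kxx, 0.0)       # drop i == j terms (unbiased)
            np.fill_diagonal(Kyy, 0.0)
            return (Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1))
                    - 2.0 * Kxy.mean())

        rng = np.random.default_rng(0)
        X = rng.normal(0.0, 1, size=(200, 10))
        Y = rng.normal(0.3, 1, size=(200, 10))   # shifted distribution
        print("MMD^2:", mmd2_unbiased(X, Y))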
    First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization. (arXiv:2205.12381v1 [cs.LG])
    How can we train an assistive human-machine interface (e.g., an electromyography-based limb prosthesis) to translate a user's raw command signals into the actions of a robot or computer when there is no prior mapping, we cannot ask the user for supervision in the form of action labels or reward feedback, and we do not have prior knowledge of the tasks the user is trying to accomplish? The key idea in this paper is that, regardless of the task, when an interface is more intuitive, the user's commands are less noisy. We formalize this idea as a completely unsupervised objective for optimizing interfaces: the mutual information between the user's command signals and the induced state transitions in the environment. To evaluate whether this mutual information score can distinguish between effective and ineffective interfaces, we conduct an observational study on 540K examples of users operating various keyboard and eye gaze interfaces for typing, controlling simulated robots, and playing video games. The results show that our mutual information scores are predictive of the ground-truth task completion metrics in a variety of domains, with an average Spearman's rank correlation of 0.43. In addition to offline evaluation of existing interfaces, we use our unsupervised objective to learn an interface from scratch: we randomly initialize the interface, have the user attempt to perform their desired tasks using the interface, measure the mutual information score, and update the interface to maximize mutual information through reinforcement learning. We evaluate our method through a user study with 12 participants who perform a 2D cursor control task using a perturbed mouse, and an experiment with one user playing the Lunar Lander game using hand gestures. The results show that we can learn an interface from scratch, without any user supervision or prior knowledge of tasks, in under 30 minutes.
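    The score itself is easy to sketch with discretized commands and transitions; in the toy example below, an "intuitive" interface maps commands to transitions deterministically while a noisy one often ignores them (all data is synthetic):

        import numpy as np
        from sklearn.metrics import mutual_info_score

        rng = np.random.default_rng(0)
        commands = rng.integers(0, 4, size=5000)       # discretized raw commands

        good_iface = commands.copy()                   # consistent mapping
        noise = rng.integers(0, 4, size=5000)
        bad_iface = np.where(rng.random(5000) < 0.7, noise, commands)

        print("intuitive interface MI:", mutual_info_score(commands, good_iface))
        print("noisy interface MI:   ", mutual_info_score(commands, bad_iface))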
    An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation. (arXiv:2205.12753v1 [cs.CV])
    The performance of machine learning models under distribution shift has been the focus of the community in recent years. Most current methods improve robustness to distribution shift from the algorithmic perspective, i.e., designing better training algorithms to help generalization to shifted test distributions. This paper studies the distribution shift problem from the perspective of pre-training and data augmentation, two important factors in the practice of deep learning that have not been systematically investigated by existing work. By evaluating seven pre-trained models, including ResNets and ViTs with self-supervised and supervised training modes, on five important distribution-shift datasets from the WILDS and DomainBed benchmarks, with five different learning algorithms, we provide the first comprehensive empirical study focusing on pre-training and data augmentation. From our empirical results obtained from 1,330 models, we provide the following main observations: 1) ERM combined with data augmentation can achieve state-of-the-art performance if we choose a proper pre-trained model respecting the data property; 2) specialized algorithms further improve robustness on top of ERM when handling a specific type of distribution shift, e.g., GroupDRO for spurious correlation and CORAL for large-scale out-of-distribution data; 3) comparing different pre-training modes, architectures and data sizes, we provide novel observations about pre-training on distribution shift, which sheds light on designing or selecting pre-training strategies for different kinds of distribution shifts. In summary, our empirical study provides a comprehensive baseline for a wide range of pre-training models fine-tuned with data augmentation, which potentially inspires future research exploiting the power of pre-training and data augmentation in distribution shift studies.
    An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems. (arXiv:2205.12755v1 [cs.LG])
    Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state-of-the-art ML models rely on high customization for each task and leverage model size and data scale rather than scaling the number of tasks. Moreover, continual learning, which adds the temporal aspect to multitask learning, is often focused on the study of common pitfalls such as catastrophic forgetting instead of being studied at large scale as a critical component for building the next generation of artificial intelligence. We propose an evolutionary method that can generate a large-scale multitask model and supports the dynamic and continuous addition of new tasks. The generated multitask model is sparsely activated and integrates task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We empirically show that the proposed method can jointly solve and achieve competitive results on 69 image classification tasks, for example achieving the best test accuracy reported for a model trained only on public data for competitive tasks such as CIFAR-10: 99.43%.
    MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge. (arXiv:2205.12660v1 [cs.LG])
    Deep neural network (DNN) latency characterization is a time-consuming process and adds significant cost to Neural Architecture Search (NAS) processes when searching for efficient convolutional neural networks for embedded vision applications. DNN latency is a hardware-dependent metric and requires direct measurement or inference on target hardware. A recently introduced latency estimation technique known as MAPLE predicts DNN execution time on previously unseen hardware devices by using hardware performance counters. Leveraging these hardware counters in the form of an implicit prior, MAPLE achieves state-of-the-art performance in latency prediction. Here, we propose MAPLE-X, which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency to better account for model stability and robustness. First, by identifying DNN architectures that exhibit a similar latency to each other, we can generate multiple virtual examples to significantly improve accuracy over MAPLE. Second, the hardware specifications are used to determine the similarity between training and test hardware, to emphasize training samples captured from comparable devices (domains) and encourage improved domain alignment. Experimental results using a convolutional neural network NAS benchmark across different types of devices, including an Intel processor that is now used for embedded vision applications, demonstrate a 5% improvement over MAPLE and 9% over HELP. Furthermore, we include ablation studies to independently assess the benefits of virtual examples and hardware-based sample importance.
    Additive Logistic Mechanism for Privacy-Preserving Self-Supervised Learning. (arXiv:2205.12430v1 [cs.LG])
    We study the privacy risks that are associated with training a neural network's weights with self-supervised learning algorithms. Through empirical evidence, we show that the fine-tuning stage, in which the network weights are updated with an informative and often private dataset, is vulnerable to privacy attacks. To address the vulnerabilities, we design a post-training privacy-protection algorithm that adds noise to the fine-tuned weights and propose a novel differential privacy mechanism that samples noise from the logistic distribution. Compared to the two conventional additive noise mechanisms, namely the Laplace and the Gaussian mechanisms, the proposed mechanism uses a bell-shaped distribution that resembles the distribution of the Gaussian mechanism, and it satisfies pure $\epsilon$-differential privacy similar to the Laplace mechanism. We apply membership inference attacks on both unprotected and protected models to quantify the trade-off between the models' privacy and performance. We show that the proposed protection algorithm can effectively reduce the attack accuracy to roughly 50\% (equivalent to random guessing) while maintaining a performance loss below 5\%.
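    The mechanism's sampling step is straightforward; the sketch below adds logistic noise to a weight matrix with a placeholder scale (the paper calibrates the scale to the weights' sensitivity and the privacy budget, which is not derived here):

        import numpy as np

        def logistic_mechanism(weights, scale, seed=0):
            noise = np.random.default_rng(seed).logistic(loc=0.0, scale=scale,
                                                         size=weights.shape)
            return weights + noise

        w = np.random.default_rng(1).normal(size=(64, 32)).astype(np.float32)
        w_private = logistic_mechanism(w, scale=0.05)  # scale is illustrative
        print("mean |perturbation|:", np.abs(w_private - w).mean())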
    PLAtE: A Large-scale Dataset for List Page Web Extraction. (arXiv:2205.12386v1 [cs.CL])
    Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier for continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items. PLAtE encompasses two tasks: (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 53,905 items from 6,810 pages, making it the first large-scale list page web extraction dataset. We construct PLAtE by collecting list pages from Common Crawl, then annotating them on Mechanical Turk. Quantitative and qualitative analyses are performed to demonstrate PLAtE has high-quality annotations. We establish strong baseline performance on PLAtE with a SOTA model achieving an F1-score of 0.750 for attribute classification and 0.915 for segmentation, indicating opportunities for future research innovations in web extraction.
    A Universal Error Measure for Input Predictions Applied to Online Graph Problems. (arXiv:2205.12850v1 [cs.DS])
    We introduce a novel measure for quantifying the error in input predictions. The error is based on a minimum-cost hyperedge cover in a suitably defined hypergraph and provides a general template which we apply to online graph problems. The measure captures errors due to absent predicted requests as well as unpredicted actual requests; hence, predicted and actual inputs can be of arbitrary size. We achieve refined performance guarantees for previously studied network design problems in the online-list model, such as Steiner tree and facility location. Further, we initiate the study of learning-augmented algorithms for online routing problems, such as the traveling salesperson problem and dial-a-ride problem, where (transportation) requests arrive over time (online-time model). We provide a general algorithmic framework and we give error-dependent performance bounds that improve upon known worst-case barriers, when given accurate predictions, at the cost of slightly increased worst-case bounds when given predictions of arbitrary quality.
  • Open

    Algorithms for the Communication of Samples. (arXiv:2110.12805v3 [cs.IT] UPDATED)
    The efficient communication of noisy data has applications in several areas of machine learning, such as neural compression or differential privacy, and is also known as reverse channel coding or the channel simulation problem. Here we propose two new coding schemes with practical advantages over existing approaches. First, we introduce ordered random coding (ORC) which uses a simple trick to reduce the coding cost of previous approaches. This scheme further illuminates a connection between schemes based on importance sampling and the so-called Poisson functional representation. Second, we describe a hybrid coding scheme which uses dithered quantization to more efficiently communicate samples from distributions with bounded support.
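    The dithered-quantization ingredient can be illustrated with classical subtractive dithering, where encoder and decoder share the dither through common randomness; this is a textbook sketch, not the paper's full hybrid scheme:

        import numpy as np

        rng = np.random.default_rng(0)       # shared seed = shared dither
        delta = 0.25                         # quantization step
        x = 1.234                            # sample to communicate
        u = rng.uniform(-0.5, 0.5)           # dither known to both sides
        index = np.round(x / delta + u)      # the only value transmitted
        x_hat = delta * (index - u)          # decoder's reconstruction
        print(index, x_hat, abs(x - x_hat) <= delta / 2)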
    A Neural Tangent Kernel Formula for Ensembles of Soft Trees with Arbitrary Architectures. (arXiv:2205.12904v1 [cs.LG])
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although it can have various tree architectures, the theoretical properties of their impact are not well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with infinitely many trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.
    Deletion and Insertion Tests in Regression Models. (arXiv:2205.12423v1 [cs.LG])
    A basic task in explainable AI (XAI) is to identify the most important features behind a prediction made by a black box function $f$. The insertion and deletion tests of \cite{petsiuk2018rise} are used to judge the quality of algorithms that rank pixels from most to least important for a classification. Motivated by regression problems we establish a formula for their area under the curve (AUC) criteria in terms of certain main effects and interactions in an anchored decomposition of $f$. We find an expression for the expected value of the AUC under a random ordering of inputs to $f$ and propose an alternative area above a straight line for the regression setting. We use this criterion to compare feature importances computed by integrated gradients (IG) to those computed by Kernel SHAP (KS). Exact computation of KS grows exponentially with dimension, while that of IG grows linearly with dimension. In two data sets including binary variables we find that KS is superior to IG in insertion and deletion tests, but only by a very small amount. Our comparison problems include some binary inputs that pose a challenge to IG because it must use values between the possible variable levels. We show that IG will match KS when $f$ is an additive function plus a multilinear function of the variables. This includes a multilinear interpolation over the binary variables that would cause IG to have exponential cost in a naive implementation.
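    A toy deletion test makes the mechanics concrete: features are removed from most to least "important" while tracking the black-box output, and the area under that curve scores the ranking. Here f is a small linear model and zeroing stands in for deletion; both are modeling choices.

        import numpy as np

        w = np.array([3.0, -2.0, 1.0, 0.5])
        f = lambda x: x @ w                  # the black-box function
        x = np.ones(4)
        order = np.argsort(-np.abs(w))       # an assumed importance ranking

        curve, x_cur = [f(x)], x.copy()
        for j in order:
            x_cur[j] = 0.0                   # "delete" by zeroing the feature
            curve.append(f(x_cur))
        curve = np.array(curve)
        auc = ((curve[:-1] + curve[1:]) / 2).mean()   # trapezoidal area
        print("deletion curve:", curve, "| AUC:", auc)

    Note that in regression the curve can rise when a negative-weight feature is deleted, which is exactly the subtlety motivating the paper's area-above-a-line alternative.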
    To Impute or not to Impute? Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v3 [stat.ML] UPDATED)
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population in distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.
    Learning Mixtures of Linear Dynamical Systems. (arXiv:2201.11211v2 [stat.ML] UPDATED)
    We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
    Learning Distributions by Generative Adversarial Networks: Approximation and Generalization. (arXiv:2205.12601v1 [cs.LG])
    We study how well generative adversarial networks (GANs) learn probability distributions from finite samples by analyzing the convergence rates of these models. Our analysis is based on a new oracle inequality that decomposes the estimation error of a GAN into the discriminator and generator approximation errors, the generalization error, and the optimization error. To estimate the discriminator approximation error, we establish error bounds on approximating H\"older functions by ReLU neural networks, with explicit upper bounds on the Lipschitz constant of the network or norm constraints on the weights. For the generator approximation error, we show that a neural network can approximately transform a low-dimensional source distribution to a high-dimensional target distribution and bound such approximation error by the width and depth of the neural network. Combining the approximation results with generalization bounds of neural networks from statistical learning theory, we establish the convergence rates of GANs in various settings, when the error is measured by a collection of integral probability metrics defined through H\"older classes, including the Wasserstein distance as a special case. In particular, for distributions concentrated around a low-dimensional set, we show that the convergence rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension.
    Residuals-based distributionally robust optimization with covariate information. (arXiv:2012.01088v2 [math.OC] UPDATED)
    We consider data-driven approaches that integrate a machine learning prediction model within distributionally robust optimization (DRO) given limited joint observations of uncertain parameters and covariates. Our framework is flexible in the sense that it can accommodate a variety of regression setups and DRO ambiguity sets. We investigate asymptotic and finite sample properties of solutions obtained using Wasserstein, sample robust optimization, and phi-divergence-based ambiguity sets within our DRO formulations, and explore cross-validation approaches for sizing these ambiguity sets. Through numerical experiments, we validate our theoretical results, study the effectiveness of our approaches for sizing ambiguity sets, and illustrate the benefits of our DRO formulations in the limited data regime even when the prediction model is misspecified.
    A Kernel Stein Test for Comparing Latent Variable Models. (arXiv:1907.00586v4 [stat.ML] UPDATED)
    We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure.
    Testing for Outliers with Conformal p-values. (arXiv:2104.08279v3 [stat.ME] UPDATED)
    This paper studies the construction of p-values for nonparametric outlier detection, taking a multiple-testing perspective. The goal is to test whether new independent samples belong to the same distribution as a reference data set or are outliers. We propose a solution based on conformal inference, a broadly applicable framework which yields p-values that are marginally valid but mutually dependent for different test points. We prove these p-values are positively dependent and enable exact false discovery rate control, although in a relatively weak marginal sense. We then introduce a new method to compute p-values that are both valid conditionally on the training data and independent of each other for different test points; this paves the way to stronger type-I error guarantees. Our results depart from classical conformal inference as we leverage concentration inequalities rather than combinatorial arguments to establish our finite-sample guarantees. Furthermore, our techniques also yield a uniform confidence bound for the false positive rate of any outlier detection algorithm, as a function of the threshold applied to its raw statistics. Finally, the relevance of our results is demonstrated by numerical experiments on real and simulated data.
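    A marginal conformal p-value is one line of arithmetic; the sketch below scores a test point against calibration nonconformity scores from reference (inlier) data, with a synthetic nonconformity score:

        import numpy as np

        def conformal_pvalue(cal_scores, test_score):
            # Small p-value <=> the test point looks more extreme than calibration.
            return (1 + np.sum(cal_scores >= test_score)) / (len(cal_scores) + 1)

        rng = np.random.default_rng(0)
        reference = rng.normal(0, 1, size=500)
        cal_scores = np.abs(reference)       # nonconformity: distance from 0
        print("inlier  p:", conformal_pvalue(cal_scores, abs(0.3)))
        print("outlier p:", conformal_pvalue(cal_scores, abs(4.2)))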
    Probabilistic model-error assessment of deep learning proxies: an application to real-time inversion of borehole electromagnetic measurements. (arXiv:2205.12684v1 [physics.geo-ph])
    The advent of fast sensing technologies allows for real-time model updates in many applications where the model parameters are uncertain. Bayesian algorithms, such as ensemble smoothers, offer a real-time probabilistic inversion accounting for uncertainties. However, they rely on the repeated evaluation of the computational models, and deep neural network (DNN) based proxies can be useful to address this computational bottleneck. This paper studies the effects of the approximate nature of the deep learned models and associated model errors during the inversion of extra-deep borehole electromagnetic (EM) measurements, which are critical for geosteering. Using a deep neural network (DNN) as a forward model allows us to perform thousands of model evaluations within seconds, which is very useful for quantifying uncertainties and non-uniqueness in real-time. While significant efforts are usually made to ensure the accuracy of the DNN models, it is known that they contain unknown model errors in the regions not covered by the training data. When DNNs are utilized during inversion of EM measurements, the effects of the model errors could manifest themselves as a bias in the estimated input parameters and, consequently, might result in a low-quality geosteering decision. We present numerical results highlighting the challenges associated with the inversion of EM measurements while neglecting model error. We further demonstrate the utility of a recently proposed flexible iterative ensemble smoother in reducing the effect of model bias by capturing the unknown model errors, thus improving the quality of the estimated subsurface properties for geosteering operation. Moreover, we describe a procedure for identifying inversion multimodality and propose possible solutions to alleviate it in real-time.
    EGR: Equivariant Graph Refinement and Assessment of 3D Protein Complex Structures. (arXiv:2205.10390v2 [cs.LG] UPDATED)
    Protein complexes are macromolecules essential to the functioning and well-being of all living organisms. As the structure of a protein complex, in particular its region of interaction between multiple protein subunits (i.e., chains), has a notable influence on the biological function of the complex, computational methods that can quickly and effectively be used to refine and assess the quality of a protein complex's 3D structure can directly be used within a drug discovery pipeline to accelerate the development of new therapeutics and improve the efficacy of future vaccines. In this work, we introduce the Equivariant Graph Refiner (EGR), a novel E(3)-equivariant graph neural network (GNN) for multi-task structure refinement and assessment of protein complexes. Our experiments on new, diverse protein complex datasets, all of which we make publicly available in this work, demonstrate the state-of-the-art effectiveness of EGR for atomistic refinement and assessment of protein complexes and outline directions for future work in the field. In doing so, we establish a baseline for future studies in macromolecular refinement and structure analysis.
    When Is Partially Observable Reinforcement Learning Not Scary? (arXiv:2204.08967v2 [cs.LG] UPDATED)
    Applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system, that is, they act under partial observability of the states, are ubiquitous. Partially observable RL can be notoriously difficult -- well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the existence of large subclasses of POMDPs over which learning is tractable. In this paper we identify such a subclass, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning from interactions in overcomplete POMDPs, where the number of latent states can be larger than the number of observations.
    Low-rank Optimal Transport: Approximation, Statistics and Debiasing. (arXiv:2205.12365v1 [stat.ML])
    The matching principles behind optimal transport (OT) play an increasingly important role in machine learning, a trend which can be observed when OT is used to disambiguate datasets in applications (e.g. single-cell genomics) or used to improve more complex methods (e.g. balanced attention in transformers or self-supervised learning). To scale to more challenging problems, there is a growing consensus that OT requires solvers that can operate on millions, not thousands, of points. The low-rank optimal transport (LOT) approach advocated in \cite{scetbon2021lowrank} holds several promises in that regard, and was shown to complement more established entropic regularization approaches, being able to insert itself in more complex pipelines, such as quadratic OT. LOT restricts the search for low-cost couplings to those that have a low-nonnegative rank, yielding linear time algorithms in cases of interest. However, these promises can only be fulfilled if the LOT approach is seen as a legitimate contender to entropic regularization when compared on properties of interest, where the scorecard typically includes theoretical properties (statistical bounds, relation to other methods) or practical aspects (debiasing, hyperparameter tuning, initialization). We target each of these areas in this paper in order to cement the impact of low-rank approaches in computational OT.
    Fast calculation of Gaussian Process multiple-fold cross-validation residuals and their covariances. (arXiv:2101.03108v2 [stat.ME] UPDATED)
    We generalize fast Gaussian process leave-one-out formulae to multiple-fold cross-validation, highlighting in turn, in broad settings, the covariance structure of cross-validation residuals. The employed approach, which relies on block matrix inversion via Schur complements, is applied to both Simple and Universal Kriging frameworks. We illustrate how the resulting covariances affect model diagnostics and how to properly transform residuals in the first place. Beyond that, we examine how accounting for dependency between such residuals affects cross-validation-based estimation of the scale parameter. It is found in two distinct cases, namely in scale estimation and in broader covariance parameter estimation via pseudo-likelihood, that correcting for covariances between cross-validation residuals leads back to maximum likelihood estimation or to an original variation thereof. The proposed fast calculation of Gaussian process multiple-fold cross-validation residuals is implemented and benchmarked against a naive implementation, all in the R language. Numerical experiments highlight the accuracy of our approach as well as the substantial speed-ups that it enables. Notably, however, as supported by a discussion of the main drivers of computational costs and by a dedicated numerical benchmark, speed-ups decline steeply as the number of folds (say, all sharing the same size) decreases. Overall, our results enable fast multiple-fold cross-validation, have direct consequences for GP model diagnostics, and pave the way to future work on hyperparameter fitting as well as on the promising field of goal-oriented fold design.
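    The leave-one-out special case that the paper generalizes rests on the classical identity $e_i = [K^{-1}y]_i / [K^{-1}]_{ii}$; a Python sketch follows (the paper's own implementation is in R and handles arbitrary folds via Schur complements, which is not shown here):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 5, size=(40, 1))
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

        d2 = (X[:, None, 0] - X[None, :, 0]) ** 2
        K = np.exp(-0.5 * d2) + 0.01 * np.eye(40)     # RBF kernel + noise variance

        K_inv = np.linalg.inv(K)
        loo_residuals = (K_inv @ y) / np.diag(K_inv)  # all n residuals at once
        print(loo_residuals[:5].round(3))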
    Exact Convergence Rates of the Neural Tangent Kernel in the Large Depth Limit. (arXiv:1905.13654v11 [stat.ML] UPDATED)
    Recent work by Jacot et al. (2018) has shown that training a neural network using gradient descent in parameter space is related to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result by establishing that the output of a neural network trained using gradient descent can be approximated by a linear model when the network width is large. Indeed, under regularity conditions, the NTK converges to a time-independent kernel in the infinite-width limit. This regime is often called the NTK regime. In parallel, recent works on signal propagation (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019a) studied the impact of the initialization and the activation function on signal propagation in deep neural networks. In this paper, we connect these two theories by quantifying the impact of the initialization and the activation function on the NTK when the network depth becomes large. In particular, we provide a comprehensive analysis of the convergence rates of the NTK regime to the infinite depth regime.
    Clustering consistency with Dirichlet process mixtures. (arXiv:2205.12924v1 [math.ST])
    Dirichlet process mixtures are flexible non-parametric models, particularly suited to density estimation and probabilistic clustering. In this work we study the posterior distribution induced by Dirichlet process mixtures as the sample size increases, and more specifically focus on consistency for the unknown number of clusters when the observed data are generated from a finite mixture. Crucially, we consider the situation where a prior is placed on the concentration parameter of the underlying Dirichlet process. Previous findings in the literature suggest that Dirichlet process mixtures are typically not consistent for the number of clusters if the concentration parameter is held fixed and data come from a finite mixture. Here we show that consistency for the number of clusters can be achieved if the concentration parameter is adapted in a fully Bayesian way, as commonly done in practice. Our results are derived for data coming from a class of finite mixtures, with mild assumptions on the prior for the concentration parameter and for a variety of choices of likelihood kernels for the mixture.
    Tell me why! Explanations support learning relational and causal structure. (arXiv:2112.03753v3 [cs.LG] UPDATED)
    Inferring the abstract relational and causal structure of the world is a major challenge for reinforcement-learning (RL) agents. For humans, language--particularly in the form of explanations--plays a considerable role in overcoming this challenge. Here, we show that language can play a similar role for deep RL agents in complex environments. While agents typically struggle to acquire relational and causal knowledge, augmenting their experience by training them to predict language descriptions and explanations can overcome these limitations. We show that language can help agents learn challenging relational tasks, and examine which aspects of language contribute to its benefits. We then show that explanations can help agents to infer not only relational but also causal structure. Language can shape the way that agents generalize out-of-distribution from ambiguous, causally-confounded training, and explanations even allow agents to learn to perform experimental interventions to identify causal relationships. Our results suggest that language description and explanation may be powerful tools for improving agent learning and generalization.
    Non-stationary Bandits with Knapsacks. (arXiv:2205.12427v1 [cs.LG])
    In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-arm bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time step, the decision maker/player chooses an arm to play, receives a reward, and consumes a certain amount of each of the multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to some knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or an adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret due to the presence of the constraints, and then we propose a new notion of global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.
    Differentially Private Data Generation Needs Better Features. (arXiv:2205.12900v1 [stat.ML])
    Training even moderately-sized generative models with differentially-private stochastic gradient descent (DP-SGD) is difficult: the required level of noise for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation on public data, then using private data only for "transfer learning." In particular, we minimize the maximum mean discrepancy (MMD) between private target data and the generated distribution, using a kernel based on perceptual features from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images faithfully with $\varepsilon \approx 2$, far surpassing the current state of the art, which only models MNIST and FashionMNIST at $\varepsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models.
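    The "privatize once" step can be sketched directly: with bounded (here, unit-norm) features, the mean embedding has small L2 sensitivity and a single draw of Gaussian noise suffices. The feature map and noise multiplier below are placeholders, not the paper's perceptual features or calibration.

        import numpy as np

        rng = np.random.default_rng(0)
        private_data = rng.normal(size=(1000, 32))

        phi = np.tanh(private_data)                           # stand-in features
        phi /= np.linalg.norm(phi, axis=1, keepdims=True)     # bound sensitivity
        mean_embedding = phi.mean(axis=0)
        sensitivity = 2.0 / len(phi)         # L2 sensitivity of the mean
        sigma = 1.0                          # placeholder noise multiplier
        noisy_embedding = mean_embedding + rng.normal(
            0.0, sigma * sensitivity, size=mean_embedding.shape)
        # The generator is then trained against noisy_embedding; unlike DP-SGD,
        # no further noise is injected during optimization.
        print(np.linalg.norm(noisy_embedding - mean_embedding))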
    Imposing Gaussian Pre-Activations in a Neural Network. (arXiv:2205.12379v1 [cs.LG])
    The goal of the present work is to propose a way to modify both the initialization distribution of the weights of a neural network and its activation function, such that all pre-activations are Gaussian. We propose a family of pairs initialization/activation, where the activation functions span a continuum from bounded functions (such as Heaviside or tanh) to the identity function. This work is motivated by the contradiction between existing works dealing with Gaussian pre-activations: on one side, the works in the line of the Neural Tangent Kernels and the Edge of Chaos are assuming it, while on the other side, theoretical and experimental results challenge this hypothesis. The family of pairs initialization/activation we are proposing will help us to answer this hot question: is it desirable to have Gaussian pre-activations in a neural network?
    Understanding Programmatic Weak Supervision via Source-aware Influence Function. (arXiv:2205.12879v1 [cs.LG])
    Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels, which are in turn used to train an end model. With its increasing popularity, it is critical to have some tool for users to understand the influence of each component (e.g., the source vote or training data) in the pipeline and interpret the end model behavior. To achieve this, we build on Influence Function (IF) and propose source-aware IF, which leverages the generation process of the probabilistic labels to decompose the end model's training objective and then calculate the influence associated with each (data, source, class) tuple. These primitive influence scores can then be used to estimate the influence of individual components of PWS, such as source vote, supervision source, and training data. On datasets of diverse domains, we demonstrate multiple use cases: (1) interpreting incorrect predictions from multiple angles that reveals insights for debugging the PWS pipeline, (2) identifying mislabeling of sources with a gain of 9%-37% over baselines, and (3) improving the end model's generalization performance by removing harmful components in the training objective (13%-24% better than ordinary IF).
    Transportation-Inequalities, Lyapunov Stability and Sampling for Dynamical Systems on Continuous State Space. (arXiv:2205.12448v1 [stat.ML])
    We study the concentration phenomenon for discrete-time random dynamical systems with an unbounded state space. We develop a heuristic approach towards obtaining exponential concentration inequalities for dynamical systems using an entirely functional analytic framework. We also show that existence of exponential-type Lyapunov function, compared to the purely deterministic setting, not only implies stability but also exponential concentration inequalities for sampling from the stationary distribution, via \emph{transport-entropy inequality} (T-E). These results have significant impact in \emph{reinforcement learning} (RL) and \emph{controls}, leading to exponential concentration inequalities even for unbounded observables, while neither assuming reversibility nor exact knowledge of random dynamical system (assumptions at heart of concentration inequalities in statistical mechanics and Markov diffusion processes).
    Learning from time-dependent streaming data with online stochastic algorithms. (arXiv:2205.12549v1 [cs.LG])
    We study stochastic algorithms in a streaming framework, trained on samples coming from a dependent data source. In this streaming framework, we analyze the convergence of Stochastic Gradient (SG) methods in a non-asymptotic manner; this includes various SG methods such as the well-known stochastic gradient descent (i.e., Robbins-Monro algorithm), mini-batch SG methods, together with their averaged estimates (i.e., Polyak-Ruppert averaged). Our results form a heuristic by linking the level of dependency and convexity to the rest of the model parameters. This heuristic provides new insights into choosing the optimal learning rate, which can help increase the stability of SG-based methods; these investigations suggest large streaming batches with slowly decaying learning rates for highly dependent data sources.
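    As a rough illustration of the setup analyzed here, a Polyak-Ruppert-averaged mini-batch SGD loop for streaming least squares; the step-size values are illustrative stand-ins, not the paper's recommendations:

    ```python
    import numpy as np

    def streaming_sgd(stream, dim, lr0=0.5, power=0.66):
        # stream yields mini-batches (X, y) arriving over time, possibly dependent.
        w = np.zeros(dim)
        w_avg = np.zeros(dim)
        for t, (X, y) in enumerate(stream, start=1):
            grad = X.T @ (X @ w - y) / len(y)   # least-squares mini-batch gradient
            w -= lr0 * t ** (-power) * grad     # Robbins-Monro step, slow decay
            w_avg += (w - w_avg) / t            # running Polyak-Ruppert average
        return w_avg
    ```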
    Conformal Prediction Intervals with Temporal Dependence. (arXiv:2205.12940v1 [stat.ML])
    Cross-sectional prediction is common in many domains such as healthcare, including forecasting tasks using electronic health records, where different patients form a cross-section. We focus on the task of constructing valid prediction intervals (PIs) in time-series regression with a cross-section. A prediction interval is considered valid if it covers the true response with (a pre-specified) high probability. We first distinguish between two notions of validity in such a setting: cross-sectional and longitudinal. Cross-sectional validity is concerned with validity across the cross-section of the time series data, while longitudinal validity accounts for the temporal dimension. Coverage guarantees along both these dimensions are ideally desirable; however, we show that distribution-free longitudinal validity is theoretically impossible. Despite this limitation, we propose Conformal Prediction with Temporal Dependence (CPTD), a procedure which is able to maintain strict cross-sectional validity while improving longitudinal coverage. CPTD is post-hoc and light-weight, and can easily be used in conjunction with any prediction model as long as a calibration set is available. We focus on neural networks due to their ability to model complicated data such as diagnosis codes for time-series regression, and perform extensive experimental validation to verify the efficacy of our approach. We find that CPTD outperforms baselines on a variety of datasets by improving longitudinal coverage and often providing more efficient (narrower) PIs.
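    CPTD itself is not reproduced here, but the split-conformal machinery it wraps is short enough to sketch; a minimal cross-sectional prediction interval, assuming a held-out calibration set of forecasts and targets:

    ```python
    import numpy as np

    def split_conformal_pi(cal_preds, cal_targets, test_pred, alpha=0.1):
        # Nonconformity scores on the calibration cross-section.
        scores = np.abs(cal_targets - cal_preds)
        n = len(scores)
        # Finite-sample-corrected quantile for (1 - alpha) coverage.
        level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
        q = np.quantile(scores, level)
        # CPTD additionally re-weights each series' scores using its own
        # past residuals to improve longitudinal coverage (not shown here).
        return test_pred - q, test_pred + q
    ```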
    Machine learning method for return direction forecasting of Exchange Traded Funds using classification and regression models. (arXiv:2205.12746v1 [q-fin.CP])
    This article aims to propose and apply a machine learning method to analyze the direction of returns from Exchange Traded Funds (ETFs) using the historical return data of its components, helping to make investment strategy decisions through a trading algorithm. In methodological terms, regression and classification models were applied, using standard datasets from the Brazilian and American markets, in addition to algorithmic error metrics. The research results were analyzed and compared to those of the Naïve forecast and to the returns obtained by the buy & hold technique over the same period. In terms of risk and return, the models mostly performed better than the control metrics, with emphasis on the linear regression model and the classification models by logistic regression, support vector machine (using the LinearSVC model), Gaussian Naive Bayes and K-Nearest Neighbors, where on certain datasets the returns exceeded those of the buy & hold control model by two times, and the Sharpe ratio by up to four times.
    Mitigating multiple descents: A model-agnostic framework for risk monotonization. (arXiv:2205.12937v1 [math.ST])
    Recent empirical and theoretical analyses of several commonly used prediction procedures reveal a peculiar risk behavior in high dimensions, referred to as double/multiple descent, in which the asymptotic risk is a non-monotonic function of the limiting aspect ratio of the number of features or parameters to the sample size. To mitigate this undesirable behavior, we develop a general framework for risk monotonization based on cross-validation that takes as input a generic prediction procedure and returns a modified procedure whose out-of-sample prediction risk is, asymptotically, monotonic in the limiting aspect ratio. As part of our framework, we propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting, respectively, and show that, under very mild assumptions, they provably achieve monotonic asymptotic risk behavior. Our results are applicable to a broad variety of prediction procedures and loss functions, and do not require a well-specified (parametric) model. We exemplify our framework with concrete analyses of the minimum $\ell_2$, $\ell_1$-norm least squares prediction procedures. As one of the ingredients in our analysis, we also derive novel additive and multiplicative forms of oracle risk inequalities for split cross-validation that are of independent interest.
    Linear Algorithms for Nonparametric Multiclass Probability Estimation. (arXiv:2205.12460v1 [stat.ME])
    Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for $K$-class problems (Wang, Shen and Liu, 2008; Wang, Zhang and Wu, 2019), where $K$ is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in $K$. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in $K$. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate finite sample performance.
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v3 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework targets the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where exact inference of latent variables is typically intractable. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space often lacks physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modeling framework for unsupervised learning and identification of nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential for enhancing and generalizing the predictive capabilities of deep learning-based models.
    Multi-Agent Low-Dimensional Linear Bandits. (arXiv:2007.01442v4 [cs.LG] UPDATED)
    We study a multi-agent stochastic linear bandit with side information, parameterized by an unknown vector $\theta^* \in \mathbb{R}^d$. The side information consists of a finite collection of low-dimensional subspaces, one of which contains $\theta^*$. In our setting, agents can collaborate to reduce regret by sending recommendations across a communication graph connecting them. We present a novel decentralized algorithm, where agents communicate subspace indices with each other and each agent plays a projected variant of LinUCB on the corresponding (low-dimensional) subspace. By distributing the search for the optimal subspace across users and learning of the unknown vector by each agent in the corresponding low-dimensional subspace, we show that the per-agent finite-time regret is much smaller than the case when agents do not communicate. We finally complement these results through simulations.
    Taming Nonconvexity in Kernel Feature Selection -- Favorable Properties of the Laplace Kernel. (arXiv:2106.09387v3 [math.ST] UPDATED)
    Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is that the objective functions of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the \emph{global optima}, which is a mismatch, given that the gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (or other $\ell_2$ kernels), do not possess. Based on a sharp characterization of the gradient of the objective function, we show that $\ell_1$ kernels eliminate unfavorable stationary points that appear when using an $\ell_2$ kernel. Armed with this insight, we establish statistical guarantees for $\ell_1$ kernel-based feature selection which do not require reaching the global minimum. In particular, we establish model-selection consistency of $\ell_1$-kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with $n \sim \log p$ samples.
    Learning the Travelling Salesperson Problem Requires Rethinking Generalization. (arXiv:2006.07054v6 [cs.LG] UPDATED)
    End-to-end training of neural network solvers for graph combinatorial optimization problems such as the Travelling Salesperson Problem (TSP) has seen a surge of interest recently, but remains intractable and inefficient beyond graphs with a few hundred nodes. While state-of-the-art learning-driven approaches for TSP perform closely to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances at practical scales. This work presents an end-to-end neural combinatorial optimization pipeline that unifies several recent papers in order to identify the inductive biases, model architectures and learning algorithms that promote generalization to instances larger than those seen in training. Our controlled experiments provide the first principled investigation into such zero-shot generalization, revealing that extrapolating beyond training data requires rethinking the neural combinatorial optimization pipeline, from network layers and learning paradigms to evaluation protocols. Additionally, we analyze recent advances in deep learning for routing problems through the lens of our pipeline and provide new directions to stimulate future research.
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v2 [stat.ML] UPDATED)
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent "causal bandit" algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
    FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizers by Exploiting Strong Convexity. (arXiv:2104.13790v3 [cs.LG] UPDATED)
    AdaBelief, one of the current best optimizers, demonstrates superior generalization ability compared to the popular Adam algorithm by tracking the exponential moving average of observed gradients. AdaBelief is theoretically appealing in that it has a data-dependent $O(\sqrt{T})$ regret bound when objective functions are convex, where $T$ is a time horizon. It remains, however, an open problem whether the convergence rate can be further improved without sacrificing its generalization ability. To this end, we make a first attempt in this work and design a novel optimization algorithm called FastAdaBelief that aims to exploit strong convexity in order to achieve an even faster convergence rate. In particular, by adjusting the step size to better account for strong convexity and prevent fluctuation, our proposed FastAdaBelief demonstrates excellent generalization ability as well as superior convergence. As an important theoretical contribution, we prove that FastAdaBelief attains a data-dependent $O(\log T)$ regret bound, which is substantially lower than AdaBelief's. On the empirical side, we validate our theoretical analysis with extensive experiments in both scenarios of strong and non-strong convexity on three popular baseline models. The experimental results are very encouraging: FastAdaBelief converges the quickest in comparison to all mainstream algorithms while maintaining excellent generalization ability, in cases of both strong and non-strong convexity. FastAdaBelief is thus posited as a new benchmark model for the research community.
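    For reference, the AdaBelief update that both regret bounds concern is compact; a minimal per-step sketch (FastAdaBelief's strongly convex step-size schedule is not reproduced here):

    ```python
    import numpy as np

    def adabelief_step(theta, grad, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # s tracks the "belief": an EMA of the squared deviation of the
        # gradient from its EMA prediction m.
        m = b1 * m + (1 - b1) * grad
        s = b2 * s + (1 - b2) * (grad - m) ** 2 + eps
        m_hat = m / (1 - b1 ** t)               # bias correction
        s_hat = s / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
        return theta, m, s
    ```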
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v1 [cs.LG])
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs.~gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.
    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v1 [cs.LG])
    Learning causal structure poses a combinatorial search problem that typically involves evaluating structures using a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize the process of causal structure learning. Rather than searching over causal structures directly, we train a variational inference model to predict the causal structure from observational/interventional data. Our inference model acquires domain-specific inductive bias for causal discovery solely from data generated by a simulator. This allows us to bypass both the search over graphs and the hand-engineering of suitable score functions. Moreover, the architecture of our inference model is permutation invariant w.r.t. the data points and permutation equivariant w.r.t. the variables, facilitating generalization to significantly larger problem instances than seen during training. On synthetic data and semi-synthetic gene expression data, our models exhibit robust generalization capabilities under substantial distribution shift and significantly outperform existing algorithms, especially in the challenging genomics domain.
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v2 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other.
    A scalable multi-step least squares method for network identification with unknown disturbance topology. (arXiv:2106.07548v3 [eess.SY] UPDATED)
    Identification methods for dynamic networks typically require prior knowledge of the network and disturbance topology, and often rely on solving poorly scalable non-convex optimization problems. While methods for estimating network topology are available in the literature, less attention has been paid to estimating the disturbance topology, i.e., the (spatial) noise correlation structure and the noise rank in a filtered white noise representation of the disturbance signal. In this work we present an identification method for dynamic networks, in which an estimation of the disturbance topology precedes the identification of the full dynamic network with known network topology. To this end we extend the multi-step Sequential Linear Regression and Weighted Null Space Fitting methods to deal with reduced-rank noise, and use these methods to estimate the disturbance topology and the network dynamics in the full measurement situation. As a result, we provide a multi-step least squares algorithm with parallel computation capabilities that relies only on explicit analytical solutions, thereby avoiding the usual non-convex optimizations involved. Consequently, we consistently estimate dynamic networks of Box-Jenkins model structure, while keeping the computational burden low. We provide a consistency proof that includes path-based data informativity conditions for the allocation of excitation signals in the experimental design. Numerical simulations performed on a dynamic network with reduced-rank noise clearly illustrate the potential of this method.
    Surprises in adversarially-trained linear regression. (arXiv:2205.12695v1 [stat.ML])
    State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is one of the most effective approaches to defend against such examples. We show that for linear regression problems, adversarial training can be formulated as a convex problem. This fact is then used to show that $\ell_\infty$-adversarial training produces sparse solutions and has many similarities to the lasso method. Similarly, $\ell_2$-adversarial training has similarities with ridge regression. We use a robust regression framework to analyze and understand these similarities and also point to some differences. Finally, we show how adversarial training behaves differently from other regularization methods when estimating overparameterized models (i.e., models with more parameters than datapoints). It minimizes a sum of three terms which regularizes the solution, but unlike lasso and ridge regression, it can sharply transition into an interpolation mode. We show that for sufficiently many features or sufficiently small regularization parameters, the learned model perfectly interpolates the training data while still exhibiting good out-of-sample performance.  ( 2 min )
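    The convex reformulation is concrete enough to sketch for the ℓ∞ case: the inner maximization has the closed form (|y_i - x_i^T w| + ε‖w‖₁)², so adversarial training reduces to an ordinary convex minimization. A minimal illustration (the objective is nonsmooth, so the solver choice here is pragmatic rather than principled):

    ```python
    import numpy as np
    from scipy.optimize import minimize

    def adv_train_linreg(X, y, eps):
        # Worst-case L_inf perturbation of size eps inflates each residual
        # by eps * ||w||_1, which already hints at the lasso-like sparsity.
        def loss(w):
            r = np.abs(y - X @ w) + eps * np.abs(w).sum()
            return np.mean(r ** 2)
        return minimize(loss, np.zeros(X.shape[1]), method="L-BFGS-B").x
    ```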
    On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity. (arXiv:2205.12642v1 [stat.ML])
    Most complex machine learning and modelling techniques are prone to over-fitting and may subsequently generalise poorly to future data. Artificial neural networks are no different in this regard and, despite having a level of implicit regularisation when trained with gradient descent, often require the aid of explicit regularisers. We introduce a new framework, Model Gradient Similarity (MGS), that (1) serves as a metric of regularisation, which can be used to monitor neural network training, (2) adds insight into how explicit regularisers, while derived from widely different principles, operate via the same mechanism underneath by increasing MGS, and (3) provides the basis for a new regularisation scheme which exhibits excellent performance, especially in challenging settings such as high levels of label noise or limited sample sizes.  ( 2 min )
    Rethinking Fano's Inequality in Ensemble Learning. (arXiv:2205.12683v1 [cs.LG])
    We propose a fundamental theory on ensemble learning that evaluates a given ensemble system by a well-grounded set of metrics. Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the accuracy and diversity of models. We revisit the original Fano's inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.  ( 2 min )
    Gradient-based explanations for Gaussian Process regression and classification models. (arXiv:2205.12797v1 [cs.LG])
    Gaussian Processes (GPs) have proven themselves as a reliable and effective method in probabilistic Machine Learning. Thanks to recent and current advances, modeling complex data with GPs is becoming more and more feasible. Thus, these types of models are, nowadays, an interesting alternative to Neural and Deep Learning methods, which are arguably the current state-of-the-art in Machine Learning. For the latter, we see an increasing interest in so-called explainable approaches - in essence methods that aim to make a Machine Learning model's decision process transparent to humans. Such methods are particularly needed when illogical or biased reasoning can lead to actual disadvantageous consequences for humans. Ideally, explainable Machine Learning should help detect such flaws in a model and aid a subsequent debugging process. One active line of research in Machine Learning explainability are gradient-based methods, which have been successfully applied to complex neural networks. Given that GPs are closed under differentiation, gradient-based explainability for GPs appears as a promising field of research. This paper is primarily focused on explaining GP classifiers via gradients where, contrary to GP regression, derivative GPs are not straightforward to obtain.  ( 2 min )
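    For GP regression, at least, the gradients are analytic; a minimal saliency sketch for an RBF kernel (the classification case the paper focuses on requires more care because of the non-Gaussian likelihood):

    ```python
    import numpy as np

    def gp_mean_gradient(x_star, X, alpha, lengthscale):
        # Predictive mean mu(x*) = k(x*, X) @ alpha with alpha = (K + s^2 I)^{-1} y.
        # For an RBF kernel, dk/dx* = -(x* - x_i) / l^2 * k(x*, x_i).
        diff = x_star - X                                    # (n, d)
        k = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / lengthscale ** 2)
        return -(diff / lengthscale ** 2).T @ (alpha * k)    # gradient, shape (d,)
    ```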
    Deep interpretable ensembles. (arXiv:2205.12729v1 [stat.ML])
    Ensembles improve prediction performance and allow uncertainty quantification by aggregating predictions from multiple models. In deep ensembling, the individual models are usually black box neural networks, or recently, partially interpretable semi-structured deep transformation models. However, interpretability of the ensemble members is generally lost upon aggregation. This is a crucial drawback of deep ensembles in high-stake decision fields, in which interpretable models are desired. We propose a novel transformation ensemble which aggregates probabilistic predictions with the guarantee to preserve interpretability and yield uniformly better predictions than the ensemble members on average. Transformation ensembles are tailored towards interpretable deep transformation models but are applicable to a wider range of probabilistic neural networks. In experiments on several publicly available data sets, we demonstrate that transformation ensembles perform on par with classical deep ensembles in terms of prediction performance, discrimination, and calibration. In addition, we demonstrate how transformation ensembles quantify both aleatoric and epistemic uncertainty, and produce minimax optimal predictions under certain conditions.  ( 2 min )
    Multimodal active speaker detection and virtual cinematography for video conferencing. (arXiv:2002.03977v3 [eess.AS] UPDATED)
    Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting and zooming of a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC that performs within 0.3 MOS of an expert cinematographer based on subjective ratings with a 1-5 scale. This system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real-time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and reduce switching latency the system has no moving parts -- the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques and evaluated on a dataset with N=100 meetings, each 2-5 minutes in length.  ( 2 min )

  • Open

    How to start reinforcement learning as an electrical and electronic control engineer?
    submitted by /u/Ibrahim_Attawil [link] [comments]  ( 1 min )
    Sneak Peek of Animo Island, a PC game that empowers players to explore reinforcement learning as a game mechanic - Currently looking for play testers for an upcoming beta release!
    submitted by /u/AnimoIsland [link] [comments]  ( 1 min )
    What universities are hubs for reinforcement learning research?
    I am currently working through Reinforcement Learning by Sutton and Barto. It is a great book, but since RL is such a sub-specialty of machine learning, I am wondering which universities have research being done in the field and not just machine learning in general? Where are top papers coming from? submitted by /u/caedin8 [link] [comments]  ( 1 min )
    [D] confusion about KL divergence in PPO (implementation confusion)
    Hi all, I am following this workflow for PPO using Keras (the PPO Keras example) and I have come across something that doesn't make sense to me; any help and clarification is appreciated. In this line of code, when calculating the ratio:

    ```python
    def train_policy(observation_buffer, action_buffer, logprobability_buffer, advantage_buffer):
        with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
            ratio = tf.exp(
                logprobabilities(actor(observation_buffer), action_buffer)
                - logprobability_buffer
            )
    ```

    I am confused about the subtraction of the logprobability buffer. The line that precedes this is the collection of information from the buffer:

    ```python
    # Get values from the buffer
    (
        observation_buffer,
        action_buffer,
        advantage_buffer,
        return_buffer,
        logprobability_buffer,
    ) = buffer.get()
    ```

    I feel like there is some tautology here, although I am probably missing something very obvious (with some misunderstanding of KL divergence/importance sampling, I'm sure). The logprobabilities function takes the logits from the actor model, carries out a softmax over those logits, takes the action that was selected, and returns the log-probability of that action. But in the train step, we take the observations and actions from the buffer (the same ones for the epoch that has just been run), call logprobabilities to get a batch of log-probs for each state-action pair, and then subtract the log-probabilities from the buffer. However, the log-probabilities from the buffer ARE the same log-probabilities that the actor model will spit out, as they are the same observations and actions that led to the log-probs being inserted into the buffer; and the weights have not been updated by this point, so the softmax will be identical. What is the point of subtracting the same log-probabilities from the log-probability buffer? I don't see here the traditional 'old' vs. 'new' policy. Thank you! submitted by /u/amjass12 [link] [comments]  ( 2 min )
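    For what it's worth, the usual resolution to this confusion: the ratio is indeed exactly 1 on the first inner gradient step, but PPO reuses the same batch for several update epochs, and the buffered log-probs stay frozen at their rollout values while the actor's current log-probs change after every step. A toy numerical illustration (not the Keras example's code):

    ```python
    import numpy as np

    # The buffered log-probs are frozen at collection time; the policy's
    # log-probs change after every inner gradient step on the same batch.
    logp_old = np.log(np.array([0.25, 0.50]))   # stored in the buffer at rollout
    logp_new = np.log(np.array([0.25, 0.50]))   # first pass: weights unchanged
    print(np.exp(logp_new - logp_old))          # -> [1. 1.]   ratio starts at 1

    logp_new = np.log(np.array([0.30, 0.45]))   # after one policy update
    print(np.exp(logp_new - logp_old))          # -> [1.2 0.9] clipping now matters
    ```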
    Do you guys have any resources/tutorials on how to implement your own RL environments/agents?
    I'm new to RL and I'm trying to use it for a classification task ... yet I'm not quite sure how I should initialize the environment and agents. Any recommendations? submitted by /u/Affectionate_Worth43 [link] [comments]  ( 1 min )
    What are the types of tasks we can perform given an environment in MA Reinforcement Learning?
    Hello, I am new to reinforcement learning and I need to find a game environment and make agents cooperate in an optimal way. I found some environments (PettingZoo, ML-Agents Soccer-Twos..). I read about RL algorithms such as MADDPG and my intuition was to integrate such algorithms into some environments I found. I want to know if this task is interesting, and any suggestions for tasks we can execute on MARL envs. submitted by /u/Ok_Lab_2750 [link] [comments]  ( 1 min )
    "HyperTree Proof Search for Neural Theorem Proving", Lemple et al 2022 {FB} (56% -> 65% MetaMath proofs)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    [P] Scale ML experiments from JupyterLab to the cloud
    First Medium article is out! Come see how Optumi is thinking about the shifting workflow needs of data science and machine learning professionals. https://medium.com/@optumi/scale-ml-experiments-from-jupyterlab-to-the-cloud-141bd645d8e9 submitted by /u/chrismarrie [link] [comments]  ( 1 min )
    [D] Google Imagen authors now produce images based on your prompt!
    If you are interested in getting your text converted to an image by Google Brain Imagen use the following link: https://twitter.com/mo_norouzi/status/1529497457234780162?s=20&t=3K_M972bMeGRR2wG6kobHQ submitted by /u/aifordummies [link] [comments]
    [D] From classification to regression and some physics analogies
    Hello, Here is: https://www.researchgate.net/publication/360541388_From_Classification_to_Regression_A_Note_on_Deodata_Predictors The paper describes how to adapt a set of classification algorithms in order to perform nonlinear regression. The algorithms are described with simple numerical examples. In the "Field Sampling Density" section, the described operation is akin to estimating the strength of a field. I am interested in your opinions. Thanks. submitted by /u/crispub [link] [comments]  ( 1 min )
    [N] Pull Requests and Discussions on Hugging Face
    Hey, it's Merve from Hugging Face 👋 I wanted to share some big news that I hope you find useful. The 🤗 Hub now has pull requests (PR) and discussions in repositories to improve collaboration in machine learning 🥳✨ What does PR really mean here? Let’s assume you have a big PyTorch model and someone else ported it to TensorFlow, that person can contribute that to your model repository. Someone else can open a PR to improve your model, fix your machine learning demo in a Space or change anything in the dataset. This applies for model/Space/dataset (any repo) repositories on the hub. You might say this sounds familiar to GitHub. For code, GitHub works super well and we don’t want to (and it would be very inefficient to) recreate the feature set of GitHub. What we want to focus on is creating the collaboration toolset that’s optimized for ML. You can learn more about these new features here: https://huggingface.co/blog/community-update. Looking forward to your feedback and suggestions! ✨ Hope this is useful 🙂 submitted by /u/unofficialmerve [link] [comments]  ( 2 min )
    [D] PyTorch processes taking up tons of GPU memory - any way to reduce this?
    I am running on Arch Linux 5.17.9-arch1-1 with an NVIDIA GeForce RTX 3090 GPU. I need to run multiple processes for a reinforcement learning task, where each subprocess runs the data collection (and inference) and all the samples from that are then retrieved via queues in the main process and optimized (e.g. think of PPO but distributed, like IMPALA). I am using torch.multiprocessing for this. Unfortunately, the multiple spawned subprocesses cause A LOT of overhead in terms of GPU memory being used. See below for my nvidia-smi output:
    |   0   N/A  N/A    559025      C   ...3/envs/ml/bin/python    1873MiB |
    |   0   N/A  N/A    559026      C   ...3/envs/ml/bin/python    1873MiB |
    |   0   N/A  N/A    559027      C   ...3/envs/ml/bin/python    1873MiB |
    |   0   N/A  N/A    559028      C   ...3/envs/ml/bin/python    1873MiB |
    |   0   N/A  N/A    559029      C   ...3/envs/ml/bin/python    1873MiB |
    |   0   N/A  N/A    559030      C   ...3/envs/ml/bin/python    1873MiB |
    |   0   N/A  N/A    559031      C   ...3/envs/ml/bin/python    1873MiB |
    |   0   N/A  N/A    559032      C   ...3/envs/ml/bin/python    1873MiB |
    |   0   N/A  N/A    559033      C   ...3/envs/ml/bin/python    1873MiB |
    So it seems like each subprocess loads the entirety of all of PyTorch into GPU memory, which seems incredibly inefficient. Is there a way to get the subprocesses to only load this once and then share it? How can I reduce the GPU footprint for each process? EDIT: Even using a basic example from the PyTorch repo I can see the same problem: because it's not forking, it seems to be using up tons of GPU memory for each process. Can this not be fixed? submitted by /u/tmuxed [link] [comments]  ( 2 min )
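    A note on the likely cause: most of that ~1.9 GiB per process is the CUDA context plus the kernel images each process loads, and that cost cannot be shared across processes. A common workaround, sketched below on the assumption that rollout inference can tolerate CPU speed, is to keep the workers off the GPU entirely so only the learner pays the context cost (names and shapes are illustrative):

    ```python
    import torch
    import torch.multiprocessing as mp

    def worker(shared_model, queue):
        # CPU-only inference: the worker never touches CUDA, so it never
        # allocates its own ~1-2 GiB CUDA context.
        with torch.no_grad():
            obs = torch.randn(1, 4)            # stand-in for an env observation
            queue.put((obs, shared_model(obs)))

    if __name__ == "__main__":
        mp.set_start_method("spawn")
        model = torch.nn.Linear(4, 2)          # stand-in for the policy network
        model.share_memory()                   # workers read the learner's CPU weights
        q = mp.Queue()
        p = mp.Process(target=worker, args=(model, q))
        p.start()
        print(q.get())
        p.join()
    ```

    The learner process can then keep a GPU copy for optimization and periodically copy updated weights back into the shared CPU model.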
    [P] ZenML: Build vendor-agnostic, production-ready MLOps pipelines
    Hello r/MachineLearning! Some here might remember we open-sourced ZenML, a year or so ago, and started building it out in the open. Today, we're re-launching it to the world, with a brand-new look, and a sharper focus. We've spoken to hundreds of ML teams in the last year, and here is what we've found:
    🐘 Getting ML into production reliably is still hard today. It takes too long, is too complicated, and not enough people know how to do it.
    🦡 MLOps platforms are not the answer because they are opinionated, rigid, and slow to change. It's time for MLOps frameworks to shine, and bring structure to the ML ecosystem that is ripe for standardization.
    🐼 Well-thought-out abstractions that make sense and are flexible are what the industry needs.
    Our launch blog post, "The Framework Way is t…  ( 2 min )
    [P] Looking for advice how to apply clustering to learned embeddings of user-item interactions
    Hi, I'm working on a consumer segmentation job, where the goal is to understand if there are subgroups of consumers who behave in a similar way to each other, but different from the rest of the population. My dataset contains a user interacting with an item for a period of time, with a reasonable assumption that more time spent means a more favorable view of the item (so we can take the time spent as a proxy for the user's taste). So far, I've created an ML model to learn embeddings of size 64 to represent my approximately 950k users and their interactions with the ~10k items. My original plan was to apply k-means clustering to those 64-dimensional user embeddings. However, this approach isn't yielding the degree of separation I require (e.g. the top most popular items to interact with are all the same ones for each cluster). Trying different values for k, I also get a basically entirely flat elbow graph. How should I proceed from here? I've thought of 2 options:
    - retraining my embeddings with a smaller size and retrying with k-means
    - researching an alternative clustering algorithm
    Is there anything I haven't considered yet, but should? If not, which of these 2 approaches would you explore first, and if you prefer the latter, which algorithm(s) would you test out first? Thanks for your help! submitted by /u/the_Wallie [link] [comments]  ( 2 min )
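    One cheap experiment to try before retraining, on the assumption that the direction of a taste vector matters more than its magnitude: L2-normalize the embeddings so that k-means effectively clusters by cosine similarity; a density-based method such as HDBSCAN is a reasonable second option, since it does not force every user into a cluster. A minimal sketch (illustrative, not a recommendation tuned to this dataset):

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def cosine_kmeans(embeddings, k, seed=0):
        # On the unit sphere, k-means' Euclidean objective is monotone in
        # cosine distance, so this clusters by taste direction, not magnitude.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        unit = embeddings / np.maximum(norms, 1e-12)
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(unit)
    ```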
    [D] Different input image size when using Visual Transformers
    I have an image classification problem, and have been using ResNet. The dense layers at the end are replaced with 1x1 convolutions, making the model fully convolutional. Classification is done on 128 x 128 patches, so if the input image is 128 x 128, I'll get output size 1x1. If the image is 512 x 512, the output will be 4x4. Each output element will hold the prediction of which class the patch at that position belongs to. Now I'd like to try using a transformer instead of ResNet. Can a similar thing be done with Vision Transformers? Are there any examples of that being done? submitted by /u/alkibijad [link] [comments]  ( 1 min )
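    The standard ingredient for changing a ViT's input size is interpolating the learned position embeddings so the same weights accept a different grid of patches (timm and the original ViT codebase do a version of this). A minimal sketch, assuming a learned embedding with a leading [CLS] token:

    ```python
    import torch
    import torch.nn.functional as F

    def resize_pos_embed(pos, old_grid, new_grid):
        # pos: (1, 1 + old_grid**2, dim), with the [CLS] token's embedding first.
        cls_tok, patch_pos = pos[:, :1], pos[:, 1:]
        dim = pos.shape[-1]
        # Reshape to a 2D grid, bicubically resample, then flatten back.
        patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
        patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                                  mode="bicubic", align_corners=False)
        patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
        return torch.cat([cls_tok, patch_pos], dim=1)
    ```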
    [P] Second-tier Recommender System in FunCorp
    Matrix decomposition is not perfect for improving recommendation systems; for example, you will find it hard to add the gender and age of a user. In this article, we describe how we implemented a second, ranking level of the model above the collaborative one, and how two-stage recommendation systems help us apply more complex algorithms. https://medium.com/@FunCorp/putting-a-two-layered-recommendation-system-into-production-b8caaf61393d submitted by /u/Puzzleheaded_Egg_396 [link] [comments]
    [D] Does TensorFlow Lite use the DropIT method to handle intermediate tensors?
    This blog post (Optimizing TensorFlow Lite Runtime Memory) says that TensorFlow Lite employs different approaches to handle intermediate tensors which occupy large amounts of memory. Is one of them DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training method? submitted by /u/teraRockstar [link] [comments]  ( 1 min )
    [D] ISTM results are out - First International Symposium on the Tsetlin Machine
    ISTM Technical Program: you can find the full technical program here: https://istm.no/program/ submitted by /u/olegranmo [link] [comments]  ( 1 min )
    [P] Image Background Changer : You can change background to whatever you want.
    This project was made using the rembg package, which performs image segmentation with U^2-Net. https://reddit.com/link/uxathe/video/chhv2eewbk191/player submitted by /u/supercornson [link] [comments]  ( 1 min )
    [Discussion] Best Affordable way for doing ML online (Colab Pro, etc)
    I have exhausted my free GCP credits and I was wondering what are the best (affordable) ways to do machine learning online. Colab Pro (10 USD per month), Kaggle, Paperspace Gradient, etc. come to mind. Any thoughts on which is the best? PS: I have used the Colab free version and Kaggle before - the session timeouts that lead you to re-run the notebook from the beginning are the worst experience ever. Also, the only information I could find about Gradient was from the founder, which obviously is not very reliable. Has anybody else used it? submitted by /u/BigNet1356 [link] [comments]  ( 4 min )
    [P] Serverless model training platform for any cloud. Alternative to SageMaker
    Hi all, we've been working on building a serverless model training platform that can run on any cloud provider. We have a beta version of the platform currently working on AWS and Azure. Looking forward to your feedback, and to hearing whether this is something enterprise users would use. We are at https://cloud.netbook.ai/ submitted by /u/scb_11 [link] [comments]  ( 1 min )
    [P] fastdup: tool for curating computer vision datasets at scale
    https://github.com/visualdatabase/fastdup submitted by /u/gradientflow [link] [comments]
  • Open

    Deep Learning with Label Differential Privacy
    Posted by Pasin Manurangsi and Chiyuan Zhang, Research Scientists, Google Research Over the last several years, there has been an increased focus on developing differential privacy (DP) machine learning (ML) algorithms. DP has been the basis of several practical deployments in industry — and has even been employed by the U.S. Census — because it enables the understanding of system and algorithm privacy guarantees. The underlying assumption of DP is that changing a single user’s contribution to an algorithm should not significantly change its output distribution. In the standard supervised learning setting, a model is trained to make a prediction of the label for each input given a training set of example pairs {[input1,label1], …, [inputn, labeln]}. In the case of deep learning, previ…  ( 8 min )
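    A minimal sketch of the simplest mechanism in this space, classic K-ary randomized response applied to labels only; any refinements the post describes (such as exploiting a prior over labels) are not reproduced here:

    ```python
    import numpy as np

    def randomized_response(labels, num_classes, epsilon):
        # Keep the true label with probability p, else output a uniformly
        # random *other* class; this satisfies epsilon-label-DP for
        # p = e^eps / (e^eps + K - 1).
        labels = np.asarray(labels)
        p = np.exp(epsilon) / (np.exp(epsilon) + num_classes - 1)
        flip = np.random.rand(len(labels)) > p
        other = np.random.randint(num_classes - 1, size=len(labels))
        other = other + (other >= labels)      # skip over the true label
        return np.where(flip, other, labels)
    ```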
  • Open

    GA / Evolution noobee question
    Say I have a hundred NNs, of which 90 perform badly and ten perform "ok" at playing a game. Now how do I seed new NNs from these ten? What "makes" a particular NN is its weights. So how do I now generate new NNs from these ten with random mutations? How do I even know that there is something like an "optimal" NN that can actually play the game? With a math background, I think there must be something like a "solution space", and it's not clear to me that a NN obtained through random mutations of the weights is even in a solution space. Thanks! submitted by /u/CantFixMoronic [link] [comments]  ( 1 min )
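    Concretely, "seeding" new networks usually means copying a top performer's flattened weight vector and perturbing it, optionally recombining two parents first. A minimal sketch (hyperparameter values are illustrative):

    ```python
    import numpy as np

    def crossover(parent_a, parent_b):
        # Uniform crossover: each weight is inherited from a random parent.
        pick = np.random.rand(parent_a.size) < 0.5
        return np.where(pick, parent_a, parent_b)

    def mutate(weights, sigma=0.05, rate=0.1):
        # Gaussian mutation: nudge a random ~10% of the weights.
        child = weights.copy()
        mask = np.random.rand(child.size) < rate
        child[mask] += sigma * np.random.randn(int(mask.sum()))
        return child

    # Seed a new generation from the ten survivors (flattened weight vectors).
    survivors = [np.random.randn(128) for _ in range(10)]   # stand-ins
    a, b = np.random.randint(10, size=2)
    offspring = mutate(crossover(survivors[a], survivors[b]))
    ```

    There is no guarantee an optimal network exists in the search space; the practical assumption is only that fitness is not flat in the weights, so selection plus mutation performs a noisy hill-climb over that space.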
    Which type of learning is used in an evolutionary simulation?
    I'm creating an evolutionary simulator. I have creatures that I want to teach how to gather food. My creatures have a neural network brain, they have various input values for things such as location, proximity to food etc. They have 2 output neurons - moveX and moveY. Their brains are represented using a genetic code. Upon procreation with another creature, the genes from both parents are randomly inherited and mutations can occur. After a few thousand cycles, the creatures successfully learn how to gather food in the most efficient way possible. What type of learning would this fall into? I wouldn't say it is supervised learning since nobody is labeling the data, nor am I calculating some fitness value. Their survival is determined purely by the simulation conditions. I would guess that this falls under reinforcement learning, but I'm not sure, it might also be unsupervised learning. Can you help me find the proper terminology? submitted by /u/Log_Dogg [link] [comments]  ( 1 min )
    How to Optimize your HuggingFace Transformers
    submitted by /u/aidev2040 [link] [comments]
    A grounded deep symbolic neural network for perception
    Here is a paper titled "A Grounded Deep Symbolic Neural Network for Perception" that I am planning to submit to the NeSy 2022 workshop. It is one of a number of workshops in the Second International Conference on Learning and Reasoning (IJCLR 2022) in Windsor, UK on 28th – 30th September. Papers are due by May 31st. I would appreciate any constructive feedback you would like to make. https://www.adaptroninc.com/sites/default/files/inline-files/Grounded_Deep_Symbolic_Neural_Network_for_comments.pdf Abstract Both animals and artificial intelligent agents rely upon the identification of types of objects and events during perception. It is a categorization process, which senses, recognizes and encodes invariant features. Binary neurons (binons) are general-purpose artificial neural nodes for representing properties, objects, events and relationships between them. Non-symbolic binons are used in short-term memory to represent core sensory properties such as position, intensity and time and ones derived from them. Ratios derived from these properties are converted into invariant symbolic categories using a novel discretization algorithm based on the Weber-Fechner psychophysical laws. Symbolic binons are combined to form deep hierarchical neural networks that comprise long-term memory. It contains spatial and temporal binons representing the shape and contrast patterns for categories of objects and events. They are grounded on the core and derived properties. Empirical evidence of their successful use in classifying handwritten digits was provided by Martensen in 2013[1]. The neural networks are 100% symbolic, transparent, compositional, scalable and sparse. Learning is continuous and unsupervised. submitted by /u/BrettNMartensen [link] [comments]  ( 1 min )
  • Open

    Is diversity the key to collaboration? New AI research suggests so
    A new training approach yields artificial intelligence that adapts to diverse play-styles in a cooperative game, in what could be a win for human-AI teaming.  ( 6 min )
    Early sound exposure in the womb shapes the auditory system
    Modeling study suggests that the muffled environment in utero primes the brain’s ability to interpret some types of sound.  ( 7 min )
  • Open

    Hugging Face now allows Pull Requests and Discussions on repositories
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    Microsoft announces 3 months of unlimited Codex access and free tokens for OpenAI's API
    submitted by /u/Wireless_Life [link] [comments]  ( 1 min )
    How to Optimize your HuggingFace Transformers
    submitted by /u/aidev2040 [link] [comments]
    Artificial Intelligence Implications: The Future of Digital Marketing | Hey Everyone, This is my final post in a university blog series surrounding AI, The Future, and a personal area of interest! Would love some feedback and advice for future blog-style posts!
    submitted by /u/RvZz11 [link] [comments]  ( 1 min )
    GitHub - A complete guide to start and improve in machine learning (ML), artificial intelligence (AI) in 2022 without ANY background in the field and stay up-to-date with the latest news and state-of-the-art techniques!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    I got GPT-3 to write a novel
    submitted by /u/DavidKShapiro [link] [comments]  ( 1 min )
    DeepMind Researchers Develop A Machine Learning Technique For Accurate Sampling And Free-Energy Estimate Of Solid Materials Using Normalizing Flows
    A significant challenge of computational statistical mechanics is the accurate estimation of equilibrium parameters of a thermodynamic system. For decades, the methods of choice for sampling such systems at large have been molecular dynamics (MD) and hybrid Monte Carlo. Strategies for sampling probability distributions have proliferated, and many now leverage normalizing flows. Normalizing flows are a technique for creating complicated distributions by transforming a probability density through a sequence of invertible mappings. They are desirable because of two characteristics: first, they can create independent samples rapidly and in parallel, and second, they can provide the exact probability density of the samples they produce. Training a flow-based model to approximate a target distribution yields an efficient but approximate sampler, and re-weighting the samples by their probability density can then be used to remove estimation bias for free energy estimation. The exciting part about flows is that they allow us to obtain accurate estimates even without samples from thermodynamic states. Continue Reading Paper: https://iopscience.iop.org/article/10.1088/2632-2153/ac6b16/pdf Github: https://github.com/deepmind/flows_for_atomic_solids submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    [Project] PaddleSpeech: An Easy-to-use Speech Toolkit including SOTA/Streaming ASR with punctuation, influential TTS with text frontend, and the VPR System.
    Hi all, glad to share an open-source repository, PaddleSpeech, which provides SOTA/streaming ASR with punctuation, influential TTS with a text frontend, and a production-ready VPR system.
    Code: https://github.com/PaddlePaddle/PaddleSpeech
    Feature set:
    📦 Ease of Use: low barriers to install. The CLIs are available to quick-start your project.
    🔬 Align to the State-of-the-Art: provides high-speed and ultra-lightweight models, and also cutting-edge technology.
    🏆 Streaming ASR and TTS System: provides production-ready streaming ASR and streaming TTS systems.
    💯 Rule-based frontend: the frontend contains Text Normalization and Grapheme-to-Phoneme (G2P, including Polyphone and Tone Sandhi).
    🛎️ Multi-language: both English and Chinese are supported.
    Examples:
    Speech Recognition - Input wav: Input.wav; Output text: I knocked at the door on the ancient side of the building.
    Text-to-Speech - Input text: Life was like a box of chocolates, you never know what you're gonna get. Output wav: Output.wav
    submitted by /u/Aha_IamDaniel [link] [comments]  ( 1 min )
    Watch how AI makes realistic photos from anime
    submitted by /u/Due-Ad9795 [link] [comments]
    AI to track social media behavior of mass shooters
    I know next to nothing about AI, but was wondering if you could use AI to track the behavior or "search" for a pattern of behavior similar to mass shooters. Maybe a more broad question would be, could you develop AI to study human behavior and predict mental state through social media posts. submitted by /u/Lvl-Up-Candy [link] [comments]  ( 1 min )
    Last Week in AI: Autonomous cargo ships, how AI is used in Hollywood, AI to search for guns in public, and more!
    submitted by /u/regalalgorithm [link] [comments]
  • Open

    10 Best Practices For Data Science
    For quite some time now, data science has enjoyed a reputation as the next big revolution in the tech and business landscape. The number of businesses employing the applications of data science has only increased in the recent few years. According to Statista, as of 2021, nearly 60 percent of companies are housing at least… Read More »10 Best Practices For Data Science The post 10 Best Practices For Data Science appeared first on Data Science Central.  ( 7 min )
    Python Book Goodies and Apache Arrow
    In my rundown this week, I cover two distinct topics – a new Python analytics book and the rise of Apache Arrow. A New Python Data Analytics Book Published One of the best-known books on data analysis now has a new edition (3rd edition) available in early open access. The creator of the Pandas library,… Read More »Python Book Goodies and Apache Arrow The post Python Book Goodies and Apache Arrow appeared first on Data Science Central.  ( 3 min )
    How Valid-Page Metadata Helps Businesses Grow
    The Internet has added a new document known as valid-page metadata that explores how the Internet processes inconsistent and invalid HTML and deals with issues of invalid mark-up. The Internet updates the help document as the title link with a new section on these headlines and troubleshoots the area.  However, most businesses are not aware… Read More »How Valid-Page Metadata Helps Businesses Grow The post How Valid-Page Metadata Helps Businesses Grow appeared first on Data Science Central.  ( 4 min )
    Why You Need an Augmented Data Integration Tool
    Introduction We are in a time when information is the core element of business success for companies in almost any industry. As technologies emerge and find large-scale adoption, there is an influx of massive amounts of data within enterprises. Two primary challenges need to be solved to obtain the necessary information. First is trustable information…  ( 4 min )
    Pros And Cons of AI In Manufacturing
    The fourth industrial revolution has been a game-changer, with the global economy's expansion driving the adoption of new technologies across sectors. Manufacturers are using AI software in product design, production, supply chain, and logistics. AI analytics and data are helping improve product quality and efficiency. Advances in machine learning, artificial intelligence (AI), and Big…  ( 5 min )
  • Open

    Harmonic e
    Douglas Hofstadter discovered that the 8th harmonic number equals e. OK, not really. The following equation cannot possibly be true, because the left side is rational and the right side is irrational. However, Hofstadter showed that the equation does hold if you carry all calculations out to three decimal places: 1.000 + 0.500 + 0.333 + 0.250 + 0.200 + […]  ( 1 min )
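    The claim is easy to verify numerically; a quick sketch of ours:

```python
from math import e

# Sum the first eight harmonic terms, each rounded to three decimal places.
terms = [round(1 / n, 3) for n in range(1, 9)]
print(terms)                 # [1.0, 0.5, 0.333, 0.25, 0.2, 0.167, 0.143, 0.125]
print(round(sum(terms), 3))  # 2.718
print(round(e, 3))           # 2.718: the "equality" holds only at 3 decimals
```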
  • Open

    Optimality Conditions and Algorithms for Top-K Arm Identification. (arXiv:2205.12086v1 [stat.ML])
    We consider the top-k arm identification problem for multi-armed bandits with rewards belonging to a one-parameter canonical exponential family. The objective is to select the set of k arms with the highest mean rewards by sequential allocation of sampling efforts. We propose a unified optimal allocation problem that identifies the complexity measures of this problem under the fixed-confidence and fixed-budget settings, as well as the posterior convergence rate from the Bayesian perspective. We provide the first characterization of its optimality, and the first provably optimal algorithm in the fixed-confidence setting for k>1. We also propose an efficient heuristic algorithm for the top-k arm identification problem. Extensive numerical experiments demonstrate superior performance compared to existing methods in all three settings.
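    For orientation, here is a naive fixed-budget baseline of ours (not the paper's optimal allocation) that samples every arm equally often and keeps the k best empirical means; Gaussian rewards are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_uniform(means, k, budget):
    # Pull each arm budget // n_arms times, then return the k arms
    # with the highest empirical mean rewards.
    n = len(means)
    rewards = rng.normal(loc=means, scale=1.0, size=(budget // n, n))
    return set(np.argsort(rewards.mean(axis=0))[-k:].tolist())

print(topk_uniform([0.1, 0.3, 0.5, 0.7, 0.9], k=2, budget=5000))  # usually {3, 4}
```

    The paper's contribution is precisely to replace this uniform allocation with a provably optimal adaptive one.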
    Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width. (arXiv:2205.12101v1 [cs.LG])
    Substantial work indicates that the dynamics of neural networks (NNs) are closely related to their initialization of parameters. Inspired by the phase diagram for two-layer ReLU NNs with infinite width (Luo et al., 2021), we make a step towards drawing a phase diagram for three-layer ReLU NNs with infinite width. First, we derive a normalized gradient flow for three-layer ReLU NNs and obtain two key independent quantities to distinguish different dynamical regimes for common initialization methods. With carefully designed experiments and a large computation cost, for both synthetic and real datasets, we find that the dynamics of each layer can also be divided into a linear regime and a condensed regime, separated by a critical regime. The criterion is the relative change of input weights (the input weight of a hidden neuron consists of the weight from its input layer to the hidden neuron and its bias term) as the width approaches infinity during training, which tends to $0$, $+\infty$ and $O(1)$, respectively. In addition, we demonstrate that different layers can lie in different dynamical regimes during a training process within a deep NN. In the condensed regime, we also observe the condensation of weights in isolated orientations with low complexity. Through experiments under the three-layer condition, our phase diagram suggests a complicated landscape of dynamical regimes for deep NNs, consisting of three possible regimes together with their mixtures, and provides guidance for studying deep NNs in different initialization regimes, revealing the possibility of completely different dynamics emerging within a deep NN for its different layers.
    Graph Convolutional Reinforcement Learning for Collaborative Queuing Agents. (arXiv:2205.12009v1 [cs.NI])
    In this paper, we explore the use of multi-agent deep learning as well as learning to cooperate principles to meet stringent service level agreements, in terms of throughput and end-to-end delay, for a set of classified network flows. We consider agents built on top of a weighted fair queuing algorithm that continuously set weights for three flow groups: gold, silver, and bronze. We rely on a novel graph-convolution based, multi-agent reinforcement learning approach known as DGN. As benchmarks, we propose centralized and distributed deep Q-network approaches and evaluate their performances in different network, traffic, and routing scenarios, highlighting the effectiveness of our proposals and the importance of agent cooperation. We show that our DGN-based approach meets stringent throughput and delay requirements across all scenarios.
    DNNAbacus: Toward Accurate Computational Cost Prediction for Deep Neural Networks. (arXiv:2205.12095v1 [cs.LG])
    Deep learning is attracting interest across a variety of domains, including natural language processing, speech recognition, and computer vision. However, model training is time-consuming and requires huge computational resources. Existing works on the performance prediction of deep neural networks, which mostly focus on the training time prediction of a few models, rely on analytical models and result in high relative errors. This paper investigates the computational resource demands of 29 classical deep neural networks and builds accurate models for predicting computational costs. We first analyze the profiling results of typical networks and demonstrate that the computational resource demands of models with different inputs and hyperparameters are not obvious and intuitive. We then propose a lightweight prediction approach, DNNAbacus, with a novel network structural matrix for network representation. DNNAbacus can accurately predict both memory and time cost for PyTorch and TensorFlow models, generalizes to different hardware architectures, and has zero-shot capability for unseen networks. Our experimental results show that the mean relative error (MRE) is 0.9% with respect to time and 2.8% with respect to memory for 29 classic models, which is much lower than the state-of-the-art works.
    Training Efficient CNNs: Tweaking the Nuts and Bolts of Neural Networks for Lighter, Faster and Robust Models. (arXiv:2205.12050v1 [cs.LG])
    Deep learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. Many techniques have evolved over the past decade that made models lighter, faster, and more robust with better generalization. However, many deep learning practitioners persist with pre-trained models and architectures trained mostly on standard datasets such as ImageNet, MS-COCO, the IMDB-Wiki Dataset, and Kinetics-700, and are either hesitant to, or unaware of, redesigning the architecture from scratch, which can lead to better performance. This scenario leads to inefficient models that are not suitable for various devices such as mobile, edge, and fog. In addition, these conventional training methods are of concern as they consume a lot of computing power. In this paper, we revisit various SOTA techniques that deal with architecture efficiency (Global Average Pooling, depth-wise convolutions & squeeze-and-excitation, BlurPool), learning rate (Cyclical Learning Rate), data augmentation (Mixup, Cutout), label manipulation (label smoothing), weight space manipulation (stochastic weight averaging), and the optimizer (sharpness-aware minimization). We demonstrate how an efficient deep convolutional network can be built in a phased manner by sequentially reducing the number of training parameters and using the techniques mentioned above. We achieve a SOTA accuracy of 99.2% on MNIST with just 1500 parameters and an accuracy of 86.01% with just over 140K parameters on the CIFAR-10 dataset.
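    To make the architectural side concrete, here is a tiny illustrative PyTorch block combining two of the listed techniques, depth-wise separable convolutions and Global Average Pooling. The layer sizes are ours, not the paper's.

```python
import torch.nn as nn

# Illustrative only: a tiny CIFAR-style net that cuts parameters via a
# depth-wise separable convolution and a Global Average Pooling head.
class TinyNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1, groups=32),  # depth-wise
            nn.Conv2d(32, 64, 1),                        # point-wise
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # Global Average Pooling
        self.head = nn.Linear(64, n_classes)  # replaces large dense layers

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.head(x)

print(sum(p.numel() for p in TinyNet().parameters()))  # a few thousand params
```

    Label smoothing, another listed technique, is a one-liner in this setting: `nn.CrossEntropyLoss(label_smoothing=0.1)`.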
    On statistic alignment for domain adaptation in structural health monitoring. (arXiv:2205.12052v1 [cs.LG])
    The practical application of structural health monitoring (SHM) is often limited by the availability of labelled data. Transfer learning - specifically in the form of domain adaptation (DA) - gives rise to the possibility of leveraging information from a population of physical or numerical structures, by inferring a mapping that aligns the feature spaces. Typical DA methods rely on nonparametric distance metrics, which require sufficient data to perform density estimation. In addition, these methods can be prone to performance degradation under class imbalance. To address these issues, statistic alignment (SA) is discussed, with a demonstration of how these methods can be made robust to class imbalance, including a special case of class imbalance called a partial DA scenario. SA is demonstrated to facilitate damage localisation with no target labels in a numerical case study, outperforming other state-of-the-art DA methods. It is then shown to be capable of aligning the feature spaces of a real heterogeneous population, the Z24 and KW51 bridges, with only 220 samples used from the KW51 bridge. Finally, in scenarios where more complex mappings are required for knowledge transfer, SA is shown to be a vital pre-processing tool, increasing the performance of established DA methods.
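    In its simplest form, statistic alignment matches low-order feature statistics across domains; a minimal mean-and-scale sketch of ours follows (the paper studies more robust variants, e.g. under class imbalance).

```python
import numpy as np

def standardize_align(source, target):
    # Shift and scale target features so their per-dimension mean and
    # standard deviation match those of the source domain.
    t = (target - target.mean(axis=0)) / target.std(axis=0)
    return t * source.std(axis=0) + source.mean(axis=0)
```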
    Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation. (arXiv:2205.11140v2 [cs.LG] UPDATED)
    We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy which is most preferred by the human overseer. Despite the empirical successes, the theoretical understanding of preference-based RL (PbRL) is only limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and calculates the exploratory policies by solving an optimistic planning problem. Our algorithm achieves the regret of $\tilde{O} (\operatorname{poly}(d H) \sqrt{K} )$, where $d$ is the complexity measure of the transition and preference model depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
    Logarithmic regret bounds for continuous-time average-reward Markov decision processes. (arXiv:2205.11168v2 [cs.LG] UPDATED)
    We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
    Deep Neural Network approaches for Analysing Videos of Music Performances. (arXiv:2205.11232v2 [cs.CV] UPDATED)
    This paper presents a framework to automate the labelling process for gestures in musical performance videos with a 3D Convolutional Neural Network (CNN). While this idea was proposed in a previous study, this paper introduces several novelties: (i) it presents a novel method to overcome the class-imbalance challenge and make learning possible for co-existent gestures, through a batch-balancing approach and spatial-temporal representations of gestures; (ii) it performs a detailed study on 7 and 18 categories of gestures generated during the performance (guitar playing) of video-recorded musical pieces; (iii) it investigates the possibility of using audio features; and (iv) it extends the analysis to multiple videos. The novel methods significantly improve the performance of gesture identification by 12% compared to the previous work (51% in this study versus 39% previously). We successfully validate the proposed methods on 7 super classes (72%), an ensemble of the 18 gestures/classes, and additional videos (75%).
    GraphMAE: Self-Supervised Masked Graph Autoencoders. (arXiv:2205.10803v2 [cs.LG] UPDATED)
    Self-supervised learning (SSL) has been extensively explored in recent years. Particularly, generative SSL has seen emerging success in natural language processing and other fields, such as the wide adoption of BERT and GPT. Despite this, contrastive learning, which heavily relies on structural data augmentation and complicated training strategies, has been the dominant approach in graph SSL, while the progress of generative SSL on graphs, especially graph autoencoders (GAEs), has thus far not reached the potential promised in other fields. In this paper, we identify and examine the issues that negatively impact the development of GAEs, including their reconstruction objective, training robustness, and error metric. We present a masked graph autoencoder, GraphMAE, that mitigates these issues for generative self-supervised graph learning. Instead of reconstructing structures, we propose to focus on feature reconstruction with both a masking strategy and a scaled cosine error that benefit the robust training of GraphMAE. We conduct extensive experiments on 21 public datasets for three different graph learning tasks. The results show that GraphMAE, a simple graph autoencoder with our careful designs, consistently outperforms both contrastive and generative state-of-the-art baselines. This study provides an understanding of graph autoencoders and demonstrates the potential of generative self-supervised learning on graphs.
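    A sketch of the scaled cosine error as we read it from the abstract; the exponent `gamma` and its default are our illustrative choices, not values from the paper.

```python
import torch.nn.functional as F

def scaled_cosine_error(x, x_hat, gamma=2.0):
    # Per-node cosine similarity between original and reconstructed masked
    # features; gamma > 1 down-weights already well-aligned (easy) nodes.
    cos = F.cosine_similarity(x, x_hat, dim=-1)
    return ((1.0 - cos) ** gamma).mean()
```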
    Bayesian Active Meta-Learning for Black-Box Optimization. (arXiv:2110.09943v2 [cs.LG] CROSS LISTED)
    Data-efficient learning algorithms are essential in many practical applications for which data collection is expensive, e.g., for the optimal deployment of wireless systems in unknown propagation scenarios. Meta-learning can address this problem by leveraging data from a set of related learning tasks, e.g., from similar deployment settings. In practice, one may have available only unlabeled data sets from the related tasks, requiring a costly labeling procedure to be carried out before use in meta-learning. For instance, one may know the possible positions of base stations in a given area, but not the performance indicators achievable with each deployment. To decrease the number of labeling steps required for meta-learning, this paper introduces an information-theoretic active task selection mechanism, and evaluates an instantiation of the approach for Bayesian optimization of black-box models.
    Stack operation of tensor networks. (arXiv:2203.16338v2 [cs.LG] UPDATED)
    The tensor network, as a factorization of tensors, aims at performing the operations that are common for normal tensors, such as addition, contraction and stacking. However, due to its non-unique network structure, only the tensor network contraction is so far well defined. In this paper, we propose a mathematically rigorous definition for the tensor network stack approach, which compresses a large number of tensor networks into a single one without changing their structures and configurations. We illustrate the main ideas with matrix product state based machine learning as an example. Our results are compared with the for-loop and the efficient coding methods on both CPU and GPU.
    Towards Practical Physics-Informed ML Design and Evaluation for Power Grid. (arXiv:2205.03673v2 [cs.LG] UPDATED)
    When applied to a real-world safety-critical system like the power grid, general machine learning methods suffer from expensive training, non-physical solutions, and limited interpretability. To address these challenges for power grids, many recent works have explored the inclusion of grid physics (i.e., domain expertise) into their method design, primarily through including system constraints and technical limits, reducing the search space and defining meaningful features in latent space. Yet, there is no general methodology to evaluate the practicality of these approaches in power grid tasks, and limitations exist regarding scalability, generalization, interpretability, etc. This work formalizes a new concept of physical interpretability, which assesses how an ML model makes predictions in a physically meaningful way, and introduces an evaluation methodology that identifies a set of attributes that a practical method should satisfy. Inspired by the evaluation attributes, the paper further develops a novel contingency analysis warm starter for the MadIoT cyberattack, based on a conditional Gaussian random field. This method serves as an instance of an ML model that can incorporate diverse domain knowledge and improve on these identified attributes. Experiments validate that the warm starter significantly boosts the efficiency of contingency analysis for the MadIoT attack even with shallow NN architectures.
    On Understanding and Mitigating the Dimensional Collapse of Graph Contrastive Learning: a Non-Maximum Removal Approach. (arXiv:2203.12821v2 [cs.LG] UPDATED)
    Graph Contrastive Learning (GCL) has shown promising performance in graph representation learning (GRL) without the supervision of manual annotations. GCL can generate graph-level embeddings by maximizing the Mutual Information (MI) between different augmented views of the same graph (positive pairs). However, GCL is limited by dimensional collapse, i.e., embedding vectors only occupy a low-dimensional subspace. In this paper, we show that the smoothing effect of graph pooling and the implicit regularization of graph convolution are two causes of the dimensional collapse in GCL. To mitigate this issue, we propose a non-maximum removal graph contrastive learning approach (nmrGCL), which removes "prominent" dimensions (i.e., those contributing most to the similarity measure) for positive pairs in the pretext task. Comprehensive experiments on various benchmark datasets are conducted to demonstrate the effectiveness of nmrGCL, and the results show that our model outperforms the state-of-the-art methods. Source code will be made publicly available.
    Unsupervised Ranking and Aggregation of Label Descriptions for Zero-Shot Classifiers. (arXiv:2204.09481v2 [cs.CL] UPDATED)
    Zero-shot text classifiers based on label descriptions embed an input text and a set of labels into the same space: measures such as cosine similarity can then be used to select the most similar label description to the input text as the predicted label. In a true zero-shot setup, designing good label descriptions is challenging because no development set is available. Inspired by the literature on Learning with Disagreements, we look at how probabilistic models of repeated rating analysis can be used for selecting the best label descriptions in an unsupervised fashion. We evaluate our method on a set of diverse datasets and tasks (sentiment, topic and stance). Furthermore, we show that multiple, noisy label descriptions can be aggregated to boost the performance.
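    The base zero-shot scheme the paper builds on is simple to sketch; `encode` below is any sentence encoder mapping strings to vectors, which is our assumption (the paper does not prescribe one).

```python
import numpy as np

def zero_shot_predict(encode, text, label_descriptions):
    # Pick the label whose description embedding is most cosine-similar
    # to the input text embedding.
    t = encode(text)
    d = np.stack([encode(desc) for desc in label_descriptions])
    sims = d @ t / (np.linalg.norm(d, axis=1) * np.linalg.norm(t))
    return int(np.argmax(sims))
```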
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v2 [cs.LG] UPDATED)
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
    Rethinking Attention-Model Explainability through Faithfulness Violation Test. (arXiv:2201.12114v2 [cs.LG] UPDATED)
    Attention mechanisms are dominating the explainability of deep models. They produce probability distributions over the input, which are widely deemed feature-importance indicators. However, in this paper, we find one critical limitation in attention explanations: weakness in identifying the polarity of feature impact. This can be misleading: features with higher attention weights may not faithfully contribute to model predictions; instead, they can impose suppression effects. With this finding, we reflect on the explainability of current attention-based techniques, such as Attention$\odot$Gradient and LRP-based attention explanations. We first propose an actionable diagnostic methodology (henceforth the faithfulness violation test) to measure the consistency between explanation weights and impact polarity. Through extensive experiments, we then show that most tested explanation methods are unexpectedly hindered by the faithfulness violation issue, especially raw attention. Empirical analyses of the factors affecting violation issues further provide useful observations for adopting explanation methods in attention models.
    Retrieval-Augmented Reinforcement Learning. (arXiv:2202.08417v4 [cs.LG] UPDATED)
    Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent's past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. The proposed method facilitates learning agents that at test time can condition their behavior on the entire dataset, not only the current state or current trajectory. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method.
    Logical Fallacy Detection. (arXiv:2202.13758v2 [cs.CL] UPDATED)
    Reasoning is central to human intelligence. However, fallacious arguments are common, and some exacerbate problems such as spreading misinformation about climate change. In this paper, we propose the task of logical fallacy detection, and provide a new dataset (Logic) of logical fallacies generally found in text, together with an additional challenge set for detecting logical fallacies in climate change claims (LogicClimate). Detecting logical fallacies is a hard problem as the model must understand the underlying logical structure of the argument. We find that existing pretrained large language models perform poorly on this task. In contrast, we show that a simple structure-aware classifier outperforms the best language model by 5.46% on Logic and 4.51% on LogicClimate. We encourage future work to explore this task as (a) it can serve as a new reasoning challenge for language models, and (b) it can have potential applications in tackling the spread of misinformation. Our dataset and code are available at https://github.com/causalNLP/logical-fallacy.
    MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data. (arXiv:2203.12369v4 [cs.SD] UPDATED)
    Training of speech enhancement systems often does not incorporate knowledge of human perception and thus can lead to unnatural sounding results. Incorporating psychoacoustically motivated speech perception metrics as part of model training via a predictor network has recently gained interest. However, the performance of such predictors is limited by the distribution of metric scores that appear in the training data. In this work, we propose MetricGAN+/- (an extension of MetricGAN+, one such metric-motivated system) which introduces an additional network - a "de-generator" which attempts to improve the robustness of the prediction network (and by extension of the generator) by ensuring observation of a wider range of metric scores in training. Experimental results on the VoiceBank-DEMAND dataset show relative improvement in PESQ score of 3.8% (3.05 vs 3.22 PESQ score), as well as better generalisation to unseen noise and speech.
    Convolutional Neural Networks on Graphs with Chebyshev Approximation, Revisited. (arXiv:2202.03580v3 [cs.LG] UPDATED)
    Designing spectral convolutional networks is a challenging problem in graph learning. ChebNet, one of the early attempts, approximates the spectral graph convolutions using Chebyshev polynomials. GCN simplifies ChebNet by utilizing only the first two Chebyshev polynomials while still outperforming it on real-world datasets. GPR-GNN and BernNet demonstrate that the Monomial and Bernstein bases also outperform the Chebyshev basis in terms of learning the spectral graph convolutions. Such conclusions are counter-intuitive in the field of approximation theory, where it is established that the Chebyshev polynomial achieves the optimal convergence rate for approximating a function. In this paper, we revisit the problem of approximating the spectral graph convolutions with Chebyshev polynomials. We show that ChebNet's inferior performance is primarily due to illegal coefficients learned by ChebNet when approximating analytic filter functions, which leads to over-fitting. We then propose ChebNetII, a new GNN model based on Chebyshev interpolation, which enhances the original Chebyshev polynomial approximation while reducing the Runge phenomenon. We conducted an extensive experimental study to demonstrate that ChebNetII can learn arbitrary graph convolutions and achieve superior performance in both full- and semi-supervised node classification tasks. Most notably, we scale ChebNetII to the billion-scale graph ogbn-papers100M, showing that spectral-based GNNs have superior performance.
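    For reference, the Chebyshev basis underlying ChebNet follows the standard three-term recurrence; the sketch below evaluates it pointwise (in ChebNet the same recurrence is applied to the scaled graph Laplacian, with matrix products in place of scalar ones).

```python
import numpy as np

def chebyshev_basis(x, K):
    # T_0(x) = 1, T_1(x) = x, T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x), x in [-1, 1].
    T = [np.ones_like(x), x]
    for _ in range(2, K + 1):
        T.append(2 * x * T[-1] - T[-2])
    return np.stack(T[:K + 1])
```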
    Random Feature Amplification: Feature Learning and Generalization in Neural Networks. (arXiv:2202.07626v2 [cs.LG] UPDATED)
    In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.
    Training Differentially Private Models with Secure Multiparty Computation. (arXiv:2202.02625v2 [cs.CR] UPDATED)
    We address the problem of learning a machine learning model from training data that originates at multiple data owners while providing formal privacy guarantees regarding the protection of each owner's data. Existing solutions based on Differential Privacy (DP) achieve this at the cost of a drop in accuracy. Solutions based on Secure Multiparty Computation (MPC) do not incur such accuracy loss but leak information when the trained model is made publicly available. We propose an MPC solution for training DP models. Our solution relies on an MPC protocol for model training, and an MPC protocol for perturbing the trained model coefficients with Laplace noise in a privacy-preserving manner. The resulting MPC+DP approach achieves higher accuracy than a pure DP approach while providing the same formal privacy guarantees. Our work obtained first place in the iDASH2021 Track III competition on confidential computing for secure genome analysis.
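    The DP ingredient is classic Laplace output perturbation; a plain-text sketch of ours follows (in the paper, this noise addition happens inside MPC, so the unperturbed coefficients are never revealed).

```python
import numpy as np

def laplace_perturb(coefficients, sensitivity, epsilon, rng=None):
    # Add Laplace(0, sensitivity / epsilon) noise to each trained model
    # coefficient; shown in the clear here purely for illustration.
    rng = rng or np.random.default_rng()
    return coefficients + rng.laplace(0.0, sensitivity / epsilon,
                                      size=np.shape(coefficients))
```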
    PrivFair: a Library for Privacy-Preserving Fairness Auditing. (arXiv:2202.04058v3 [cs.LG] UPDATED)
    Machine learning (ML) has become prominent in applications that directly affect people's quality of life, including in healthcare, justice, and finance. ML models have been found to exhibit discrimination based on sensitive attributes such as gender, race, or disability. Assessing if an ML model is free of bias remains challenging to date, and by definition has to be done with sensitive user characteristics that are subject of anti-discrimination and data protection law. Existing libraries for fairness auditing of ML models offer no mechanism to protect the privacy of the audit data. We present PrivFair, a library for privacy-preserving fairness audits of ML models. Through the use of Secure Multiparty Computation (MPC), PrivFair protects the confidentiality of the model under audit and the sensitive data used for the audit, hence it supports scenarios in which a proprietary classifier owned by a company is audited using sensitive audit data from an external investigator. We demonstrate the use of PrivFair for group fairness auditing with tabular data or image data, without requiring the investigator to disclose their data to anyone in an unencrypted manner, or the model owner to reveal their model parameters to anyone in plaintext.
    On the identifiability of mixtures of ranking models. (arXiv:2201.13132v2 [cs.LG] UPDATED)
    Mixtures of ranking models are standard tools for ranking problems. However, even the fundamental question of parameter identifiability is not fully understood: the identifiability of a mixture model with two Bradley-Terry-Luce (BTL) components has remained open. In this work, we show that popular mixtures of ranking models with two components (BTL, multinomial logistic models with slates of size 3, or Plackett-Luce) are generically identifiable, i.e., the ground-truth parameters can be identified except when they are from a pathological subset of measure zero. We provide a framework for verifying the number of solutions in a general family of polynomial systems using algebraic geometry, and apply it to these mixtures of ranking models to establish generic identifiability. The framework can be applied more broadly to other learning models and may be of independent interest.
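    For readers unfamiliar with the model, a two-component BTL mixture assigns pairwise-comparison probabilities as follows; parameter names are ours.

```python
def btl_mixture_prob(i, j, w_a, w_b, alpha):
    # Component A: P(i beats j) = w_a[i] / (w_a[i] + w_a[j]); likewise for B.
    # The mixture combines the two components with weight alpha.
    p_a = w_a[i] / (w_a[i] + w_a[j])
    p_b = w_b[i] / (w_b[i] + w_b[j])
    return alpha * p_a + (1 - alpha) * p_b
```

    Identifiability asks whether (w_a, w_b, alpha) can be recovered from these probabilities alone.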
    Identifying Dementia Subtypes with Electronic Health Records. (arXiv:2202.00009v2 [cs.LG] UPDATED)
    Dementia is characterized by a decline in memory and thinking that is significant enough to impair function in activities of daily living. Patients seen in dementia specialty clinics are highly heterogeneous, with a variety of different symptoms that progress at different rates. In this work, we used an unsupervised, data-driven K-Means clustering approach on the component scores of the Clinical Dementia Rating (CDR) to identify dementia subtypes, and used the gap statistic to identify the optimal number of clusters. Our goal was to characterize the identified dementia subtypes in terms of their cognitive performance and analyze how patient transitions between subtypes relate to disease progression. Our results indicate both (i) inter-subtype variability, i.e., variability among dementia subtypes for a particular component score even at the same CDR, and (ii) intra-subtype variability, i.e., variation in the 6 component scores within a particular dementia subtype. We observed that dementia subtypes representing individuals with very mild dementia (CDR 0.5) had widely varying rates of transition to other subtypes. Future work includes testing the generalizability of our proposed pipeline on additional datasets, and using a larger volume of EHR data to obtain probabilistic estimates of the variability between dementia subtypes, both in terms of cognitive profile and disease progression.
    Efficient Strong Scaling Through Burst Parallel Training. (arXiv:2112.10065v3 [cs.DC] UPDATED)
    As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size that can be used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 1.2 - 2.3x improvement in total cluster throughput over standard data parallelism with a single task when the cluster scale is large.
    Combining optimal path search with task-dependent learning in a neural network. (arXiv:2201.11104v3 [cs.LG] UPDATED)
    Finding optimal paths in connected graphs requires determining the smallest total cost for traveling along the graph's edges. This problem can be solved by several classical algorithms where, usually, costs are predefined for all edges. Conventional planning methods can thus normally not be used when one wants to change costs in an adaptive way following the requirements of some task. Here we show that one can define a neural network representation of path finding problems by transforming cost values into synaptic weights, which allows for online weight adaptation using network learning mechanisms. When starting with an initial activity value of one, activity propagation in this network leads to solutions that are identical to those found by the Bellman-Ford algorithm. The neural network has the same algorithmic complexity as Bellman-Ford, and, in addition, we can show that network learning mechanisms (such as Hebbian learning) can adapt the weights in the network, augmenting the resulting paths according to some task at hand. We demonstrate this by learning to navigate in an environment with obstacles, as well as by learning to follow certain sequences of path nodes. Hence, the novel algorithm presented here may open up a different regime of applications where path augmentation (by learning) is directly coupled with path finding in a natural way.
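    For comparison, here is the classical relaxation that the network's activity propagation reproduces, as a plain NumPy sketch of ours (not the paper's code).

```python
import numpy as np

def bellman_ford(weights, source):
    # weights[i, j] is the cost of edge i -> j (np.inf where no edge exists).
    n = weights.shape[0]
    cost = np.full(n, np.inf)
    cost[source] = 0.0
    for _ in range(n - 1):  # n - 1 relaxation sweeps suffice
        cost = np.minimum(cost, (cost[:, None] + weights).min(axis=0))
    return cost
```

    In the paper's construction, the edge costs become synaptic weights and each relaxation sweep corresponds to one step of activity propagation.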
    Balanced Graph Structure Learning for Multivariate Time Series Forecasting. (arXiv:2201.09686v2 [cs.LG] UPDATED)
    Accurate forecasting of multivariate time series is an extensively studied subject in finance, transportation, and computer science. Fully mining the correlation and causation between the variables in a multivariate time series exhibits noticeable results in improving the performance of a time series model. Recently, some models have explored the dependencies between variables through end-to-end graph structure learning without the need for predefined graphs. However, current models do not incorporate the trade-off between efficiency and flexibility, and lack the guidance of domain knowledge in the design of graph structure learning algorithms. This paper alleviates these issues by proposing Balanced Graph Structure Learning for Forecasting (BGSLF), a novel deep learning model that joins graph structure learning and forecasting. Technically, BGSLF leverages spatial information in convolutional operations and extracts temporal dynamics using a diffusion convolutional recurrent network. The proposed framework balances the trade-off between efficiency and flexibility by introducing a Multi-Graph Generation Network (MGN) and a Graph Selection Module. In addition, a method named Smooth Sparse Unit (SSU) is designed to sparsify the learned graph structures, which conforms to the sparse spatial correlations in the real world. Extensive experiments on four real-world datasets demonstrate that our model achieves state-of-the-art performance with few trainable parameters. Code will be made publicly available.
    Stochastic Neural Networks with Infinite Width are Deterministic. (arXiv:2201.12724v2 [cs.LG] UPDATED)
    This work theoretically studies stochastic neural networks, a main type of neural network in use. We prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Our theory justifies the common intuition that adding stochasticity to the model can help regularize the model by introducing an averaging effect. Two common examples that our theory can be relevant to are neural networks with dropout and Bayesian latent variable models in a special limit. Our result thus helps better understand how stochasticity affects the learning of neural networks and potentially design better architectures for practical problems.
    PowerGraph: Using neural networks and principal components to determine multivariate statistical power trade-offs. (arXiv:2201.00719v2 [stat.ME] UPDATED)
    Statistical power estimation for studies with multiple model parameters is inherently a multivariate problem. Power for individual parameters of interest cannot be reliably estimated univariately since correlation and variance explained relative to one parameter will impact the power for another parameter, all usual univariate considerations being equal. Explicit solutions in such cases, especially for models with many parameters, are either impractical or impossible to solve, leaving researchers to the prevailing method of simulating power. However, the point estimates for a vector of model parameters are uncertain, and the impact of inaccuracy is unknown. In such cases, sensitivity analysis is recommended such that multiple combinations of possible observable parameter vectors are simulated to understand power trade-offs. A limitation to this approach is that it is computationally expensive to generate sufficient sensitivity combinations to accurately map the power trade-off function in increasingly high-dimensional spaces for the models that social scientists estimate. This paper explores the efficient estimation and graphing of statistical power for a study over varying model parameter combinations. We propose a simple and generalizable machine learning inspired solution to cut the computational cost to less than 10% of the brute force method while providing F1 scores above 90%. We further motivate the impact of transfer learning in learning power manifolds across varying distributions.
    Bayesian Calibration of imperfect computer models using Physics-informed priors. (arXiv:2201.06463v2 [stat.ML] UPDATED)
    We introduce a computationally efficient, data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models are often imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. To obtain the posterior distributions, we use Hamiltonian Monte Carlo sampling. We demonstrate our approach in a simulation study with hemodynamical models, which are time-dependent differential equations. Data are simulated from a more complex model than our modelling choice, and the aim is to learn physical parameters according to known mathematical connections. To demonstrate the flexibility of our approach, we also include an example using the heat equation, a space-time-dependent differential equation, where we consider a case of a biased data-acquisition process. Finally, we fit the hemodynamic model using real data obtained in a medical trial.
    Swift and Sure: Hardness-aware Contrastive Learning for Low-dimensional Knowledge Graph Embeddings. (arXiv:2201.00565v2 [cs.LG] UPDATED)
    Knowledge graph embedding (KGE) has shown great potential in automatic knowledge graph (KG) completion and knowledge-driven tasks. However, recent KGE models suffer from high training cost and large storage space, thus limiting their practicality in real-world applications. To address this challenge, based on the latest findings in the field of contrastive learning, we propose a novel KGE training framework called Hardness-aware Low-dimensional Embedding (HaLE). Instead of traditional negative sampling, we design a new loss function based on query sampling that can balance two important training targets, alignment and uniformity. Furthermore, we analyze the hardness-aware ability of recent low-dimensional hyperbolic models and propose a lightweight hardness-aware activation mechanism. The experimental results show that within limited training time, HaLE can effectively improve the performance and training speed of KGE models on five commonly used datasets. After training for just a few minutes, HaLE-trained models are competitive with state-of-the-art models in both low- and high-dimensional conditions.
    OstrichRL: A Musculoskeletal Ostrich Simulation to Study Bio-mechanical Locomotion. (arXiv:2112.06061v2 [cs.RO] UPDATED)
    Muscle-actuated control is a research topic that spans multiple domains, including biomechanics, neuroscience, reinforcement learning, robotics, and graphics. This type of control is particularly challenging as bodies are often overactuated and dynamics are delayed and non-linear. It is however a very well tested and tuned actuation mechanism that has undergone millions of years of evolution with interesting properties exploiting passive forces and efficient energy storage of muscle-tendon units. To facilitate research on muscle-actuated simulation, we release a 3D musculoskeletal simulation of an ostrich based on the MuJoCo physics engine. The ostrich is one of the fastest bipeds on earth and therefore makes an excellent model for studying muscle-actuated bipedal locomotion. The model is based on CT scans and dissections used to collect actual muscle data, such as insertion sites, lengths, and pennation angles. Along with this model, we also provide a set of reinforcement learning tasks, including reference motion tracking, running, and neck control, used to infer muscle actuation patterns. The reference motion data is based on motion capture clips of various behaviors that we preprocessed and adapted to our model. This paper describes how the model was built and iteratively improved using the tasks. We also evaluate the accuracy of the muscle actuation patterns by comparing them to experimentally collected electromyographic data from locomoting birds. The results demonstrate the need for rich reward signals or regularization techniques to constrain muscle excitations and produce realistic movements. Overall, we believe that this work can provide a useful bridge between fields of research interested in muscle actuation.
    A Review of Indoor Millimeter Wave Device-based Localization and Device-free Sensing Technologies and Applications. (arXiv:2112.05593v2 [cs.NI] UPDATED)
    The commercial availability of low-cost millimeter wave (mmWave) communication and radar devices is starting to improve the penetration of such technologies in consumer markets, paving the way for large-scale and dense deployments in fifth-generation (5G)-and-beyond as well as 6G networks. At the same time, pervasive mmWave access will enable device localization and device-free sensing with unprecedented accuracy, especially with respect to sub-6 GHz commercial-grade devices. This paper surveys the state of the art in device-based localization and device-free sensing using mmWave communication and radar devices, with a focus on indoor deployments. We first overview key concepts about mmWave signal propagation and system design. Then, we provide a detailed account of approaches and algorithms for localization and sensing enabled by mmWaves. We consider several dimensions in our analysis, including the main objectives, techniques, and performance of each work, whether each research reached some degree of implementation, and which hardware platforms were used for this purpose. We conclude by discussing that better algorithms for consumer-grade devices, data fusion methods for dense deployments, as well as an educated application of machine learning methods are promising, relevant and timely research directions.
    Spatio-Temporal Modeling for Flash Memory Channels Using Conditional Generative Nets. (arXiv:2111.10039v2 [eess.SY] UPDATED)
    We propose a data-driven approach to modeling the spatio-temporal characteristics of NAND flash memory read voltages using conditional generative networks. The learned model reconstructs read voltages from an individual memory cell based on the program levels of the cell and its surrounding cells, as well as the specified program/erase (P/E) cycling time stamp. We evaluate the model over a range of time stamps using the cell read voltage distributions, the cell level error rates, and the relative frequency of errors for patterns most susceptible to inter-cell interference (ICI) effects. We conclude that the model accurately captures the spatial and temporal features of the flash memory channel.
    Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization. (arXiv:2110.10832v3 [cs.LG] UPDATED)
    In Domain Generalization (DG) settings, models trained independently on a given set of training domains have notoriously chaotic performance on distribution shifted test domains, and stochasticity in optimization (e.g. seed) plays a big role. This makes deep learning models unreliable in real world settings. We first show that this chaotic behavior exists even along the training optimization trajectory of a single model, and propose a simple model averaging protocol that both significantly boosts domain generalization and diminishes the impact of stochasticity by improving the rank correlation between the in-domain validation accuracy and out-domain test accuracy, which is crucial for reliable early stopping. Taking advantage of our observation, we show that instead of ensembling unaveraged models (that is typical in practice), ensembling moving average models (EoA) from independent runs further boosts performance. We theoretically explain the boost in performance of ensembling and model averaging by adapting the well known Bias-Variance trade-off to the domain generalization setting. On the DomainBed benchmark, when using a pre-trained ResNet-50, this ensemble of averages achieves an average of $68.0\%$, beating vanilla ERM (w/o averaging/ensembling) by $\sim 4\%$, and when using a pre-trained RegNetY-16GF, achieves an average of $76.6\%$, beating vanilla ERM by $6\%$. Our code is available at \url{https://github.com/salesforce/ensemble-of-averages}.
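    A minimal sketch of the two ingredients: a running average of weights along one training trajectory, plus an ensemble over independent runs. Uniform averaging here is our simplification; the paper specifies its own averaging protocol.

```python
import torch

@torch.no_grad()
def update_moving_average(avg_model, model, n_seen):
    # Uniform running average of weights: avg <- (n * avg + current) / (n + 1).
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        p_avg.mul_(n_seen / (n_seen + 1)).add_(p / (n_seen + 1))

@torch.no_grad()
def eoa_predict(avg_models, x):
    # Ensemble of Averages: average the predictions of the averaged models.
    return torch.stack([m(x).softmax(-1) for m in avg_models]).mean(0)
```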
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v6 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose directed flow imbalance measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing graph neural network methods, and can naturally incorporate node features, unlike existing spectral methods. Extensive experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 state-of-the-art methods from the literature, for a wide range of noise and sparsity levels, graph structures and topologies, and even outperforms supervised methods.
    Joint Embedding of Structural and Functional Brain Networks with Graph Neural Networks for Mental Illness Diagnosis. (arXiv:2107.03220v2 [q-bio.NC] UPDATED)
    Multimodal brain networks characterize complex connectivities among different brain regions from both structural and functional aspects and provide a new means for mental disease analysis. Recently, Graph Neural Networks (GNNs) have become a de facto model for analyzing graph-structured data. However, how to employ GNNs to extract effective representations from brain networks in multiple modalities remains rarely explored. Moreover, as brain networks provide no initial node features, how to design informative node attributes and leverage edge weights for GNNs to learn is left unsolved. To this end, we develop a novel multiview GNN for multimodal brain networks. In particular, we regard each modality as a view for brain networks and employ contrastive learning for multimodal fusion. Then, we propose a GNN model which takes advantage of the message passing scheme by propagating messages based on degree statistics and brain region connectivities. Extensive experiments on two real-world disease datasets (HIV and Bipolar) demonstrate the effectiveness of our proposed method over state-of-the-art baselines.
    Nonnegative Tensor Completion via Integer Optimization. (arXiv:2111.04580v2 [cs.LG] UPDATED)
    Unlike matrix completion, tensor completion does not have an algorithm that is known to achieve the information-theoretic sample complexity rate. This paper develops a new algorithm for the special case of completion for nonnegative tensors. We prove that our algorithm converges in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Our approach is to define a new norm for nonnegative tensors using the gauge of a particular 0-1 polytope; integer linear programming can, in turn, be used to solve linear separation problems over this polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through computational experiments using a laptop on tensors with up to one-hundred million entries.
    Learning Deep Representation with Energy-Based Self-Expressiveness for Subspace Clustering. (arXiv:2110.15037v2 [cs.LG] UPDATED)
    Deep subspace clustering has attracted increasing attention in recent years. Almost all existing works are required to load the whole training data into one batch for learning the self-expressive coefficients in the framework of deep learning. Although these methods achieve promising results, such a learning fashion severely prevents the use of deeper neural network architectures (e.g., ResNet), leading to limited representation abilities of the models. In this paper, we propose a new deep subspace clustering framework, motivated by energy-based models. In contrast to previous approaches that take the weights of a fully connected layer as the self-expressive coefficients, we propose to learn an energy-based network to obtain the self-expressive coefficients by mini-batch training. By this means, it is no longer necessary to load all data into one batch for learning, making it feasible to utilize deeper neural network models for subspace clustering. Considering the powerful representation ability of the recently popular self-supervised learning, we attempt to leverage self-supervised representation learning to learn the dictionary. Finally, we propose a joint framework to learn both the self-expressive coefficients and the dictionary simultaneously, and train the model in an end-to-end manner. Experiments are performed on three publicly available datasets, and extensive experimental results demonstrate that our method significantly outperforms the other related approaches. For instance, on the three datasets, our method achieves average improvements of $13.8\%$, $15.4\%$, and $20.8\%$ in terms of Accuracy, NMI, and ARI over SENet, which was proposed very recently and obtains the second-best results in the experiments.
    Bellman-consistent Pessimism for Offline Reinforcement Learning. (arXiv:2106.06926v5 [cs.LG] UPDATED)
    The use of pessimism when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness, as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation, where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
    Approximated Multi-Agent Fitted Q Iteration. (arXiv:2104.09343v4 [cs.LG] UPDATED)
    We formulate an efficient approximation for multi-agent batch reinforcement learning, the approximated multi-agent fitted Q iteration (AMAFQI). We present a detailed derivation of our approach. We propose an iterative policy search and show that it yields a greedy policy with respect to multiple approximations of the centralized, learned Q-function. In each iteration and policy evaluation, AMAFQI requires a number of computations that scales linearly with the number of agents, whereas the analogous number of computations increases exponentially for the fitted Q iteration (FQI), a commonly used approach in batch reinforcement learning. This property of AMAFQI is fundamental for the design of a tractable multi-agent approach. We evaluate the performance of AMAFQI and compare it to FQI in numerical simulations. The simulations illustrate the significant computation time reduction when using AMAFQI instead of FQI in multi-agent problems and corroborate the similar performance of both approaches.
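    For orientation, below is a minimal batch fitted Q iteration (FQI) on a discrete toy problem; it is not AMAFQI itself, but it makes the exponential joint-action blow-up that AMAFQI avoids concrete: with m agents and k actions each, the joint Q-function has k**m action columns. All names and sizes here are illustrative.

    ```python
    # Minimal tabular batch FQI sketch; the per-cell average stands in for the
    # regression step of the general algorithm.
    import numpy as np

    def fqi(batch, n_states, n_joint_actions, gamma=0.95, n_iters=50):
        # batch: list of (s, joint_a, r, s') transitions collected offline
        Q = np.zeros((n_states, n_joint_actions))
        for _ in range(n_iters):
            Q_new = np.zeros_like(Q)
            counts = np.zeros_like(Q)
            for s, a, r, s2 in batch:
                target = r + gamma * Q[s2].max()
                Q_new[s, a] += target
                counts[s, a] += 1
            mask = counts > 0
            Q[mask] = Q_new[mask] / counts[mask]  # "regression" = per-cell mean
        return Q

    rng = np.random.default_rng(0)
    batch = [(rng.integers(4), rng.integers(8), rng.random(), rng.integers(4))
             for _ in range(1000)]
    # 8 = 2**3 joint actions of 3 binary agents: this column count is what
    # grows exponentially with the number of agents in plain FQI
    Q = fqi(batch, n_states=4, n_joint_actions=8)
    print(Q.shape)
    ```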
    SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. (arXiv:2110.07205v3 [eess.AS] UPDATED)
    Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.
    Differentiable Architecture Search for Reinforcement Learning. (arXiv:2106.02229v3 [cs.LG] UPDATED)
    In this paper, we investigate the fundamental question: to what extent are gradient-based neural architecture search (NAS) techniques applicable to RL? Using the original DARTS as a convenient baseline, we discover that the discrete architectures found can achieve up to 250% of the performance of manual architecture designs on both discrete and continuous action space environments across off-policy and on-policy RL algorithms, at only 3x more computation time. Furthermore, through numerous ablation studies, we systematically verify that not only does DARTS correctly upweight operations during its supernet phase, but it also gradually improves the resulting discrete cells, up to 30x more efficiently than random search, suggesting DARTS is surprisingly an effective tool for improving architectures in RL.
    A multi-stage machine learning model on diagnosis of esophageal manometry. (arXiv:2106.13869v2 [cs.LG] UPDATED)
    High-resolution manometry (HRM) is the primary procedure used to diagnose esophageal motility disorders. Its interpretation and classification include an initial evaluation of swallow-level outcomes and then derivation of a study-level diagnosis based on the Chicago Classification (CC), using a tree-like algorithm. This diagnostic approach to motility disorders using HRM was mirrored with a multi-stage modeling framework developed using a combination of various machine learning approaches. Specifically, the framework includes deep-learning models at the swallow-level stage and feature-based machine learning models at the study-level stage. In the swallow-level stage, three models based on convolutional neural networks (CNNs) were developed to predict swallow type, swallow pressurization, and integrated relaxation pressure (IRP). At the study-level stage, model selection was conducted from families of expert-knowledge-based rule models, XGBoost models, and artificial neural network (ANN) models, with the latter two designed and augmented with motivation from the expert knowledge. A simple model-agnostic strategy of model balancing motivated by Bayesian principles was utilized, which gave rise to model averaging weighted by precision scores. The averaged (blended) models and individual models were compared and evaluated; the best performance on the test dataset is 0.81 in top-1 prediction and 0.92 in top-2 predictions. This is the first artificial-intelligence-style model to automatically predict the CC diagnosis of an HRM study from raw multi-swallow data. Moreover, the proposed modeling framework could be easily extended to multi-modal tasks, such as diagnosis of esophageal patients based on clinical data from both HRM and functional luminal imaging probe (FLIP) panometry.
    Learning Time-Varying Graphs from Online Data. (arXiv:2110.11017v2 [cs.LG] UPDATED)
    This work proposes an algorithmic framework to learn time-varying graphs from online data. The generality offered by the framework renders it model-independent, i.e., it can be theoretically analyzed in its abstract formulation and then instantiated under a variety of model-dependent graph learning problems. This is possible by phrasing (time-varying) graph learning as a composite optimization problem, where different functions regulate different desiderata, e.g., data fidelity, sparsity, or smoothness. Instrumental for the findings is recognizing that the dependence of the majority (if not all) of data-driven graph learning algorithms on the data is exerted through the empirical covariance matrix, which represents a sufficient statistic for the estimation problem. Its user-defined recursive update enables the framework to work in non-stationary environments, while iterative algorithms building on novel time-varying optimization tools explicitly take into account the temporal dynamics, speeding up convergence and implicitly including a temporal regularization of the solution. We specialize the framework to three well-known graph learning models, namely the Gaussian graphical model (GGM), the structural equation model (SEM), and the smoothness-based model (SBM), where we also introduce ad-hoc vectorization schemes for structured matrices (symmetric, hollow, etc.) which are crucial to perform correct gradient computations, besides enabling work in low-dimensional vector spaces and hence easing storage requirements. After discussing the theoretical guarantees of the proposed framework, we corroborate it with extensive numerical tests on synthetic and real data.
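    Since the framework's data dependence enters only through a recursively updated empirical covariance, a small sketch of such an update may help; the exponentially weighted moving average below is one illustrative choice of the user-defined rule, not the paper's prescribed one.

    ```python
    # Minimal sketch of a recursive covariance update: an exponentially weighted
    # moving average (EWMA) of outer products, tracking non-stationary
    # second-order statistics online.
    import numpy as np

    def ewma_covariance_stream(samples, alpha=0.05):
        # alpha trades off adaptivity (high) vs. variance of the estimate (low)
        C = None
        for x in samples:
            outer = np.outer(x, x)
            C = outer if C is None else (1 - alpha) * C + alpha * outer
            yield C  # each C can feed one step of the graph-learning iteration

    rng = np.random.default_rng(0)
    stream = (rng.standard_normal(5) for _ in range(1000))
    for C in ewma_covariance_stream(stream):
        pass
    print(np.round(C, 2))  # roughly the identity for white Gaussian inputs
    ```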
    Generalizing to Unseen Domains: A Survey on Domain Generalization. (arXiv:2103.03097v7 [cs.LG] UPDATED)
    Machine learning systems generally assume that the training and testing distributions are the same; in practice, however, this assumption is often violated, so a key requirement is to develop models that can generalize to unseen distributions. Domain generalization (DG), i.e., out-of-distribution generalization, has attracted increasing interest in recent years. Domain generalization deals with a challenging setting where one or several different but related domain(s) are given, and the goal is to learn a model that can generalize to an unseen test domain. Great progress has been made in this area in recent years. This paper presents the first review of recent advances in this area. First, we provide a formal definition of domain generalization and discuss several related fields. We then thoroughly review the theories related to domain generalization and carefully analyze the theory behind generalization. We categorize recent algorithms into three classes: data manipulation, representation learning, and learning strategy, and present several popular algorithms in detail for each category. Third, we introduce the commonly used datasets, applications, and our open-sourced codebase for fair evaluation. Finally, we summarize existing literature and present some potential research topics for the future.
    Out-of-Distribution Dynamics Detection: RL-Relevant Benchmarks and Results. (arXiv:2107.04982v2 [cs.LG] UPDATED)
    We study the problem of out-of-distribution dynamics (OODD) detection, which involves detecting when the dynamics of a temporal process change compared to the training-distribution dynamics. This is relevant to applications in control, reinforcement learning (RL), and multi-variate time-series, where changes to test time dynamics can impact the performance of learning controllers/predictors in unknown ways. This problem is particularly important in the context of deep RL, where learned controllers often overfit to the training environment. Currently, however, there is a lack of established OODD benchmarks for the types of environments commonly used in RL research. Our first contribution is to design a set of OODD benchmarks derived from common RL environments with varying types and intensities of OODD. Our second contribution is to design a strong OODD baseline approach based on recurrent implicit quantile network (RIQN), which monitors autoregressive prediction errors for OODD detection. In addition to RIQN, we introduce and test three other simpler baselines. Our final contribution is to evaluate our baseline approaches on the benchmarks to provide results for future comparison.
    Towards Continual Knowledge Learning of Language Models. (arXiv:2110.03215v4 [cs.CL] UPDATED)
    Large Language Models (LMs) are known to encode world knowledge in their parameters as they pretrain on a vast amount of web corpus, which is often utilized for performing knowledge-dependent downstream tasks such as question answering, fact-checking, and open dialogue. In real-world scenarios, the world knowledge stored in the LMs can quickly become outdated as the world changes, but it is non-trivial to avoid catastrophic forgetting and reliably acquire new knowledge while preserving invariant knowledge. To push the community towards better maintenance of ever-changing LMs, we formulate a new continual learning (CL) problem called Continual Knowledge Learning (CKL). We construct a new benchmark and metric to quantify the retention of time-invariant world knowledge, the update of outdated knowledge, and the acquisition of new knowledge. We adopt applicable recent methods from literature to create several strong baselines. Through extensive experiments, we find that CKL exhibits unique challenges that are not addressed in previous CL setups, where parameter expansion is necessary to reliably retain and learn knowledge simultaneously. By highlighting the critical causes of knowledge forgetting, we show that CKL is a challenging and important problem that helps us better understand and train ever-changing LMs. The benchmark datasets, evaluation script, and baseline code to reproduce our results are available at https://github.com/joeljang/continual-knowledge-learning.
    Recent Advances on Neural Network Pruning at Initialization. (arXiv:2103.06460v3 [cs.LG] UPDATED)
    Neural network pruning typically removes connections or neurons from a pretrained converged model, while a new pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first survey concentrated on this emerging pruning paradigm. We first introduce a generic formulation of neural network pruning, followed by the major classic pruning topics. Then, as the main body of this paper, a thorough and structured literature review of PaI methods is presented, consisting of two major tracks (sparse training and sparse selection). Finally, we summarize the surge of PaI compared to pruning after training (PaT) and discuss the open problems. Apart from the dedicated literature review, this paper also offers a code base for easy sanity-checking and benchmarking of different PaI methods.
    Quasi-Equivalence of Width and Depth of Neural Networks. (arXiv:2002.02515v7 [cs.LG] UPDATED)
    While classic studies proved that wide networks allow universal approximation, recent research and the successes of deep learning demonstrate the power of deep networks. Based on symmetry considerations, we investigate whether the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks in two aspects. First, we formulate two transforms for mapping an arbitrary ReLU network to a wide network and a deep network, respectively, for either regression or classification, so that essentially the same capability of the original network can be implemented. Then, we replace the mainstream artificial neuron type with a quadratic counterpart, and utilize the factorization and continued fraction representations of the same polynomial function to construct a wide network and a deep network, respectively. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.
    How much human-like visual experience do current self-supervised learning algorithms need in order to achieve human-level object recognition?. (arXiv:2109.11523v3 [cs.CV] UPDATED)
    This paper addresses a fundamental question: how good are our current self-supervised visual representation learning algorithms relative to humans? More concretely, how much "human-like" natural visual experience would these algorithms need in order to reach human-level performance in a complex, realistic visual object recognition task such as ImageNet? Using a scaling experiment, here we estimate that the answer is several orders of magnitude longer than a human lifetime: typically on the order of a million to a billion years of natural visual experience (depending on the algorithm used). We obtain even larger estimates for achieving human-level performance in ImageNet-derived robustness benchmarks. The exact values of these estimates are sensitive to some underlying assumptions, however even in the most optimistic scenarios they remain orders of magnitude larger than a human lifetime. We discuss the main caveats surrounding our estimates and the implications of these surprising results.
    Towards A Measure Of General Machine Intelligence. (arXiv:2109.12075v4 [cs.AI] UPDATED)
    To build general-purpose artificial intelligence systems that can deal with unknown variables across unknown domains, we need benchmarks that measure how well these systems perform on tasks they have never seen before. A prerequisite for this is a measure of a task's generalization difficulty, or how dissimilar it is from the system's prior knowledge and experience. If the skill of an intelligence system in a particular domain is defined as its ability to consistently generate a set of instructions (or programs) to solve tasks in that domain, current benchmarks do not quantitatively measure the efficiency of acquiring new skills, making it possible to brute-force skill acquisition by training with unlimited amounts of data and compute power. With this in mind, we first propose a common language of instruction, a programming language that allows the expression of programs in the form of directed acyclic graphs across a wide variety of real-world domains and computing platforms. Using programs generated in this language, we demonstrate a match-based method to both score performance and calculate the generalization difficulty of any given set of tasks. We use these to define a numeric benchmark called the generalization index, or the g-index, to measure and compare the skill-acquisition efficiency of any intelligence system on a set of real-world tasks. Finally, we evaluate the suitability of some well-known models as general intelligence systems by calculating their g-index scores.
    3D Infomax improves GNNs for Molecular Property Prediction. (arXiv:2110.04126v3 [cs.LG] UPDATED)
    Molecular property prediction is one of the fastest-growing applications of deep learning with critical real-world impacts. Including 3D molecular structure as input to learned models improves their performance for many molecular tasks. However, this information is infeasible to compute at the scale required by several real-world applications. We propose pre-training a model to reason about the geometry of molecules given only their 2D molecular graphs. Using methods from self-supervised learning, we maximize the mutual information between 3D summary vectors and the representations of a Graph Neural Network (GNN) such that they contain latent 3D information. During fine-tuning on molecules with unknown geometry, the GNN still generates implicit 3D information and can use it to improve downstream tasks. We show that 3D pre-training provides significant improvements for a wide range of properties, such as a 22% average MAE reduction on eight quantum mechanical properties. Moreover, the learned representations can be effectively transferred between datasets in different molecular spaces.
    Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey. (arXiv:2101.00734v2 [stat.ML] UPDATED)
    This is a tutorial and survey paper on factor analysis, probabilistic Principal Component Analysis (PCA), variational inference, and the Variational Autoencoder (VAE). These methods, which are tightly related, are dimensionality reduction and generative models. They assume that every data point is generated from or caused by a low-dimensional latent factor. By learning the parameters of the distribution of the latent space, the corresponding low-dimensional factors are found for the sake of dimensionality reduction. Owing to their stochastic and generative behaviour, these models can also be used for generation of new data points in the data space. In this paper, we start with variational inference, where we derive the Evidence Lower Bound (ELBO) and Expectation Maximization (EM) for learning the parameters. Then, we introduce factor analysis, derive its joint and marginal distributions, and work out its EM steps. Probabilistic PCA is then explained as a special case of factor analysis, and its closed-form solutions are derived. Finally, the VAE is explained, where the encoder, decoder, and sampling from the latent space are introduced. Training the VAE using both EM and backpropagation is explained.
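    To make the ELBO concrete, here is a minimal numpy sketch of the single-sample Monte Carlo ELBO estimate used in VAE training, assuming a standard normal prior, a unit-variance Gaussian decoder, and a toy linear decoder; the encoder outputs (mu, log_var) are supplied by hand rather than by a network.

    ```python
    # Single-sample ELBO: reparameterize z = mu + sigma * eps, then
    # ELBO = log p(x|z) - KL(q(z|x) || p(z)), with the KL in closed form.
    import numpy as np

    rng = np.random.default_rng(0)

    def elbo_estimate(x, mu, log_var, decode):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps        # reparameterization trick
        x_hat = decode(z)
        log_px_z = -0.5 * np.sum((x - x_hat) ** 2)  # Gaussian likelihood, unit var
        # closed-form KL( N(mu, sigma^2) || N(0, I) )
        kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
        return log_px_z - kl

    W = rng.standard_normal((2, 10))                # fixed toy linear decoder
    decode = lambda z: z @ W
    x = rng.standard_normal(10)
    print(elbo_estimate(x, mu=np.zeros(2), log_var=np.zeros(2), decode=decode))
    ```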
    FedTune: Automatic Tuning of Federated Learning Hyper-Parameters from System Perspective. (arXiv:2110.03061v4 [cs.LG] UPDATED)
    Federated learning (FL) hyper-parameters significantly affect the training overheads in terms of computation time, transmission time, computation load, and transmission load. However, the current practice of manually selecting FL hyper-parameters puts a high burden on FL practitioners since various applications prefer different training preferences. In this paper, we propose FedTune, an automatic FL hyper-parameter tuning algorithm tailored to applications' diverse system requirements of FL training. FedTune is lightweight and flexible, achieving 8.48%-26.75% improvement for different datasets compared to fixed FL hyper-parameters.
    Weakly-supervised Multi-output Regression via Correlated Gaussian Processes. (arXiv:2002.08412v2 [stat.ML] UPDATED)
    Multi-output regression seeks to borrow strength and leverage commonalities across different but related outputs in order to enhance learning and prediction accuracy. A fundamental assumption is that the output/group membership labels for all observations are known. This assumption is often violated in real applications. For instance, in healthcare datasets, sensitive attributes such as ethnicity are often missing or unreported. To this end, we introduce a weakly-supervised multi-output model based on dependent Gaussian processes. Our approach is able to leverage data without complete group labels or possibly only prior belief on group memberships to enhance accuracy across all outputs. Through intensive simulations and case studies on an Insulin, Testosterone and Bodyfat dataset, we show that our model excels in multi-output settings with missing labels, while being competitive in traditional fully labeled settings. We end by highlighting the possible use of our approach in fair inference and sequential decision-making.
    Effects of Image Size on Deep Learning. (arXiv:2101.11508v3 [cs.CV] UPDATED)
    This paper evaluates the effects of image size on deep learning performance via semantic segmentation of magnetic resonance heart images with U-net for fully automated quantification of myocardial infarction. Both non-extra pixel and extra pixel interpolation algorithms are used to change the size of images in the datasets of interest. Extra class labels in interpolated ground truth segmentation images are removed using thresholding, median filtering, and subtraction strategies. Common class metrics are used to evaluate the quality of U-net semantic segmentation against the ground truth segmentation. An arbitrary threshold, comparison of the sums, and sums of differences between medical experts and fully automated results are the options used to estimate the relationship between medical expert-based quantification and fully automated quantification results.
    MAGMA: Inference and Prediction with Multi-Task Gaussian Processes. (arXiv:2007.10731v2 [stat.CO] UPDATED)
    A novel multi-task Gaussian process (GP) framework is proposed, using a common mean process for sharing information across tasks. In particular, we investigate the problem of time series forecasting, with the objective to improve multiple-step-ahead predictions. The common mean process is defined as a GP for which the hyper-posterior distribution is tractable. Therefore an EM algorithm is derived for handling both hyper-parameter optimisation and hyper-posterior computation. Unlike previous approaches in the literature, the model fully accounts for uncertainty and can handle irregular grids of observations while maintaining explicit formulations, by modelling the mean process in a unified GP framework. Predictive analytical equations are provided, integrating information shared across tasks through a relevant prior mean. This approach greatly improves the predictive performance, even far from observations, and may significantly reduce the computational complexity compared to traditional multi-task GP models. Our overall algorithm is called \textsc{Magma} (standing for Multi tAsk Gaussian processes with common MeAn). The quality of the mean process estimation, predictive performances, and comparisons to alternatives are assessed in various simulated scenarios and on real datasets.
    RevUp: Revise and Update Information Bottleneck for Event Representation. (arXiv:2205.12248v1 [cs.LG])
    In machine learning, latent variables play a key role in capturing the underlying structure of data, but they are often learned without supervision. When we have side knowledge that already has high-level information about the input data, we can use that source to guide latent variables and capture the available background information in a process called "parameter injection." In that regard, we propose a semi-supervised information bottleneck-based model that enables the use of side knowledge, even if it is noisy and imperfect, to direct the learning of discrete latent variables. Fundamentally, we introduce an auxiliary continuous latent variable as a way to reparameterize the model's discrete variables with a light-weight hierarchical structure. With this reparameterization, the model's discrete latent variables are learned to minimize the mutual information between the observed data and optional side knowledge that is not already captured by the new, auxiliary variables. We theoretically show that our approach generalizes an existing method of parameter injection, and we perform an empirical case study of our approach on language-based event modeling. We corroborate our theoretical results with strong empirical experiments, showing that the proposed method outperforms previously proposed approaches on multiple datasets.
    DeepKriging: Spatially Dependent Deep Neural Networks for Spatial Prediction. (arXiv:2007.11972v4 [stat.ML] UPDATED)
    In spatial statistics, a common objective is to predict values of a spatial process at unobserved locations by exploiting spatial dependence. Kriging provides the best linear unbiased predictor using covariance functions and is often associated with Gaussian processes. However, when considering non-linear prediction for non-Gaussian and categorical data, the Kriging prediction is no longer optimal, and the associated variance is often overly optimistic. Although deep neural networks (DNNs) are widely used for general classification and prediction, they have not been studied thoroughly for data with spatial dependence. In this work, we propose a novel DNN structure for spatial prediction, where the spatial dependence is captured by adding an embedding layer of spatial coordinates with basis functions. We show in theory and simulation studies that the proposed DeepKriging method has a direct link to Kriging in the Gaussian case, and it has multiple advantages over Kriging for non-Gaussian and non-stationary data, i.e., it provides non-linear predictions and thus has smaller approximation errors, it does not require operations on covariance matrices and thus is scalable for large datasets, and with sufficiently many hidden neurons, it provides the optimal prediction in terms of model capacity. We further explore the possibility of quantifying prediction uncertainties based on density prediction without assuming any data distribution. Finally, we apply the method to predicting PM2.5 concentrations across the continental United States.
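    The core architectural move in DeepKriging is the spatial embedding layer. The sketch below illustrates that idea under simplifying assumptions: Gaussian radial basis functions on a 5x5 knot grid stand in for the paper's multiresolution basis, and a linear least-squares head stands in for the DNN; knot placement and bandwidth are illustrative choices.

    ```python
    # Embed 2D coordinates with radial basis features, then fit any regressor
    # on the resulting design matrix (optionally concatenated with covariates).
    import numpy as np

    def rbf_embedding(coords, knots, bandwidth=0.2):
        # coords: (n, 2) locations; knots: (k, 2) basis centers
        d2 = ((coords[:, None, :] - knots[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))   # (n, k) basis features

    rng = np.random.default_rng(0)
    coords = rng.random((200, 2))
    gx, gy = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
    knots = np.stack([gx.ravel(), gy.ravel()], axis=1)

    phi = rbf_embedding(coords, knots)              # the "embedding layer"
    y = np.sin(4 * coords[:, 0]) + 0.1 * rng.standard_normal(200)
    w, *_ = np.linalg.lstsq(phi, y, rcond=None)     # linear head in place of a DNN
    print(np.mean((phi @ w - y) ** 2))              # training MSE of the sketch
    ```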
    Risk-Sensitive Reinforcement Learning via Policy Gradient Search. (arXiv:1810.09126v3 [cs.LG] UPDATED)
    The objective in a traditional reinforcement learning (RL) problem is to find a policy that optimizes the expected value of a performance metric such as the infinite-horizon cumulative discounted or long-run average cost/reward. In practice, optimizing the expected value alone may not be satisfactory, in that it may be desirable to incorporate the notion of risk into the optimization problem formulation, either in the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., exponential utility, variance, percentile performance, chance constraints, value at risk (quantile), conditional value-at-risk, prospect theory and its later enhancement, cumulative prospect theory. In this book, we consider risk-sensitive RL in two settings: one where the goal is to find a policy that optimizes the usual expected value objective while ensuring that a risk constraint is satisfied, and the other where the risk measure is the objective. We survey some of the recent work in this area specifically where policy gradient search is the solution approach. In the first risk-sensitive RL setting, we cover popular risk measures based on variance, conditional value-at-risk, and chance constraints, and present a template for policy gradient-based risk-sensitive RL algorithms using a Lagrangian formulation. For the setting where risk is incorporated directly into the objective function, we consider an exponential utility formulation, cumulative prospect theory, and coherent risk measures. This non-exhaustive survey aims to give a flavor of the challenges involved in solving risk-sensitive RL problems using policy gradient methods, as well as outlining some potential future research directions.
    DoorGym: A Scalable Door Opening Environment And Baseline Agent. (arXiv:1908.01887v4 [cs.RO] UPDATED)
    In order to practically implement the door opening task, a policy ought to be robust to a wide distribution of door types and environment settings. Reinforcement Learning (RL) with Domain Randomization (DR) is a promising technique to enforce policy generalization; however, there are only a few accessible training environments that are inherently designed to train agents in domain randomized environments. We introduce DoorGym, an open-source door opening simulation framework designed to utilize domain randomization to train a stable policy. We intend for our environment to lie at the intersection of domain transfer, practical tasks, and realism. We also provide baseline Proximal Policy Optimization and Soft Actor-Critic implementations, which achieve success rates ranging from 0% up to 95% for opening various types of doors in this environment. Moreover, a real-world transfer experiment shows that the trained policy is able to work in the real world. Environment kit available here: https://github.com/PSVL/DoorGym/
    Modular Meta-Learning for Power Control via Random Edge Graph Neural Networks. (arXiv:2108.13178v2 [cs.NI] CROSS LISTED)
    In this paper, we consider the problem of power control for a wireless network with an arbitrarily time-varying topology, including the possible addition or removal of nodes. A data-driven design methodology that leverages graph neural networks (GNNs) is adopted in order to efficiently parametrize the power control policy mapping the channel state information (CSI) to transmit powers. The specific GNN architecture, known as random edge GNN (REGNN), defines a non-linear graph convolutional filter whose spatial weights are tied to the channel coefficients. While prior work assumed a joint training approach whereby the REGNN-based policy is shared across all topologies, this paper targets adaptation of the power control policy based on limited CSI data regarding the current topology. To this end, we propose a novel modular meta-learning technique that enables the efficient optimization of module assignment. While black-box meta-learning optimizes a general-purpose adaptation procedure via (stochastic) gradient descent, modular meta-learning finds a set of reusable modules that can form components of a solution for any new network topology. Numerical results validate the benefits of meta-learning for power control problems over joint training schemes, and demonstrate the advantages of modular meta-learning when data availability is extremely limited.
    EBM Life Cycle: MCMC Strategies for Synthesis, Defense, and Density Modeling. (arXiv:2205.12243v1 [stat.ML])
    This work presents strategies to learn an Energy-Based Model (EBM) according to the desired length of its MCMC sampling trajectories. MCMC trajectories of different lengths correspond to models with different purposes. Our experiments cover three different trajectory magnitudes and learning outcomes: 1) shortrun sampling for image generation; 2) midrun sampling for classifier-agnostic adversarial defense; and 3) longrun sampling for principled modeling of image probability densities. To achieve these outcomes, we introduce three novel methods of MCMC initialization for negative samples used in Maximum Likelihood (ML) learning. With standard network architectures and an unaltered ML objective, our MCMC initialization methods alone enable significant performance gains across the three applications that we investigate. Our results include state-of-the-art FID scores for unnormalized image densities on the CIFAR-10 and ImageNet datasets; state-of-the-art adversarial defense on CIFAR-10 among purification methods and the first EBM defense on ImageNet; and scalable techniques for learning valid probability densities. Code for this project can be found at https://github.com/point0bar1/ebm-life-cycle.
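    Since the paper's three regimes all hinge on how the negative-sample MCMC chains are initialized and how long they run, a minimal sketch of short-run Langevin sampling may help; the quadratic energy below is a toy stand-in for the energy network, and the noise initialization corresponds to the shortrun regime.

    ```python
    # Short-run Langevin MCMC for the negative samples of maximum-likelihood
    # EBM training: x <- x - step * grad E(x) + sqrt(2 * step) * noise.
    import numpy as np

    rng = np.random.default_rng(0)

    def energy(x):
        return 0.5 * np.sum(x ** 2, axis=-1)     # stand-in for a network

    def grad_energy(x):
        return x                                  # gradient of the toy energy

    def langevin(x0, n_steps=100, step=0.01):
        x = x0.copy()
        for _ in range(n_steps):
            x = (x - step * grad_energy(x)
                 + np.sqrt(2 * step) * rng.standard_normal(x.shape))
        return x

    x_init = rng.standard_normal((64, 2)) * 3.0   # noise init (shortrun regime)
    x_neg = langevin(x_init)                       # negatives for the ML gradient
    print(x_neg.std())                             # contracts toward the model's modes
    ```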
    Taming the sign problem of explicitly antisymmetrized neural networks via rough activation functions. (arXiv:2205.12250v1 [cs.LG])
    Explicit antisymmetrization of a two-layer neural network is a potential candidate for a universal function approximator for generic antisymmetric functions, which are ubiquitous in quantum physics. However, this strategy suffers from a sign problem, namely, due to near exact cancellation of positive and negative contributions, the magnitude of the antisymmetrized function may be significantly smaller than that before antisymmetrization. We prove that the severity of the sign problem is directly related to the smoothness of the activation function. For smooth activation functions (e.g., $\tanh$), the sign problem of the explicitly antisymmetrized two-layer neural network deteriorates super-polynomially with respect to the system size. On the other hand, for rough activation functions (e.g., ReLU), the deterioration rate of the sign problem can be tamed to be at most polynomial with respect to the system size. Finally, the cost of a direct implementation of antisymmetrized two-layer neural network scales factorially with respect to the system size. We describe an efficient algorithm for approximate evaluation of such a network, of which the cost scales polynomially with respect to the system size and inverse precision.
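    To make the factorial cost of direct antisymmetrization concrete, here is a minimal sketch that explicitly antisymmetrizes a two-layer network over all n! particle permutations, with random parameters and ReLU as the "rough" activation; it is an illustration of the cost, not the paper's polynomial-cost approximate evaluation.

    ```python
    # Explicitly antisymmetrized two-layer network: sum over all n! permutations
    # with their parities. For n particles this needs n! network evaluations.
    import numpy as np
    from itertools import permutations

    def two_layer(x, W, b, c, act):
        # x: (n, d) particle coordinates, flattened into the first layer
        return c @ act(W @ x.ravel() + b)

    def antisymmetrize(x, W, b, c, act):
        n = x.shape[0]
        total = 0.0
        for perm in permutations(range(n)):
            sign = np.linalg.det(np.eye(n)[list(perm)])  # permutation parity, +/- 1
            total += sign * two_layer(x[list(perm)], W, b, c, act)
        return total

    rng = np.random.default_rng(0)
    n, d, h = 4, 2, 16
    x = rng.standard_normal((n, d))
    W = rng.standard_normal((h, n * d))
    b, c = rng.standard_normal(h), rng.standard_normal(h)
    relu = lambda z: np.maximum(z, 0.0)            # "rough" activation (ReLU)
    print(antisymmetrize(x, W, b, c, relu))        # 4! = 24 network evaluations
    ```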
    Interpretation Quality Score for Measuring the Quality of interpretability methods. (arXiv:2205.12254v1 [cs.CL])
    Machine learning (ML) models have been applied to a wide range of natural language processing (NLP) tasks in recent years. In addition to making accurate decisions, the necessity of understanding how models make their decisions has become apparent in many applications. To that end, many interpretability methods that help explain the decision processes of ML models have been developed. Yet, there currently exists no widely-accepted metric to evaluate the quality of explanations generated by these methods. As a result, there currently is no standard way of measuring to what degree an interpretability method achieves an intended objective. Moreover, there is no accepted standard of performance by which we can compare and rank the current existing interpretability methods. In this paper, we propose a novel metric for quantifying the quality of explanations generated by interpretability methods. We compute the metric on three NLP tasks using six interpretability methods and present our results.
    Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Query Attacks. (arXiv:2205.12134v1 [cs.LG])
    The score-based query attacks (SQAs) pose practical threats to deep neural networks by crafting adversarial perturbations within dozens of queries, using only the model's output scores. Nonetheless, we note that if the loss trend of the outputs is slightly perturbed, SQAs can be easily misled and thereby become much less effective. Following this idea, we propose a novel defense, namely Adversarial Attack on Attackers (AAA), to confound SQAs towards incorrect attack directions by slightly modifying the output logits. In this way, (1) SQAs are prevented regardless of the model's worst-case robustness; (2) the original model predictions are hardly changed, i.e., there is no degradation in clean accuracy; (3) the calibration of confidence scores can be improved simultaneously. Extensive experiments are provided to verify the above advantages. For example, by setting $\ell_\infty=8/255$ on CIFAR-10, our proposed AAA helps WideResNet-28 secure $80.59\%$ accuracy under the Square attack ($2500$ queries), while the best prior defense (i.e., adversarial training) only attains $67.44\%$. Since AAA attacks SQAs' general greedy strategy, these advantages of AAA over 8 defenses can be consistently observed on 8 CIFAR-10/ImageNet models under 6 SQAs, using different attack targets and bounds. Moreover, AAA calibrates better without hurting the accuracy. Our code will be released.
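    A hedged sketch of the logit post-processing idea follows: keep the predicted class (argmax) unchanged while remapping the top-2 margin so its local trend is reversed, confounding a greedy query attack. The periodic reversed-ramp remapping and its parameters are illustrative choices of ours, not the paper's exact function, and the calibration component of AAA is omitted.

    ```python
    # Post-process logits so that small decreases in the true margin *increase*
    # the displayed margin within each period, while the argmax is preserved.
    import numpy as np

    def aaa_style_postprocess(logits, period=1.0):
        z = logits.astype(float).copy()
        order = np.argsort(z)
        second, top = order[-2], order[-1]
        margin = z[top] - z[second]                 # always > 0 for distinct logits
        k = np.floor(margin / period)
        frac = margin / period - k
        # reversed ramp within each period: locally decreasing in the true margin
        new_margin = period * (k + 1.0 - frac)      # stays strictly positive
        z[top] = z[second] + new_margin             # argmax unchanged
        return z

    logits = np.array([0.1, 2.0, 1.4])
    print(np.argmax(logits), np.argmax(aaa_style_postprocess(logits)))  # same class
    ```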
    Asynchronous Neural Networks for Learning in Graphs. (arXiv:2205.12245v1 [cs.LG])
    This paper studies asynchronous message passing (AMP), a new paradigm for applying neural network based learning to graphs. Existing graph neural networks use the synchronous distributed computing model and aggregate their neighbors in each round, which causes problems such as oversmoothing and limits their expressiveness. On the other hand, AMP is based on the asynchronous model, where nodes react to messages of their neighbors individually. We prove that (i) AMP can simulate synchronous GNNs and that (ii) AMP can theoretically distinguish any pair of graphs. We experimentally validate AMP's expressiveness. Further, we show that AMP might be better suited to propagate messages over large distances in graphs and performs well on several graph classification benchmarks.
    One-Pixel Shortcut: on the Learning Preference of Deep Neural Networks. (arXiv:2205.12141v1 [cs.LG])
    Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs. Error-minimizing noise, which is injected to clean data, is one of the most successful methods for preventing DNNs from giving correct predictions on incoming new data. Nonetheless, under specific training strategies such as adversarial training, the unlearnability of error-minimizing noise will severely degrade. In addition, the transferability of error-minimizing noise is inherently limited by the mismatch between the generator model and the targeted learner model. In this paper, we investigate the mechanism of unlearnable examples and propose a novel model-free method, named \emph{One-Pixel Shortcut}, which only perturbs a single pixel of each image and makes the dataset unlearnable. Our method needs much less computational cost and obtains stronger transferability and thus can protect data from a wide range of different models. Based on this, we further introduce the first unlearnable dataset called CIFAR-10-S, which is indistinguishable from normal CIFAR-10 by human observers and can serve as a benchmark for different models or training strategies to evaluate their abilities to extract critical features from the disturbance of non-semantic representations. The original error-minimizing ULEs will lose efficiency under adversarial training, where the model can get over 83\% clean test accuracy. Meanwhile, even if adversarial training and strong data augmentation like RandAugment are applied together, the model trained on CIFAR-10-S cannot get over 50\% clean test accuracy.
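    The mechanics of the one-pixel perturbation are easy to sketch: each image has exactly one pixel set to a class-dependent (position, value) pair. The random per-class choice below is a simplification of the paper's selection procedure, meant only to show the data transformation.

    ```python
    # Perturb exactly one pixel per image; the pixel position and value are
    # shared within each class so the single pixel becomes a shortcut feature.
    import numpy as np

    def one_pixel_shortcut(images, labels, n_classes, seed=0):
        imgs = images.copy()
        h, w = images.shape[1:3]
        rng = np.random.default_rng(seed)
        for c in range(n_classes):
            i, j = rng.integers(h), rng.integers(w)   # fixed position per class
            value = 1.0 if c % 2 == 0 else 0.0        # fixed value per class
            imgs[labels == c, i, j] = value
        return imgs

    images = np.random.default_rng(1).random((100, 32, 32))
    labels = np.random.default_rng(2).integers(0, 10, size=100)
    poisoned = one_pixel_shortcut(images, labels, n_classes=10)
    # total change per image is at most one pixel's worth
    print(np.abs(poisoned - images).sum(axis=(1, 2)).max())
    ```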
    Gacs-Korner Common Information Variational Autoencoder. (arXiv:2205.12239v1 [cs.LG])
    We propose a notion of common information that allows one to quantify and separate the information that is shared between two random variables from the information that is unique to each. Our notion of common information is a variational relaxation of the G\'acs-K\"orner common information, which we recover as a special case, but is more amenable to optimization and can be approximated empirically using samples from the underlying distribution. We then provide a method to partition and quantify the common and unique information using a simple modification of a traditional variational auto-encoder. Empirically, we demonstrate that our formulation allows us to learn semantically meaningful common and unique factors of variation even on high-dimensional data such as images and videos. Moreover, on datasets where ground-truth latent factors are known, we show that we can accurately quantify the common information between the random variables. Additionally, we show that the auto-encoder that we learn recovers semantically meaningful disentangled factors of variation, even though we do not explicitly optimize for it.
    Federated singular value decomposition for high dimensional data. (arXiv:2205.12109v1 [cs.LG])
    Federated learning (FL) is emerging as a privacy-aware alternative to classical cloud-based machine learning. In FL, the sensitive data remains in data silos and only aggregated parameters are exchanged. Hospitals and research institutions which are not willing to share their data can join a federated study without breaching confidentiality. In addition to the extreme sensitivity of biomedical data, the high dimensionality poses a challenge in the context of federated genome-wide association studies (GWAS). In this article, we present a federated singular value decomposition (SVD) algorithm, suitable for the privacy-related and computational requirements of GWAS. Notably, the algorithm has a transmission cost independent of the number of samples and only weakly dependent on the number of features, because the singular vectors associated with the samples are never exchanged, and the vectors associated with the features are exchanged only for a fixed number of iterations. Although motivated by GWAS, the algorithm is generically applicable to both horizontally and vertically partitioned data.
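    The transmission-cost argument can be illustrated with a federated power iteration for the top right singular vector, with samples partitioned across silos: each silo computes its sample-side vector u_i = X_i v locally and transmits only the d-dimensional aggregate X_i^T u_i. This is a sketch of the principle, not the paper's full SVD algorithm.

    ```python
    # Federated power iteration on the aggregated Gram matrix sum_i X_i^T X_i;
    # the sample-side vectors u_i = X_i v never leave their silo.
    import numpy as np

    rng = np.random.default_rng(0)
    silos = [rng.standard_normal((n_i, 8)) for n_i in (50, 30, 70)]  # local data

    v = rng.standard_normal(8)
    v /= np.linalg.norm(v)
    for _ in range(50):
        # each silo returns only the d-dimensional vector X_i^T (X_i v)
        agg = sum(X.T @ (X @ v) for X in silos)
        v = agg / np.linalg.norm(agg)

    # sanity check against the centralized SVD (possible here only because
    # this is a simulation; in a real federation X_full never exists)
    X_full = np.vstack(silos)
    _, _, Vt = np.linalg.svd(X_full, full_matrices=False)
    print(abs(v @ Vt[0]))  # close to 1: matches the centralized top singular vector
    ```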
    Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization. (arXiv:2205.12191v1 [cs.CL])
    Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, we observe that these models exhibit poor out-of-distribution (OOD) generalization on the task of VQA. To better understand the underlying causes of poor generalization, we comprehensively investigate performance of two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also argue that in most cases generative models are less susceptible to shifts in data distribution, while frequently performing better on our tested benchmarks. Moreover, we find that multimodal pretraining improves OOD performance in most settings. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
    Individual Topology Structure of Eye Movement Trajectories. (arXiv:2205.10667v2 [cs.CV] UPDATED)
    Traditionally, extracting patterns from eye movement data relies on statistics of different macro-events such as fixations and saccades. This requires an additional preprocessing step to separate the eye movement subtypes, often with a number of parameters on which the classification results depend. Besides that, definitions of such macro events are formulated in different ways by different researchers. We propose applying a new class of features to the quantitative analysis of the structure of personal eye movement trajectories. This new class of features, based on algebraic topology, allows extracting patterns from different modalities of gaze, such as time series of coordinates and amplitudes, heatmaps, and point clouds, in a unified way at all scales from micro to macro. We experimentally demonstrate the competitiveness of the new class of features with the traditional ones, as well as their significant synergy when used together, on the person authentication task on a recently published eye movement trajectories dataset.
    D$^\text{2}$UF: Deep Coded Aperture Design and Unrolling Algorithm for Compressive Spectral Image Fusion. (arXiv:2205.12158v1 [eess.IV])
    Compressive spectral imaging (CSI) has attracted significant attention since it employs synthetic apertures to codify spatial and spectral information, sensing only 2D projections of the 3D spectral image. However, these optical architectures suffer from a trade-off between the spatial and spectral resolution of the reconstructed image due to technology limitations. To overcome this issue, compressive spectral image fusion (CSIF) employs the projected measurements of two CSI architectures with different resolutions to estimate a high-spatial, high-spectral resolution image. This work presents the fusion of the compressive measurements of a low-spatial, high-spectral resolution coded aperture snapshot spectral imager (CASSI) architecture and a high-spatial, low-spectral resolution multispectral color filter array (MCFA) system. Unlike previous CSIF works, this paper proposes joint optimization of the sensing architectures and a reconstruction network in an end-to-end (E2E) manner. The trainable optical parameters are the coded aperture (CA) in the CASSI and the colored coded aperture in the MCFA system; a sigmoid activation function and a regularization function are employed to encourage binary values on the trainable variables for implementation purposes. Additionally, an unrolling-based network inspired by the alternating direction method of multipliers (ADMM) optimization is formulated to jointly address the reconstruction step and the design of the acquisition systems. Finally, a spatial-spectral inspired loss function is employed at the end of each unrolling layer to increase the convergence of the unrolling network. The proposed method outperforms previous CSIF methods, and experimental results validate the method with real measurements.
    Deeper vs Wider: A Revisit of Transformer Configuration. (arXiv:2205.10505v2 [cs.LG] UPDATED)
    Transformer-based models have delivered impressive results on many tasks, particularly vision and language tasks. In many model training situations, conventional configurations are typically adopted. For example, we often set the base model with hidden dimension (i.e., model width) 768 and number of transformer layers (i.e., model depth) 12. In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that the masked autoencoder is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, an idea of using deeper and narrower transformer configurations for masked autoencoder training. On ImageNet, with such a simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy and outperforms SoTA models like MAE and BEiT. On language tasks, the re-designed model outperforms BERT with the default setting by 1.1 points on average on the GLUE datasets.
    Bias Discovery in Machine Learning Models for Mental Health. (arXiv:2205.12093v1 [cs.LG])
    Fairness and bias are crucial concepts in artificial intelligence, yet they are relatively ignored in machine learning applications in clinical psychiatry. We computed fairness metrics and present bias mitigation strategies using a model trained on clinical mental health data. We collected structured data related to the admission, diagnosis, and treatment of patients in the psychiatry department of the University Medical Center Utrecht. We trained a machine learning model to predict future administrations of benzodiazepines on the basis of past data. We found that gender plays an unexpected role in the predictions; this constitutes bias. Using the AI Fairness 360 package, we implemented reweighing and discrimination-aware regularization as bias mitigation strategies, and we explored their implications for model performance. This is the first application of bias exploration and mitigation in a machine learning model trained on real clinical psychiatry data.
    Inference of a Rumor's Source in the Independent Cascade Model. (arXiv:2205.12125v1 [cs.SI])
    We consider the so-called Independent Cascade Model for rumor spreading or epidemic processes popularized by Kempe et al.\ [2003]. In this model, a small subset of nodes from a network are the source of a rumor. In discrete time steps, each informed node "infects" each of its uninformed neighbors with probability $p$. While many facets of this process are studied in the literature, less is known about the inference problem: given a number of infected nodes in a network, can we learn the source of the rumor? In the context of epidemiology this problem is often referred to as the patient zero problem. It belongs to a broader class of problems where the goal is to infer parameters of the underlying spreading model, see, e.g., Lokhov [NeurIPS'16] or Mastakouri et al. [NeurIPS'20]. In this work we present a maximum likelihood estimator for the rumor's source, given a snapshot of the process in terms of a set of active nodes $X$ after $t$ steps. Our results show that, for cycle-free graphs, the likelihood estimator undergoes a non-trivial phase transition as a function of $t$. We provide a rigorous analysis for two prominent classes of acyclic networks, namely $d$-regular trees and Galton-Watson trees, and verify empirically that our heuristics work well in various general networks.
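    A brute-force Monte Carlo sketch of the maximum likelihood source estimator follows, on a small path graph: for each candidate source, estimate the probability of observing the given active set after $t$ steps of the Independent Cascade process, then pick the argmax. The paper's contribution is the analysis of this estimator on trees; the implementation below is only illustrative.

    ```python
    # Independent Cascade simulation plus a Monte Carlo likelihood estimate
    # of P(observed active set | source), maximized over candidate sources.
    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_ic(adj, source, p, t):
        active, frontier = {source}, {source}
        for _ in range(t):
            new = set()
            for u in frontier:                 # only newly active nodes infect
                for v in np.flatnonzero(adj[u]):
                    if v not in active and rng.random() < p:
                        new.add(int(v))
            active |= new
            frontier = new
        return active

    def mle_source(adj, observed, p, t, n_sims=2000):
        scores = np.array([
            np.mean([simulate_ic(adj, s, p, t) == observed for _ in range(n_sims)])
            for s in range(adj.shape[0])
        ])
        return int(np.argmax(scores)), scores

    # path graph 0-1-2-3-4, rumor started at node 2
    adj = np.zeros((5, 5), dtype=int)
    for u, w in [(0, 1), (1, 2), (2, 3), (3, 4)]:
        adj[u, w] = adj[w, u] = 1
    observed = simulate_ic(adj, source=2, p=0.7, t=2)
    print(mle_source(adj, observed, p=0.7, t=2)[0], sorted(observed))
    ```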
    Phased Progressive Learning with Coupling-Regulation-Imbalance Loss for Imbalanced Classification. (arXiv:2205.12117v1 [cs.LG])
    Deep neural networks generally perform poorly with datasets that suffer from quantity imbalance and classification-difficulty imbalance between different classes. In order to alleviate the problem of dataset bias or domain shift in existing two-stage approaches, a phased progressive learning schedule is proposed that smoothly transfers the training emphasis from representation learning to upper classifier training. This approach is more effective on datasets with more severe imbalances or smaller scales. A coupling-regulation-imbalance loss function is designed, coupling a correction term, Focal loss, and LDAM loss. The coupling-regulation-imbalance loss can better deal with quantity imbalance and outliers, while regulating the focus of attention on samples with a variety of classification difficulties. Excellent results were achieved on multiple benchmark datasets using these approaches, which can be easily generalized to other imbalanced classification models. Our code will be open-sourced soon.
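    For reference, here is a hedged numpy sketch of coupling Focal loss and LDAM loss for a single sample. The correction term and the exact coupling weights are the paper's contribution; the simple convex combination and the `alpha` knob below are illustrative stand-ins.

    ```python
    # Focal loss down-weights easy samples; LDAM enlarges minority-class margins
    # via m_c = C / n_c^{1/4}. A convex combination stands in for the paper's
    # coupling-regulation-imbalance loss.
    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def focal_loss(logits, y, gamma=2.0):
        p = softmax(logits)[y]
        return -((1.0 - p) ** gamma) * np.log(p)

    def ldam_loss(logits, y, class_counts, C=0.5):
        margins = C / np.asarray(class_counts, dtype=float) ** 0.25
        z = logits.copy()
        z[y] -= margins[y]                  # subtract the class margin before softmax
        return -np.log(softmax(z)[y])

    def coupled_loss(logits, y, class_counts, alpha=0.5):
        return (alpha * focal_loss(logits, y)
                + (1 - alpha) * ldam_loss(logits, y, class_counts))

    logits = np.array([2.0, 0.5, -1.0])
    print(coupled_loss(logits, y=2, class_counts=[5000, 500, 50]))
    ```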
    Not too little, not too much: a theoretical analysis of graph (over)smoothing. (arXiv:2205.12156v1 [stat.ML])
    We analyze graph smoothing with \emph{mean aggregation}, where each node successively receives the average of the features of its neighbors. Indeed, it has quickly been observed that Graph Neural Networks (GNNs), which generally follow some variant of Message-Passing (MP) with repeated aggregation, may be subject to the \emph{oversmoothing} phenomenon: by performing too many rounds of MP, the node features tend to converge to a non-informative limit. In the case of mean aggregation, for connected graphs, the node features become constant across the whole graph. At the other end of the spectrum, it is intuitively obvious that \emph{some} MP rounds are necessary, but existing analyses do not exhibit both phenomena at once: beneficial ``finite'' smoothing and oversmoothing in the limit. In this paper, we consider simplified linear GNNs, and rigorously analyze two examples for which a finite number of mean aggregation steps provably improves the learning performance, before oversmoothing kicks in. We consider a latent space random graph model, where node features are partial observations of the latent variables and the graph contains pairwise relationships between them. We show that graph smoothing restores some of the lost information, up to a certain point, through two phenomena: graph smoothing shrinks non-principal directions in the data faster than principal ones, which is useful for regression, and shrinks nodes within communities faster than it collapses communities together, which improves classification.
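    The mean-aggregation operator and its constant limit are easy to see numerically; the sketch below iterates $X \leftarrow D^{-1} A X$ on a random graph (with self-loops, so each node stays in its own average) and shows the feature variance across nodes vanishing, which is exactly the oversmoothing limit described above.

    ```python
    # Repeated mean aggregation on a random undirected graph: on a connected
    # graph the node features converge to a constant vector across nodes.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    A = (rng.random((n, n)) < 0.3).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                     # random undirected graph (almost surely connected)
    A += np.eye(n)                  # self-loops keep each node in its own average
    D_inv = np.diag(1.0 / A.sum(1))

    X = rng.standard_normal((n, 3))
    for _ in range(50):
        X = D_inv @ A @ X           # one round of mean aggregation
    print(X.std(axis=0))            # near zero: features (almost) constant across nodes
    ```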
    EXPANSE: A Deep Continual / Progressive Learning System for Deep Transfer Learning. (arXiv:2205.10356v2 [cs.LG] UPDATED)
    Deep transfer learning (DTL) techniques try to tackle the limitations of deep learning, namely the dependency on extensive training data and the training costs, by reusing obtained knowledge. However, current DTL techniques suffer from either the catastrophic forgetting dilemma (losing previously obtained knowledge) when fine-tuning pre-trained models, or overly biased pre-trained models (harder to adapt to target data) when freezing a part of the pre-trained model. Progressive learning, a sub-category of DTL, reduces the effect of the overly biased model in the case of freezing earlier layers by adding a new layer to the end of a frozen pre-trained model. Even though it has been successful in many cases, it cannot yet handle distant source and target data. We propose a new continual/progressive learning approach for deep transfer learning to tackle these limitations. To avoid both the catastrophic forgetting and overly-biased-model problems, we expand the pre-trained model by expanding pre-trained layers (adding new nodes to each layer) in the model instead of only adding new layers. Hence the method is named EXPANSE. Our experimental results confirm that we can tackle distant source and target data using this technique. At the same time, the final model remains valid on the source data, yielding a promising deep continual learning approach. Moreover, we offer a new way of training deep learning models inspired by the human education system. We term this two-step training: learning the basics first, then adding complexities and uncertainties. The evaluation implies that the two-step training extracts more meaningful features and a finer basin on the error surface, since it can achieve better accuracy in comparison to regular training. EXPANSE (model expansion and two-step training) is a systematic continual learning approach applicable to different problems and DL models.
    SepIt Approaching a Single Channel Speech Separation Bound. (arXiv:2205.11801v1 [eess.AS])
    We present an upper bound for the Single Channel Speech Separation task, which is based on an assumption regarding the nature of short segments of speech. Using the bound, we are able to show that while recent methods have made significant progress for a few speakers, there is room for improvement for five and ten speakers. We then introduce a deep neural network, SepIt, that iteratively improves the different speakers' estimations. At test time, SepIt has a varying number of iterations per test sample, based on a mutual information criterion that arises from our analysis. In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
    Telling Stories from Computational Notebooks: AI-Assisted Presentation Slides Creation for Presenting Data Science Work. (arXiv:2203.11085v2 [cs.HC] UPDATED)
    Creating presentation slides is a critical but time-consuming task for data scientists. While researchers have proposed many AI techniques to lift data scientists' burden on data preparation and model selection, few have targeted the presentation creation task. Based on the needs identified from a formative study, this paper presents NB2Slides, an AI system that helps users compose presentations of their data science work. NB2Slides uses deep learning methods as well as example-based prompts to generate slides from computational notebooks, and takes users' input (e.g., audience background) to structure the slides. NB2Slides also provides an interactive visualization that links the slides with the notebook to help users further edit the slides. A follow-up user evaluation with 12 data scientists shows that participants believed NB2Slides can improve efficiency and reduce the complexity of creating slides. Yet, participants questioned the future of full automation and suggested a human-AI collaboration paradigm.
    Deep Reinforcement Learning for Multi-class Imbalanced Training. (arXiv:2205.12070v1 [cs.LG])
    With the rapid growth of memory and computing power, datasets are becoming increasingly complex and imbalanced. This is especially severe in the context of clinical data, where there may be one rare event for many cases in the majority class. We introduce an imbalanced classification framework, based on reinforcement learning, for training extremely imbalanced data sets, and extend it for use in multi-class settings. We combine dueling and double deep Q-learning architectures, and formulate a custom reward function and episode-training procedure, specifically with the added capability of handling multi-class imbalanced training. Using real-world clinical case studies, we demonstrate that our proposed framework outperforms current state-of-the-art imbalanced learning methods, achieving more fair and balanced classification, while also significantly improving the prediction of minority classes.
    RuMedBench: A Russian Medical Language Understanding Benchmark. (arXiv:2201.06499v2 [cs.CL] UPDATED)
    The paper describes the open Russian medical language understanding benchmark covering several task types (classification, question answering, natural language inference, named entity recognition) on a number of novel text sets. Given the sensitive nature of the data in healthcare, such a benchmark partially addresses the absence of Russian medical datasets. We prepare the unified format labeling, data split, and evaluation metrics for new tasks. The remaining tasks are from existing datasets with a few modifications. A single-number metric expresses a model's ability to cope with the benchmark. Moreover, we implement several baseline models, from simple ones to neural networks with transformer architecture, and release the code. Expectedly, the more advanced models yield better performance, but even a simple model is enough for a decent result in some tasks. Furthermore, for all tasks, we provide a human evaluation. Interestingly, the models outperform humans in the large-scale classification tasks. However, the advantage of natural intelligence remains in the tasks requiring more knowledge and reasoning.
    KQGC: Knowledge Graph Embedding with Smoothing Effects of Graph Convolutions for Recommendation. (arXiv:2205.12102v1 [cs.IR])
    Leveraging graphs on recommender systems has gained popularity with the development of graph representation learning (GRL). In particular, knowledge graph embedding (KGE) and graph neural networks (GNNs) are representative GRL approaches, which have achieved state-of-the-art performance on several recommendation tasks. Furthermore, the combination of KGE and GNNs (KG-GNNs) has been explored and found effective in much of the academic literature. One of the main characteristics of GNNs is their ability to retain structural properties among neighbors in the resulting dense representation, which is usually coined as smoothing. The smoothing is especially desirable in the presence of homophilic graphs, such as the ones we find in recommender systems. In this paper, we propose a new model for recommender systems named Knowledge Query-based Graph Convolution (KQGC). In contrast to existing KG-GNNs, KQGC focuses on the smoothing, and leverages a simple linear graph convolution for smoothing KGE. A pre-trained KGE is fed into KQGC, and it is smoothed by aggregating neighbor knowledge queries, which allows entity embeddings to be aligned on appropriate vector points for smoothing KGE effectively. We apply the proposed KQGC to a recommendation task that aims to identify prospective users for specific products. Extensive experiments on a real E-commerce dataset demonstrate the effectiveness of KQGC.
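    As a rough sketch of what a "simple linear graph convolution for smoothing KGE" can look like (our reading of the abstract, not the paper's exact operator): each pretrained entity embedding is repeatedly mixed with the average of its neighbors' embeddings.

        import numpy as np

        def smooth_embeddings(E, A, steps=2, alpha=0.5):
            # E: (n, d) pretrained KGE matrix; A: (n, n) adjacency matrix.
            deg = A.sum(axis=1, keepdims=True)
            P = A / np.maximum(deg, 1)               # row-normalized adjacency
            for _ in range(steps):
                E = (1 - alpha) * E + alpha * (P @ E)  # linear smoothing step
            return E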
    DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning. (arXiv:2204.08499v2 [cs.LG] UPDATED)
    Coreset selection, which aims to select a subset of the most informative training samples, is a long-standing learning problem that can benefit many downstream tasks such as data-efficient learning, continual learning, neural architecture search, active learning, etc. However, many existing coreset selection methods are not designed for deep learning, and may incur high complexity and poor generalization performance. In addition, the recently proposed methods are evaluated on models, datasets, and settings of different complexities. To advance the research of coreset selection in deep learning, we contribute a comprehensive code library, namely DeepCore, and provide an empirical study of popular coreset selection methods on the CIFAR10 and ImageNet datasets. Extensive experiments on CIFAR10 and ImageNet verify that, although various methods have advantages in certain experiment settings, random selection is still a strong baseline.
    Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replicability Study). (arXiv:2205.00664v2 [cs.LG] UPDATED)
    Test Input Prioritizers (TIP) for Deep Neural Networks (DNN) are an important technique to handle the typically very large test datasets efficiently, saving computation and labeling costs. This is particularly true for large-scale, deployed systems, where inputs observed in production are recorded to serve as potential test or training data for the next versions of the system. Feng et al. propose DeepGini, a very fast and simple TIP, and show that it outperforms more elaborate techniques such as neuron and surprise coverage. In a large-scale study (4 case studies, 8 test datasets, 32'200 trained models) we verify their findings. However, we also find that other comparable or even simpler baselines from the field of uncertainty quantification, such as the predicted softmax likelihood or the entropy of the predicted softmax likelihoods, perform as well as DeepGini.
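    Both scores are cheap one-liners over the predicted softmax distribution: DeepGini is the Gini impurity 1 - sum_i p_i^2, and the entropy baseline is Shannon entropy. A minimal sketch (prioritization tests the highest-scoring inputs first):

        import numpy as np

        def deepgini(probs):
            # Gini impurity of the softmax output; higher = more uncertain.
            return 1.0 - np.sum(probs ** 2, axis=-1)

        def softmax_entropy(probs, eps=1e-12):
            # Shannon entropy baseline from uncertainty quantification.
            return -np.sum(probs * np.log(probs + eps), axis=-1)

        probs = np.array([[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]])
        priority = np.argsort(-deepgini(probs))  # most uncertain inputs first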
    Selecting Continuous Life-Like Cellular Automata for Halting Unpredictability: Evolving for Abiogenesis. (arXiv:2204.07541v2 [cs.NE] UPDATED)
    Substantial efforts have been applied to engineer cellular automata (CA) with desired emergent properties, such as supporting gliders. Recent work in continuous CA has generated a wide variety of compelling bioreminiscent patterns, and the expansion of CA research into continuously-valued domains, multiple channels, and higher dimensions complicates their study. In this work we devise a strategy for evolving CA and CA patterns in two steps, based on the simple idea that CA are likely to be complex and computationally capable if they support both patterns that grow indefinitely and patterns that vanish completely, while it is difficult to predict in advance which a given pattern will do. The second part of our strategy evolves patterns by selecting for mobility and conservation of mean cell value. We validate our pattern evolution method by re-discovering gliders in 17 of 17 Lenia CA, and also report 4 new evolved CA and 1 randomly evolved CA that support novel evolved glider patterns. The CA reported here share neighborhood kernels with previously described Lenia CA, but exhibit a wider range of typical dynamics than their Lenia counterparts. Code for evolving continuous CA is made available under an MIT License (https://github.com/rivesunder/yuca).
    Learning Stabilizing Policies in Stochastic Control Systems. (arXiv:2205.11991v1 [cs.LG])
    In this work, we address the problem of learning provably stable neural network policies for stochastic control systems. While recent work has demonstrated the feasibility of certifying given policies using martingale theory, the problem of how to learn such policies has been little explored. Here, we study the effectiveness of jointly learning a policy together with a martingale certificate that proves its stability using a single learning algorithm. We observe that the joint optimization problem becomes easily stuck in local minima when starting from a randomly initialized policy. Our results suggest that some form of pre-training of the policy is required for the joint optimization to repair and verify the policy successfully.
    Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations. (arXiv:2204.02683v2 [cs.LG] UPDATED)
    Contrastive learning is a highly effective method for learning representations from unlabeled data. Recent works show that contrastive representations can transfer across domains, leading to simple state-of-the-art algorithms for unsupervised domain adaptation. In particular, a linear classifier trained to separate the representations on the source domain can also predict classes on the target domain accurately, even though the representations of the two domains are far from each other. We refer to this phenomenon as linear transferability. This paper analyzes when and why contrastive representations exhibit linear transferability in a general unsupervised domain adaptation setting. We prove that linear transferability can occur when data from the same class in different domains (e.g., photo dogs and cartoon dogs) are more related to each other than data from different classes in different domains (e.g., photo dogs and cartoon cats) are. Our analyses are in a realistic regime where the source and target domains can have unbounded density ratios and be weakly related, and they have distant representations across domains.
    Highly Accurate FMRI ADHD Classification using time distributed multi modal 3D CNNs. (arXiv:2205.11993v1 [cs.LG])
    This work proposes an algorithm for fMRI data analysis for the classification of ADHD disorders. There have been several breakthroughs in the analysis of fMRI via 3D convolutional neural networks (CNNs). With these new techniques it is possible to preserve the 3D spatial structure of fMRI data. Additionally, there have been recent advances in the use of 3D generative adversarial networks (GANs) for the generation of normal MRI data. This work utilizes multi-modal 3D CNNs with data augmentation from a 3D GAN for ADHD prediction from fMRI. By leveraging a 3D GAN it would be possible to use deepfake data to enhance the accuracy of 3D CNN classification of brain disorders. We compare a time-distributed single-modal 3D CNN model for classification with the modified multi-modal model that also incorporates MRI data.
    PatchNR: Learning from Small Data by Patch Normalizing Flow Regularization. (arXiv:2205.12021v1 [cs.LG])
    Learning neural networks using only a small amount of data is an important ongoing research topic with tremendous potential for applications. In this paper, we introduce a regularizer for the variational modeling of inverse problems in imaging based on normalizing flows. Our regularizer, called patchNR, involves a normalizing flow learned on patches of very few images. The subsequent reconstruction method is completely unsupervised and the same regularizer can be used for different forward operators acting on the same class of images. By investigating the distribution of patches versus those of the whole image class, we prove that our variational model is indeed a MAP approach. Our model can be generalized to conditional patchNRs, if additional supervised information is available. Numerical examples for low-dose CT, limited-angle CT and superresolution of material images demonstrate that our method provides high quality results among unsupervised methods, while requiring only a few images.
    Improving Human Image Synthesis with Residual Fast Fourier Transformation and Wasserstein Distance. (arXiv:2205.12022v1 [cs.CV])
    With the rapid development of the Metaverse, virtual humans have emerged, and human image synthesis and editing techniques, such as pose transfer, have recently become popular. Most of the existing techniques rely on GANs, which can generate good human images even with large variations and occlusions. But to the best of our knowledge, the existing state-of-the-art methods still have the following problems: first, the rendering of the synthetic image is not realistic, with some regions rendered poorly; second, GAN training is unstable and slow to converge, suffering from issues such as mode collapse. We propose several methods to address these two problems. To improve the rendering, we use the Residual Fast Fourier Transform Block to replace the traditional Residual Block. Then, spectral normalization and Wasserstein distance are used to improve the speed and stability of GAN training. Experiments demonstrate that the methods we offer are effective at solving the problems listed above, and we achieve state-of-the-art scores in LPIPS and PSNR.
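    A sketch of a frequency-domain residual block in the spirit of the described replacement (layer sizes, activation, and normalization are our assumptions, not the paper's exact block): transform to the spectrum with a real FFT, apply a 1x1 convolution over stacked real/imaginary channels, transform back, and add the skip connection.

        import torch
        import torch.nn as nn

        class ResidualFFTBlock(nn.Module):
            def __init__(self, channels):
                super().__init__()
                # 1x1 conv acts on real and imaginary parts stacked as channels.
                self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
                self.relu = nn.ReLU(inplace=True)

            def forward(self, x):
                b, c, h, w = x.shape
                freq = torch.fft.rfft2(x, norm="ortho")
                f = torch.cat([freq.real, freq.imag], dim=1)
                f = self.relu(self.conv(f))
                real, imag = f.chunk(2, dim=1)
                branch = torch.fft.irfft2(torch.complex(real, imag),
                                          s=(h, w), norm="ortho")
                return x + branch   # residual connection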
    FedEntropy: Efficient Device Grouping for Federated Learning Using Maximum Entropy Judgment. (arXiv:2205.12038v1 [cs.LG])
    Along with the popularity of Artificial Intelligence (AI) and the Internet-of-Things (IoT), Federated Learning (FL) has attracted steadily increasing attention as a promising distributed machine learning paradigm, which enables the training of a central model across numerous decentralized devices without exposing their private data. However, due to the biased data distributions on the involved devices, FL inherently suffers from low classification accuracy in non-IID scenarios. Although various device grouping methods have been proposed to address this problem, most of them neglect both i) the distinct data distribution characteristics of heterogeneous devices, and ii) the contributions and hazards of local models, which are extremely important in determining the quality of global model aggregation. In this paper, we present an effective FL method named FedEntropy with a novel dynamic device grouping scheme, which makes full use of the above two factors based on our proposed maximum entropy judgement heuristic. Unlike existing FL methods that directly aggregate the local models returned from all the selected devices, in each FL round FedEntropy first makes a judgement based on the pre-collected soft labels of the selected devices and then aggregates only the local models that maximize the overall entropy of these soft labels. By not collecting local models that are harmful to aggregation, FedEntropy can effectively improve global model accuracy while reducing the overall communication overhead. Comprehensive experimental results on well-known benchmarks show that FedEntropy not only outperforms state-of-the-art FL methods in terms of model accuracy and communication overhead, but can also be integrated into them to enhance their classification performance.
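    The abstract's selection rule can be read as a subset-selection problem; a greedy sketch (our own reading, the paper's exact procedure may differ) drops any device whose removal raises the entropy of the averaged soft labels:

        import numpy as np

        def soft_label_entropy(soft_labels):
            # Entropy of the average of the devices' soft-label distributions.
            p = np.mean(soft_labels, axis=0)
            p = p / p.sum()
            return -np.sum(p * np.log(p + 1e-12))

        def select_devices(device_soft_labels):
            # Greedily remove devices whose exclusion increases overall entropy.
            selected = list(range(len(device_soft_labels)))
            improved = True
            while improved and len(selected) > 1:
                improved = False
                base = soft_label_entropy([device_soft_labels[i] for i in selected])
                for i in list(selected):
                    rest = [j for j in selected if j != i]
                    if soft_label_entropy([device_soft_labels[j] for j in rest]) > base:
                        selected.remove(i)
                        improved = True
                        break
            return selected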
    Naive Few-Shot Learning: Sequence Consistency Evaluation. (arXiv:2205.12013v1 [cs.AI])
    Cognitive psychologists often use the term $\textit{fluid intelligence}$ to describe the ability of humans to solve novel tasks without any prior training. In contrast to humans, deep neural networks can perform cognitive tasks only after extensive (pre-)training with a large number of relevant examples. Motivated by fluid intelligence research in the cognitive sciences, we built a benchmark task which we call sequence consistency evaluation (SCE) that can be used to address this gap. Solving the SCE task requires the ability to extract simple rules from sequences, a basic computation that is required for solving various intelligence tests in humans. We tested $\textit{untrained}$ (naive) deep learning models in the SCE task. Specifically, we compared Relation Networks (RN) and Contrastive Predictive Coding (CPC), two models that can extract simple rules from sequences, and found that the latter, which imposes a structure on the predictable rule, does better. We further found that simple networks fare better in this task than complex ones. Finally, we show that this approach can be used for security camera anomaly detection without any prior training.
    Theoretical Analysis of Primal-Dual Algorithm for Non-Convex Stochastic Decentralized Optimization. (arXiv:2205.11979v1 [math.OC])
    In recent years, decentralized learning has emerged as a powerful tool not only for large-scale machine learning, but also for preserving privacy. One of the key challenges in decentralized learning is that the data distribution held by each node is statistically heterogeneous. To address this challenge, the primal-dual algorithm called the Edge-Consensus Learning (ECL) was proposed and was experimentally shown to be robust to the heterogeneity of data distributions. However, the convergence rate of the ECL is provided only when the objective function is convex, and has not been shown in a standard machine learning setting where the objective function is non-convex. Furthermore, the intuitive reason why the ECL is robust to the heterogeneity of data distributions has not been investigated. In this work, we first investigate the relationship between the ECL and Gossip algorithm and show that the update formulas of the ECL can be regarded as correcting the local stochastic gradient in the Gossip algorithm. Then, we propose the Generalized ECL (G-ECL), which contains the ECL as a special case, and provide the convergence rates of the G-ECL in both (strongly) convex and non-convex settings, which do not depend on the heterogeneity of data distributions. Through synthetic experiments, we demonstrate that the numerical results of both the G-ECL and ECL coincide with the convergence rate of the G-ECL.
    Can Adversarial Training Be Manipulated By Non-Robust Features?. (arXiv:2201.13329v2 [cs.LG] UPDATED)
    Adversarial training, originally designed to resist test-time adversarial examples, has been shown to be promising in mitigating training-time availability attacks. This defense ability, however, is challenged in this paper. We identify a novel threat model named stability attacks, which aims to hinder robust availability by slightly manipulating the training data. Under this threat, we show that adversarial training using a conventional defense budget $\epsilon$ provably fails to provide test robustness in a simple statistical setting, where the non-robust features of the training data can be reinforced by $\epsilon$-bounded perturbation. Further, we analyze the necessity of enlarging the defense budget to counter stability attacks. Finally, comprehensive experiments demonstrate that stability attacks are harmful on benchmark datasets, and thus the adaptive defense is necessary to maintain robustness.
    3D helical CT reconstruction with memory efficient invertible Learned Primal-Dual method. (arXiv:2205.11952v1 [eess.IV])
    Helical acquisition geometry is the most common geometry used in computed tomography (CT) scanners for medical imaging. We adapt the invertible Learned Primal-Dual (iLPD) deep neural network architecture so that it can be applied to helical 3D CT reconstruction. We achieve this by splitting the geometry and the data in parts that fit the memory and by splitting images into corresponding sub-volumes. The architecture can be applied to images different in size along the rotation axis. We perform the experiments on tomographic data simulated from realistic helical geometries.
    Realization Theory Of Recurrent Neural ODEs Using Polynomial System Embeddings. (arXiv:2205.11989v1 [math.OC])
    In this paper we show that neural ODE analogs of recurrent (ODE-RNN) and Long Short-Term Memory (ODE-LSTM) networks can be algorithmically embedded into the class of polynomial systems. This embedding preserves input-output behavior and can suitably be extended to other neural DE architectures. We then use realization theory of polynomial systems to provide necessary conditions for an input-output map to be realizable by an ODE-LSTM and sufficient conditions for minimality of such systems. These results represent the first steps towards a realization theory of recurrent neural ODE architectures, which is expected to be useful for model reduction and learning algorithm analysis of recurrent neural ODEs.
    How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory. (arXiv:2205.11930v1 [cs.CL])
    Human ratings are treated as the gold standard in NLG evaluation. The standard protocol is to collect ratings of generated text, average across annotators, and then rank NLG systems by their average scores. However, little consideration has been given as to whether this approach faithfully captures human preferences. In this work, we analyze this standard protocol through the lens of utility theory in economics. We first identify the implicit assumptions it makes about annotators and find that these assumptions are often violated in practice, in which case annotator ratings become an unfaithful reflection of their preferences. The most egregious violations come from using Likert scales, which provably reverse the direction of the true preference in certain cases. We suggest improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation. For the latter, we propose a new evaluation protocol called $\textit{system-level probabilistic assessment}$ (SPA). In our experiments, we find that according to SPA, annotators prefer larger GPT-3 variants to smaller ones -- as expected -- with all comparisons being statistically significant. In contrast, the standard protocol only yields significant results half the time.
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v1 [stat.ML])
    Most machine learning methods depend on the tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation, which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection method based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.
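    For context, a sketch of Gaussian-kernel KRR with an explicit bandwidth, plus the classic median heuristic as another cross-validation-free baseline; the paper's Jacobian-control formula itself is not reproduced here, so this shows only the tuning problem the paper addresses.

        import numpy as np

        def gaussian_kernel(X, Y, bandwidth):
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * bandwidth ** 2))

        def krr_fit(X, y, bandwidth, ridge=1e-3):
            # Kernel ridge regression: solve (K + ridge*I) alpha = y.
            K = gaussian_kernel(X, X, bandwidth)
            alpha = np.linalg.solve(K + ridge * np.eye(len(X)), y)
            return lambda Xq: gaussian_kernel(Xq, X, bandwidth) @ alpha

        def median_bandwidth(X):
            # Median heuristic: closed-form, but not the paper's Jacobian rule.
            d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
            return np.median(d[d > 0])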
    Assessing the Quality of Computational Notebooks for a Frictionless Transition from Exploration to Production. (arXiv:2205.11941v1 [cs.SE])
    The massive trend of integrating data-driven AI capabilities into traditional software systems is rising new intriguing challenges. One of such challenges is achieving a smooth transition from the explorative phase of Machine Learning projects - in which data scientists build prototypical models in the lab - to their production phase - in which software engineers translate prototypes into production-ready AI components. To narrow down the gap between these two phases, tools and practices adopted by data scientists might be improved by incorporating consolidated software engineering solutions. In particular, computational notebooks have a prominent role in determining the quality of data science prototypes. In my research project, I address this challenge by studying the best practices for collaboration with computational notebooks and proposing proof-of-concept tools to foster guidelines compliance.
    Estimation of Convex Polytopes for Automatic Discovery of Charge State Transitions in Quantum Dot Arrays. (arXiv:2108.09133v2 [cs.LG] UPDATED)
    In spin-based quantum dot arrays, material or fabrication imprecisions affect the behaviour of the device, which must be taken into account when controlling it. This requires measuring the shape of specific convex polytopes. In this work, we present an algorithm that automatically discovers the count, shape, and size of the facets of a convex polytope from measurements. Results on simulated devices as well as a real 2x2 spin qubit array show that we can reliably find the facets of the convex polytopes, including small facets with sizes on the order of the measurement precision.
    A Data-Centric Optimization Framework for Machine Learning. (arXiv:2110.10802v2 [cs.LG] UPDATED)
    Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute. However, as frameworks specialize performance optimization to patterns in popular networks, they implicitly constrain novel and diverse models that drive progress in research. We empower deep learning researchers by defining a flexible and user-customizable pipeline for optimizing training of arbitrary deep neural networks, based on data movement minimization. The pipeline begins with standard networks in PyTorch or ONNX and transforms computation through progressive lowering. We define four levels of general-purpose transformations, from local intra-operator optimizations to global data movement reduction. These operate on a data-centric graph intermediate representation that expresses computation and data movement at all levels of abstraction, including expanding basic operators such as convolutions to their underlying computations. Central to the design is the interactive and introspectable nature of the pipeline. Every part is extensible through a Python API, and can be tuned interactively using a GUI. We demonstrate competitive performance or speedups on ten different networks, with interactive optimizations discovering new opportunities in EfficientNet.
    MS-nowcasting: Operational Precipitation Nowcasting with Convolutional LSTMs at Microsoft Weather. (arXiv:2111.09954v2 [cs.LG] UPDATED)
    We present the encoder-forecaster convolutional long short-term memory (LSTM) deep-learning model that powers Microsoft Weather's operational precipitation nowcasting product. This model takes as input a sequence of weather radar mosaics and deterministically predicts future radar reflectivity at lead times up to 6 hours. By stacking a large input receptive field along the feature dimension and conditioning the model's forecaster with predictions from the physics-based High Resolution Rapid Refresh (HRRR) model, we are able to outperform optical flow and HRRR baselines by 20-25% on multiple metrics averaged over all lead times.
    Predicting Physics in Mesh-reduced Space with Temporal Attention. (arXiv:2201.09113v3 [cs.LG] UPDATED)
    Graph-based next-step prediction models have recently been very successful in modeling complex high-dimensional physical systems on irregular meshes. However, due to their short temporal attention span, these models suffer from error accumulation and drift. In this paper, we propose a new method that captures long-term dependencies through a transformer-style temporal attention model. We introduce an encoder-decoder structure to summarize features and create a compact mesh representation of the system state, to allow the temporal model to operate on low-dimensional mesh representations in a memory efficient manner. Our method outperforms a competitive GNN baseline on several complex fluid dynamics prediction tasks, from sonic shocks to vascular flow. We demonstrate stable rollouts without the need for training noise and show perfectly phase-stable predictions even for very long sequences. More broadly, we believe our approach paves the way to bringing the benefits of attention-based sequence models to solving high-dimensional complex physics tasks.
    From Predictions to Decisions: The Importance of Joint Predictive Distributions. (arXiv:2107.09224v3 [cs.LG] UPDATED)
    A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.
    Learning Interacting Dynamical Systems with Latent Gaussian Process ODEs. (arXiv:2205.11894v1 [cs.LG])
    We study for the first time uncertainty-aware modeling of continuous-time dynamics of interacting objects. We introduce a new model that decomposes independent dynamics of single objects accurately from their interactions. By employing latent Gaussian process ordinary differential equations, our model infers both independent dynamics and their interactions with reliable uncertainty estimates. In our formulation, each object is represented as a graph node and interactions are modeled by accumulating the messages coming from neighboring objects. We show that efficient inference of such a complex network of variables is possible with modern variational sparse Gaussian process inference techniques. We empirically demonstrate that our model improves the reliability of long-term predictions over neural network based alternatives and it successfully handles missing dynamic or static information. Furthermore, we observe that only our model can successfully encapsulate independent dynamics and interaction information in distinct functions and show the benefit from this disentanglement in extrapolation scenarios.
    Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision. (arXiv:2205.11913v1 [cs.DC])
    Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into GPU datacenters. An efficient scheduler design for such GPU datacenters is crucially important to reduce the operational cost and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads cannot support DL workloads in fully utilizing the GPU resources. Recently, many schedulers have been proposed that are tailored to DL workloads in GPU datacenters. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of scheduling objectives and resource consumption features. Finally, we prospect several promising future research directions. A more detailed summary with the surveyed papers and code links can be found at our project website: https://github.com/S-Lab-SystemGroup/Awesome-DL-Scheduling-Papers
    Large Language Models are Zero-Shot Reasoners. (arXiv:2205.11916v1 [cs.CL])
    Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved state-of-the-art performance in arithmetic and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding ``Let's think step by step'' before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performance on diverse benchmark reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with an off-the-shelf 175B parameter model. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting that high-level, multi-task broad cognitive capabilities may be extracted through simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
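    The method reduces to a two-stage prompting recipe: first elicit the reasoning with the trigger phrase, then append it and extract the final answer. A minimal sketch (the `llm` callable is a placeholder for any text-completion API):

        def zero_shot_cot(question: str, llm) -> str:
            # Stage 1: elicit step-by-step reasoning with the trigger phrase.
            prompt = f"Q: {question}\nA: Let's think step by step."
            reasoning = llm(prompt)
            # Stage 2: append the reasoning and extract the final answer.
            return llm(prompt + " " + reasoning + "\nTherefore, the answer is")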
    Physics-Embedded Neural Networks: $\boldsymbol{\mathrm{E}(n)}$-Equivariant Graph Neural PDE Solvers. (arXiv:2205.11912v1 [cs.LG])
    Graph neural network (GNN) is a promising approach to learning and predicting physical phenomena described in boundary value problems, such as partial differential equations (PDEs) with boundary conditions. However, existing models inadequately treat boundary conditions essential for the reliable prediction of such problems. In addition, because of the locally connected nature of GNNs, it is difficult to accurately predict the state after a long time, where interaction between vertices tends to be global. We present our approach termed physics-embedded neural networks that considers boundary conditions and predicts the state after a long time using an implicit method. It is built based on an $\mathrm{E}(n)$-equivariant GNN, resulting in high generalization performance on various shapes. We demonstrate that our model learns flow phenomena in complex shapes and outperforms a well-optimized classical solver and a state-of-the-art machine learning model in speed-accuracy trade-off. Therefore, our model can be a useful standard for realizing reliable, fast, and accurate GNN-based PDE solvers.
    The Data-Production Dispositif. (arXiv:2205.11963v1 [cs.HC])
    Machine learning (ML) depends on data to train and verify models. Very often, organizations outsource processes related to data work (i.e., generating and annotating data and evaluating outputs) through business process outsourcing (BPO) companies and crowdsourcing platforms. This paper investigates outsourced ML data work in Latin America by studying three platforms in Venezuela and a BPO in Argentina. We lean on the Foucauldian notion of dispositif to define the data-production dispositif as an ensemble of discourses, actions, and objects strategically disposed to (re)produce power/knowledge relations in data and labor. Our dispositif analysis comprises the examination of 210 data work instruction documents, 55 interviews with data workers, managers, and requesters, and participant observation. Our findings show that discourses encoded in instructions reproduce and normalize the worldviews of requesters. Precarious working conditions and economic dependency alienate workers, making them obedient to instructions. Furthermore, discourses and social contexts materialize in artifacts, such as interfaces and performance metrics, limiting workers' agency and normalizing specific ways of interpreting data. We conclude by stressing the importance of counteracting the data-production dispositif by fighting alienation and precarization, and empowering data workers to become assets in the quest for high-quality data.
    Multi-Agent Collaborative Inference via DNN Decoupling: Intermediate Feature Compression and Edge Learning. (arXiv:2205.11854v1 [cs.LG])
    Recently, deploying deep neural network (DNN) models via collaborative inference, which splits a pre-trained model into two parts and executes them on user equipment (UE) and edge server respectively, becomes attractive. However, the large intermediate feature of DNN impedes flexible decoupling, and existing approaches either focus on the single UE scenario or simply define tasks considering the required CPU cycles, but ignore the indivisibility of a single DNN layer. In this paper, we study the multi-agent collaborative inference scenario, where a single edge server coordinates the inference of multiple UEs. Our goal is to achieve fast and energy-efficient inference for all UEs. To achieve this goal, we first design a lightweight autoencoder-based method to compress the large intermediate feature. Then we define tasks according to the inference overhead of DNNs and formulate the problem as a Markov decision process (MDP). Finally, we propose a multi-agent hybrid proximal policy optimization (MAHPPO) algorithm to solve the optimization problem with a hybrid action space. We conduct extensive experiments with different types of networks, and the results show that our method can reduce up to 56\% of inference latency and save up to 72\% of energy consumption.
    Causal Influences Decouple From Their Underlying Network Structure In Echo State Networks. (arXiv:2205.11947v1 [cs.LG])
    Echo State Networks (ESN) are versatile recurrent neural network models in which the hidden layer remains unaltered during training. Interactions among nodes of this static backbone produce diverse representations of the given stimuli that are harnessed by a read-out mechanism to perform computations needed for solving a given task. ESNs are accessible models of neuronal circuits, since they are relatively inexpensive to train. Therefore, ESNs have become attractive for neuroscientists studying the relationship between neural structure, function, and behavior. For instance, it is not yet clear how distinctive connectivity patterns of brain networks support effective interactions among their nodes and how these patterns of interactions give rise to computation. To address this question, we employed an ESN with a biologically inspired structure and used a systematic multi-site lesioning framework to quantify the causal contribution of each node to the network's output, thus providing a causal link between network structure and behavior. We then focused on the structure-function relationship and decomposed the causal influence of each node on all other nodes, using the same lesioning framework. We found that nodes in a properly engineered ESN interact largely irrespective of the network's underlying structure. However, in a network with the same topology and a non-optimal parameter set, the underlying connectivity patterns determine the node interactions. Our results suggest that causal structure-function relations in ESNs can be decomposed into two components, direct and indirect interactions. The former are based on influences relying on structural connections. The latter describe the effective communication between any two nodes through other intermediate nodes. These widely distributed indirect interactions may crucially contribute to the efficient performance of ESNs.
    An Adaptive Contrastive Learning Model for Spike Sorting. (arXiv:2205.11914v1 [cs.LG])
    Brain-computer interfaces (BCIs) are ways for electronic devices to communicate directly with the brain. For most medical-type brain-computer interface tasks, the activity of multiple units of neurons or local field potentials is sufficient for decoding. But for BCIs used in neuroscience research, it is important to separate out the activity of individual neurons. With the development of large-scale silicon technology and the increasing number of probe channels, artificially interpreting and labeling spikes is becoming increasingly impractical. In this paper, we propose a novel modeling framework, the Adaptive Contrastive Learning Model, which learns representations from spikes through contrastive learning, with a loss function that maximizes mutual information as its theoretical basis. Building on the fact that data with similar features share the same labels, whether they are multi-classified or binary-classified, we simplify the multi-classification problem into multiple binary classifications, improving both the accuracy and the runtime efficiency. Moreover, we also introduce a series of enhancements for the spikes, while addressing the degradation in classification caused by overlapping spikes.
    A Quadrature Rule combining Control Variates and Adaptive Importance Sampling. (arXiv:2205.11890v1 [stat.ML])
    Driven by several successful applications such as in stochastic gradient descent or in Bayesian computation, control variates have become a major tool for Monte Carlo integration. However, standard methods do not allow the distribution of the particles to evolve during the algorithm, as is the case in sequential simulation methods. Within the standard adaptive importance sampling framework, a simple weighted least squares approach is proposed to improve the procedure with control variates. The procedure takes the form of a quadrature rule with adapted quadrature weights to reflect the information brought in by the control variates. The quadrature points and weights do not depend on the integrand, a computational advantage in case of multiple integrands. Moreover, the target density needs to be known only up to a multiplicative constant. Our main result is a non-asymptotic bound on the probabilistic error of the procedure. The bound proves that for improving the estimate's accuracy, the benefits from adaptive importance sampling and control variates can be combined. The good behavior of the method is illustrated empirically on synthetic examples and real-world data for Bayesian linear regression.
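    The weighted least squares recipe is compact; a sketch assuming each control variate has known integral zero (so the fitted intercept estimates the target integral), which is the standard setup for regression-based control variates:

        import numpy as np

        def cv_quadrature(f_vals, cv_vals, is_weights):
            # Regress f on mean-zero control variates under importance weights;
            # the intercept of the fit is the quadrature estimate.
            n = len(f_vals)
            X = np.hstack([np.ones((n, 1)), cv_vals])
            sw = np.sqrt(is_weights)
            beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * f_vals, rcond=None)
            return beta[0]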
    Advanced Manufacturing Configuration by Sample-efficient Batch Bayesian Optimization. (arXiv:2205.11827v1 [cs.LG])
    We propose a framework for the configuration and operation of expensive-to-evaluate advanced manufacturing methods, based on Bayesian optimization. The framework unifies a tailored acquisition function, a parallel acquisition procedure, and the integration of process information providing context to the optimization procedure. The novel acquisition function is demonstrated and analyzed on benchmark illustrative problems. We apply the optimization approach to atmospheric plasma spraying in simulation and experiments. Our results demonstrate that the proposed framework can efficiently find input parameters that produce the desired outcome and minimize the process cost.
    Accelerating Frank-Wolfe via Averaging Step Directions. (arXiv:2205.11794v1 [math.OC])
    The Frank-Wolfe method is a popular method in sparse constrained optimization, due to its fast per-iteration complexity. However, the tradeoff is that its worst case global convergence is comparatively slow, and importantly, is fundamentally slower than its flow rate--that is to say, the convergence rate is throttled by discretization error. In this work, we consider a modified Frank-Wolfe where the step direction is a simple weighted average of past oracle calls. This method requires very little memory and computational overhead, and provably decays this discretization error term. Numerically, we show that this method improves the convergence rate over several problems, especially after the sparse manifold has been detected. Theoretically, we show the method has an overall global convergence rate of $O(1/k^p)$, where $0< p < 1$; after manifold identification, this rate speeds to $O(1/k^{3p/2})$. We also observe that the method achieves this accelerated rate from a very early stage, suggesting a promising mode of acceleration for this family of methods.
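    A sketch of the idea using a simple exponentially weighted average of past oracle outputs (the paper's exact weighting may differ), shown on the probability simplex; averaging vertices keeps the iterate feasible, since the averaged direction target is a convex combination of vertices.

        import numpy as np

        def simplex_lmo(g):
            # Linear minimization oracle for the simplex: best single vertex.
            s = np.zeros_like(g)
            s[np.argmin(g)] = 1.0
            return s

        def averaged_frank_wolfe(grad, lmo, x0, steps, beta=0.3):
            x = x0.copy()
            s_avg = lmo(grad(x))
            for k in range(1, steps + 1):
                s = lmo(grad(x))
                s_avg = (1 - beta) * s_avg + beta * s  # average step direction
                gamma = 2.0 / (k + 2)
                x = (1 - gamma) * x + gamma * s_avg    # stays feasible
            return x

        c = np.array([0.1, 0.7, 0.2])  # minimize ||x - c||^2 over the simplex
        x = averaged_frank_wolfe(lambda x: 2 * (x - c), simplex_lmo,
                                 np.ones(3) / 3, steps=200)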
    Why KDAC? A general activation function for knowledge discovery. (arXiv:2111.13858v4 [cs.LG] UPDATED)
    Deep learning oriented named entity recognition (DNER) has gradually become the paradigm of knowledge discovery, which greatly promotes domain intelligence. However, the current activation function of DNER fails to treat gradient vanishing, no negative output or non-differentiable existence, which may impede knowledge exploration caused by the omission and incomplete representation of latent semantics. To break through the dilemma, we present a novel activation function termed KDAC. Detailly, KDAC is an aggregation function with multiple conversion modes. The backbone of the activation region is the interaction between exponent and linearity, and the both ends extend through adaptive linear divergence, which surmounts the obstacle of gradient vanishing and no negative output. Crucially, the non-differentiable points are alerted and eliminated by an approximate smoothing algorithm. KDAC has a series of brilliant properties, including nonlinear, stable near-linear transformation and derivative, as well as dynamic style, etc. We perform experiments based on BERT-BiLSTM-CNN-CRF model on six benchmark datasets containing different domain knowledge, such as Weibo, Clinical, E-commerce, Resume, HAZOP and People's daily. The evaluation results show that KDAC is advanced and effective, and can provide more generalized activation to stimulate the performance of DNER. We hope that KDAC can be exploited as a promising activation function to devote itself to the construction of knowledge.
    Energy Forecasting in Smart Grid Systems: A Review of the State-of-the-art Techniques. (arXiv:2011.12598v3 [cs.LG] UPDATED)
    Energy forecasting has a vital role to play in smart grid (SG) systems involving various applications such as demand-side management, load shedding, and optimum dispatch. Managing efficient forecasting while ensuring the least possible prediction error is one of the main challenges posed in the grid today, considering the uncertainty and granularity in SG data. This paper presents a comprehensive and application-oriented review of state-of-the-art forecasting methods for SG systems along with recent developments in probabilistic deep learning (PDL) considering different models and architectures. Traditional point forecasting methods including statistical, machine learning (ML), and deep learning (DL) are extensively investigated in terms of their applicability to energy forecasting. In addition, the significance of hybrid and data pre-processing techniques to support forecasting performance is also studied. A comparative case study using the Victorian electricity consumption and American electric power (AEP) datasets is conducted to analyze the performance of point and probabilistic forecasting methods. The analysis demonstrates higher accuracy of the long-short term memory (LSTM) models with appropriate hyper-parameter tuning among point forecasting methods especially when sample sizes are larger and involve nonlinear patterns with long sequences. Furthermore, Bayesian bidirectional LSTM (BLSTM) as a probabilistic method exhibit the highest accuracy in terms of least pinball score and root mean square error (RMSE).
    Neural Distributed Source Coding. (arXiv:2106.02797v2 [cs.IT] UPDATED)
    Distributed source coding (DSC) is the task of encoding an input in the absence of correlated side information that is only available to the decoder. Remarkably, Slepian and Wolf showed in 1973 that an encoder without access to the side information can asymptotically achieve the same compression rate as when the side information is available to it. While there is vast prior work on this topic, practical DSC has been limited to synthetic datasets and specific correlation structures. Here we present a framework for lossy DSC that is agnostic to the correlation structure and can scale to high dimensions. Rather than relying on hand-crafted source-modeling, our method utilizes a conditional VQ-VAE to learn the distributed encoder and decoder. We evaluate our method on multiple datasets and show that our method can handle complex correlations -- significantly better than the current state-of-the-art method.
    Efficient and Robust Algorithms for Adversarial Linear Contextual Bandits. (arXiv:2002.00287v3 [cs.LG] UPDATED)
    We consider an adversarial variant of the classic $K$-armed linear contextual bandit problem where the sequence of loss functions associated with each arm are allowed to change without restriction over time. Under the assumption that the $d$-dimensional contexts are generated i.i.d.~at random from a known distributions, we develop computationally efficient algorithms based on the classic Exp3 algorithm. Our first algorithm, RealLinExp3, is shown to achieve a regret guarantee of $\widetilde{O}(\sqrt{KdT})$ over $T$ rounds, which matches the best available bound for this problem. Our second algorithm, RobustLinExp3, is shown to be robust to misspecification, in that it achieves a regret bound of $\widetilde{O}((Kd)^{1/3}T^{2/3}) + \varepsilon \sqrt{d} T$ if the true reward function is linear up to an additive nonlinear error uniformly bounded in absolute value by $\varepsilon$. To our knowledge, our performance guarantees constitute the very first results on this problem setting.
    Out-of-domain Detection for Natural Language Understanding in Dialog Systems. (arXiv:1909.03862v4 [cs.CL] UPDATED)
    Natural Language Understanding (NLU) is a vital component of dialogue systems, and its ability to detect Out-of-Domain (OOD) inputs is critical in practical applications, since the acceptance of the OOD input that is unsupported by the current system may lead to catastrophic failure. However, most existing OOD detection methods rely heavily on manually labeled OOD samples and cannot take full advantage of unlabeled data. This limits the feasibility of these models in practical applications. In this paper, we propose a novel model to generate high-quality pseudo OOD samples that are akin to IN-Domain (IND) input utterances, and thereby improves the performance of OOD detection. To this end, an autoencoder is trained to map an input utterance into a latent code. and the codes of IND and OOD samples are trained to be indistinguishable by utilizing a generative adversarial network. To provide more supervision signals, an auxiliary classifier is introduced to regularize the generated OOD samples to have indistinguishable intent labels. Experiments show that these pseudo OOD samples generated by our model can be used to effectively improve OOD detection in NLU. Besides, we also demonstrate that the effectiveness of these pseudo OOD data can be further improved by efficiently utilizing unlabeled data.
    CDFKD-MFS: Collaborative Data-free Knowledge Distillation via Multi-level Feature Sharing. (arXiv:2205.11845v1 [cs.CV])
    Recently, the compression and deployment of powerful deep neural networks (DNNs) on resource-limited edge devices to provide intelligent services have become attractive tasks. Although knowledge distillation (KD) is a feasible solution for compression, its requirement on the original dataset raises privacy concerns. In addition, it is common to integrate multiple pretrained models to achieve satisfactory performance. How to compress multiple models into a tiny model is challenging, especially when the original data are unavailable. To tackle this challenge, we propose a framework termed collaborative data-free knowledge distillation via multi-level feature sharing (CDFKD-MFS), which consists of a multi-header student module, an asymmetric adversarial data-free KD module, and an attention-based aggregation module. In this framework, the student model equipped with a multi-level feature-sharing structure learns from multiple teacher models and is trained together with a generator in an asymmetric adversarial manner. When some real samples are available, the attention module adaptively aggregates predictions of the student headers, which can further improve performance. We conduct extensive experiments on three popular computer visual datasets. In particular, compared with the most competitive alternative, the accuracy of the proposed framework is 1.18\% higher on the CIFAR-100 dataset, 1.67\% higher on the Caltech-101 dataset, and 2.99\% higher on the mini-ImageNet dataset.
    Learning to Assemble Geometric Shapes. (arXiv:2205.11809v1 [cs.CV])
    Assembling parts into an object is a combinatorial problem that arises in a variety of contexts in the real world and involves numerous applications in science and engineering. Previous related work tackles limited cases with identical unit parts or jigsaw-style parts of textured shapes, which greatly mitigate combinatorial challenges of the problem. In this work, we introduce the more challenging problem of shape assembly, which involves textureless fragments of arbitrary shapes with indistinctive junctions, and then propose a learning-based approach to solving it. We demonstrate the effectiveness on shape assembly tasks with various scenarios, including the ones with abnormal fragments (e.g., missing and distorted), the different number of fragments, and different rotation discretization.
    Attributing AUC-ROC to Analyze Binary Classifier Performance. (arXiv:2205.11781v1 [cs.LG])
    Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is a popular evaluation metric for binary classifiers. In this paper, we discuss techniques to segment the AUC-ROC along human-interpretable dimensions. AUC-ROC is not an additive/linear function over the data samples, so segmenting the overall AUC-ROC is different from tabulating the AUC-ROC of data segments. To segment the overall AUC-ROC, we must first solve an \emph{attribution} problem to identify credit for individual examples. We observe that AUC-ROC, though non-linear over examples, is linear over \emph{pairs} of examples. This observation leads to a simple, efficient attribution technique for examples (example attributions), and for pairs of examples (pair attributions). We automatically slice these attributions using decision trees by making the tree predict the attributions; we use the notion of honest estimates along with a t-test to mitigate false discovery. Our experiments with the method show that an inferior model can outperform a superior model (trained to optimize a different training objective) on the inferior model's own training objective, a manifestation of Goodhart's Law. In contrast, AUC attributions enable a reasonable comparison. Example attributions can be used to slice this comparison. Pair attributions are used to categorize pairs of items -- one positively labeled and one negatively -- that the model has trouble separating. These categories identify the decision boundary of the classifier and the headroom to improve AUC.
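    The pair-linearity observation is easy to make concrete: AUC-ROC equals the mean, over all (positive, negative) pairs, of the indicator that the positive is scored higher (ties count half), so that indicator is a natural pair attribution. A sketch below; the paper's decision-tree slicing machinery is omitted.

        import numpy as np

        def pair_attributions(scores, labels):
            # credit[i, j] = 1 if positive i outscores negative j, 0.5 on ties.
            pos = scores[labels == 1]
            neg = scores[labels == 0]
            credit = (pos[:, None] > neg[None, :]).astype(float)
            credit += 0.5 * (pos[:, None] == neg[None, :])
            return credit  # credit.mean() equals the AUC-ROC

        scores = np.array([0.9, 0.8, 0.3, 0.2, 0.6])
        labels = np.array([1, 1, 0, 0, 1])
        print(pair_attributions(scores, labels).mean())  # the AUC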
    Penalized Proximal Policy Optimization for Safe Reinforcement Learning. (arXiv:2205.11814v1 [cs.LG])
    Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle for efficient policy updates with hard constraint satisfaction. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis for approximate error when evaluated on sample trajectories. Moreover, we extend P3O to more challenging multi-constraint and multi-agent scenarios which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
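    A sketch of a penalized clipped surrogate in the spirit of P3O (variable names and the exact penalty form are our assumptions): PPO's clipped reward term plus a ReLU penalty that activates only when the cost constraint is violated, turning the constrained update into a single unconstrained minimization.

        import torch

        def p3o_loss(ratio, adv_r, adv_c, cost_excess, kappa=1.0, eps=0.2):
            # ratio: pi_new/pi_old; adv_r/adv_c: reward/cost advantages;
            # cost_excess: current expected cost minus the cost limit.
            clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
            reward_obj = torch.min(ratio * adv_r, clipped * adv_r).mean()
            cost_surr = torch.max(ratio * adv_c, clipped * adv_c).mean()
            penalty = kappa * torch.relu(cost_surr + cost_excess)
            return -reward_obj + penalty  # single unconstrained minimization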
    NFL: Robust Learned Index via Distribution Transformation. (arXiv:2205.11807v1 [cs.DB])
    Recent works on learned index open a new direction for the indexing field. The key insight of the learned index is to approximate the mapping between keys and positions with piece-wise linear functions. Such methods require partitioning the key space for a better approximation. Although many heuristics have been proposed to improve the approximation quality, the bottleneck is that the segmentation overheads could hinder the overall performance. This paper tackles the approximation problem by applying a \textit{distribution transformation} to the keys before constructing the learned index. A two-stage Normalizing-Flow-based Learned index framework (NFL) is proposed, which first transforms the original complex key distribution into a near-uniform distribution, then builds a learned index leveraging the transformed keys. For effective distribution transformation, we propose a Numerical Normalizing Flow (Numerical NF). Based on the characteristics of the transformed keys, we propose a robust After-Flow Learned Index (AFLI). To validate the performance, comprehensive evaluations are conducted on both synthetic and real-world workloads, which show that the proposed NFL produces the highest throughput and the lowest tail latency compared to the state-of-the-art learned indexes.
    Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model. (arXiv:2205.11833v1 [cs.LG])
    Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has not been very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demonstrate that winning-ticket subnetworks produce more diverse predictions than dense networks, and that their ensemble outperforms the standard ensemble on some tasks.
    Alleviating Robust Overfitting of Adversarial Training With Consistency Regularization. (arXiv:2205.11744v1 [cs.LG])
    Adversarial training (AT) has proven to be one of the most effective ways to defend Deep Neural Networks (DNNs) against adversarial attacks. However, the phenomenon of robust overfitting, i.e., robustness dropping sharply at a certain stage of training, always exists during AT. It is of great importance to decrease this robust generalization gap in order to obtain a robust model. In this paper, we present an in-depth study of robust overfitting from a new angle. We observe that consistency regularization, a popular technique in semi-supervised learning, has a goal similar to that of AT and can be used to alleviate robust overfitting. We empirically validate this observation, and find that a majority of prior solutions have implicit connections to consistency regularization. Motivated by this, we introduce a new AT solution, which integrates consistency regularization and the Mean Teacher (MT) strategy into AT. Specifically, we introduce a teacher model obtained from the average weights of the student models over the training steps. Then we design a consistency loss function to make the prediction distribution of the student model on adversarial examples consistent with that of the teacher model on clean samples. Experiments show that our proposed method can effectively alleviate robust overfitting and improve the robustness of DNN models against common adversarial attacks.
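    A minimal sketch of how such a consistency term could look, assuming an exponential-moving-average teacher and a KL consistency loss (the decay value and loss shape are our assumptions, not the paper's exact recipe):

        import torch
        import torch.nn.functional as F

        def ema_update(teacher, student, decay=0.999):
            # Teacher tracks the running average of the student's weights.
            for t_p, s_p in zip(teacher.parameters(), student.parameters()):
                t_p.data.mul_(decay).add_(s_p.data, alpha=1 - decay)

        def consistency_loss(student, teacher, x_clean, x_adv):
            # Match the student's prediction on adversarial inputs to the
            # teacher's prediction on the corresponding clean inputs.
            with torch.no_grad():
                p_teacher = F.softmax(teacher(x_clean), dim=-1)
            log_p_student = F.log_softmax(student(x_adv), dim=-1)
            return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    This term is added to the usual adversarial training loss; the teacher itself is never updated by gradients.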
    Wireless Ad Hoc Federated Learning: A Fully Distributed Cooperative Machine Learning. (arXiv:2205.11779v1 [cs.LG])
    Federated learning has allowed the training of a global model by aggregating local models trained on local nodes. However, it still follows a client-server model, which can be further distributed, fully decentralized, or even made partially connected or totally opportunistic. In this paper, we propose wireless ad hoc federated learning (WAFL) -- fully distributed cooperative machine learning organized by nodes that are physically nearby. Each node has a wireless interface and can communicate with others when they are within radio range. The nodes are expected to move with people, vehicles, or robots, producing opportunistic contacts with each other. In WAFL, each node trains a model individually with its local data. When a node encounters others, they exchange their trained models and generate new aggregated models, which are expected to be more general than the locally trained models on non-IID data. For evaluation, we prepared four static communication networks and two types of dynamic and opportunistic communication networks, based on random waypoint mobility and a community-structured environment, and studied the training of a fully connected neural network on a 90% non-IID MNIST dataset. The evaluation results indicate that WAFL allows model parameters across nodes to converge toward generalization, even in opportunistic contact scenarios, whereas in the self-training (lonely training) case they diverge. This generalization allowed WAFL to achieve 94.7-96.2% accuracy on the IID test dataset, compared to 84.7% for self-training.
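    The core of WAFL is what happens on contact. A minimal sketch under a uniform-averaging assumption (the paper may weight the aggregation differently):

        import copy
        import torch

        def wafl_exchange(my_model, neighbor_models):
            # On opportunistic contact, average parameters with whatever
            # neighbors are currently within radio range.
            models = [my_model] + list(neighbor_models)
            avg_state = copy.deepcopy(my_model.state_dict())
            for key in avg_state:
                avg_state[key] = torch.stack(
                    [m.state_dict()[key].float() for m in models]
                ).mean(dim=0)
            my_model.load_state_dict(avg_state)

    Between contacts, each node keeps training the (now aggregated) model on its own local, possibly non-IID data.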
    Constrained Monotonic Neural Networks. (arXiv:2205.11775v1 [cs.LG])
    Deep neural networks are becoming increasingly popular in approximating arbitrary functions from noisy data. But wider adoption is being hindered by the need to explain such models and to impose additional constraints on them. The monotonicity constraint is one of the most requested properties in real-world scenarios and is the focus of this paper. One of the oldest ways to construct a monotonic fully connected neural network is to constrain its weights to be non-negative while employing a monotonic activation function. Unfortunately, this construction does not work with popular non-saturated activation functions such as ReLU, ELU, and SELU, as it can only approximate convex functions. We show this shortcoming can be fixed by employing the original activation function for part of the neurons in a layer, and its point reflection for the other part. Our experiments show this approach to building monotonic deep neural networks has matching or better accuracy when compared to other state-of-the-art methods such as deep lattice networks or monotonic networks obtained by heuristic regularization. The method is also the simplest in the sense of having the fewest parameters and requiring no modifications to the learning procedure or post-learning steps.
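    The construction is compact enough to sketch directly. The layer below enforces non-negative weights via an absolute value (one of several possible parameterizations; the even half-and-half split is also our simplification) and applies ReLU to half the units and its point reflection $-\rho(-x)$ to the rest, so both convex and concave shapes become representable while monotonicity is preserved:

        import torch
        import torch.nn as nn

        class MonotonicLayer(nn.Module):
            def __init__(self, d_in, d_out):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
                self.bias = nn.Parameter(torch.zeros(d_out))

            def forward(self, x):
                w = torch.abs(self.weight)        # non-negative weights
                z = x @ w.t() + self.bias         # monotone pre-activation
                half = z.shape[-1] // 2
                a = torch.relu(z[..., :half])     # original activation
                b = -torch.relu(-z[..., half:])   # its point reflection
                return torch.cat([a, b], dim=-1)

    Stacking such layers keeps the network monotonically non-decreasing in every input coordinate.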
    ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest. (arXiv:2205.11728v1 [cs.IR])
    Learned embeddings for products are an important building block for web-scale e-commerce recommendation systems. At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases including user, image and search based recommendations. This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost. While most prior work focuses on building product embeddings from features coming from a single modality, we introduce a transformer-based architecture capable of aggregating information from both text and image modalities and show that it significantly outperforms single modality baselines. We also utilize multi-task learning to make ItemSage optimized for several engagement types, leading to a candidate generation system that is efficient for all of the engagement objectives of the end-to-end recommendation system. Extensive offline experiments are conducted to illustrate the effectiveness of our approach and results from online A/B experiments show substantial gains in key business metrics (up to +7% gross merchandise value/user and +11% click volume).
    On the Role of Bidirectionality in Language Model Pre-Training. (arXiv:2205.11726v1 [cs.CL])
    Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find that the differences remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.
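    To make the attention side of the distinction concrete, here is a hedged sketch of the masks that separate a fully unidirectional model from a prefix LM (the paper's framework controls bidirectional context and bidirectional attention separately and more finely than this):

        import torch

        def build_mask(seq_len, prefix_len=0):
            # True = this query position may attend to that key position.
            mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
            mask[:, :prefix_len] = True   # prefix tokens visible to all
            return mask

        print(build_mask(4, prefix_len=0))  # causal mask (GPT-style)
        print(build_mask(4, prefix_len=2))  # bidirectional attention over a prefix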
    BabyBear: Cheap inference triage for expensive language models. (arXiv:2205.11747v1 [cs.CL])
    Transformer language models provide superior accuracy over previous models but they are computationally and environmentally expensive. Borrowing the concept of model cascading from computer vision, we introduce BabyBear, a framework for cascading models for natural language processing (NLP) tasks to minimize cost. The core strategy is inference triage: exiting early when the least expensive model in the cascade achieves a sufficiently high-confidence prediction. We test BabyBear on several open source data sets related to document classification and entity recognition. We find that for common NLP tasks a high proportion of the inference load can be handled by cheap, fast models that have learned by observing a deep learning model. This allows us to reduce the compute cost of large-scale classification jobs by more than 50% while retaining overall accuracy. For named entity recognition, we save 33% of the deep learning compute while maintaining an F1 score higher than 95% on the CoNLL benchmark.
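    The triage loop itself is simple; a hedged sketch, where the (label, confidence) interface and the fixed threshold are our illustrative assumptions rather than BabyBear's actual API:

        def triage(texts, cheap_model, expensive_model, threshold=0.9):
            results = []
            for text in texts:
                label, confidence = cheap_model(text)
                if confidence >= threshold:
                    results.append(label)            # early exit
                else:
                    results.append(expensive_model(text))
            return results

    The threshold trades compute savings against accuracy: raising it routes more examples to the expensive model.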
    HiPAL: A Deep Framework for Physician Burnout Prediction Using Activity Logs in Electronic Health Records. (arXiv:2205.11680v1 [cs.LG])
    Burnout is a significant public health concern affecting nearly half of the healthcare workforce. This paper presents the first end-to-end deep learning framework for predicting physician burnout based on clinician activity logs, digital traces of their work activities, available in any electronic health record (EHR) system. In contrast to prior approaches that exclusively relied on surveys for burnout measurement, our framework directly learns deep workload representations from large-scale clinician activity logs to predict burnout. We propose Hierarchical burnout Prediction based on Activity Logs (HiPAL), featuring a pre-trained time-dependent activity embedding mechanism tailored for activity logs and a hierarchical predictive model, which mirrors the natural hierarchical structure of clinician activity logs and captures a physician's evolving workload patterns at both short-term and long-term levels. To utilize the large amount of unlabeled activity logs, we propose a semi-supervised framework that learns to transfer knowledge extracted from unlabeled clinician activities to the HiPAL-based prediction model. Experiments on over 15 million clinician activity logs collected from the EHR at a large academic medical center demonstrate the advantages of our framework over state-of-the-art approaches in both predictive performance and training efficiency.
    Demand Response Method Considering Multiple Types of Flexible Loads in Industrial Parks. (arXiv:2205.11743v1 [eess.SY])
    With the rapid development of the energy internet, the proportion of flexible loads in the smart grid has grown considerably, and it is important to model them based on demand response. Therefore, a new demand response method considering multiple types of flexible loads is proposed in this paper to characterize integrated demand response (IDR) resources. Firstly, a physical process analytical deduction (PPAD) model is proposed to improve the classification of flexible loads in industrial parks. Scenario generation, data point augmentation, and smooth curves under various operating conditions are considered to enhance the applicability of the model. Secondly, given the instability and limited modeling quality of Wasserstein generative adversarial networks (WGAN), an improved WGAN with gradient penalty (IWGAN-GP) model is developed that converges faster than the traditional WGAN and generates higher-quality samples. Finally, the PPAD and IWGAN-GP models are jointly implemented to reveal the degree of correlation between flexible loads, and an intelligent offline database is built to handle the impact of nonlinear factors in different response scenarios. Numerical examples show that the proposed method significantly outperforms existing techniques in reducing load-modeling deviation and improving the responsiveness of park loads.
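    The gradient penalty at the heart of WGAN-GP (which IWGAN-GP builds on; the paper's specific improvements are not shown here) penalizes the critic's gradient norm at points interpolated between real and generated samples:

        import torch

        def gradient_penalty(critic, real, fake):
            # real, fake: (batch, features) load-profile samples.
            eps = torch.rand(real.size(0), 1, device=real.device)
            interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
            score = critic(interp)
            grad = torch.autograd.grad(
                outputs=score, inputs=interp,
                grad_outputs=torch.ones_like(score),
                create_graph=True, retain_graph=True,
            )[0]
            # Drive the critic toward being 1-Lipschitz.
            return ((grad.norm(2, dim=1) - 1) ** 2).mean()

    The penalty is added to the critic loss with a weight (commonly 10), replacing WGAN's weight clipping and typically stabilizing training.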
    Prediction of the Position of External Markers Using a Recurrent Neural Network Trained With Unbiased Online Recurrent Optimization for Safe Lung Cancer Radiotherapy. (arXiv:2106.01100v3 [eess.IV] UPDATED)
    During lung radiotherapy, the position of infrared reflective objects on the chest can be recorded to estimate the tumor location. However, radiotherapy systems have a latency inherent to robot control limitations that impedes the radiation delivery precision. Prediction with online learning of recurrent neural networks (RNN) allows for adaptation to non-stationary respiratory signals, but classical methods such as RTRL and truncated BPTT are respectively slow and biased. This study investigates the capabilities of unbiased online recurrent optimization (UORO) to forecast respiratory motion and enhance safety in lung radiotherapy. We used 9 observation records of the 3D position of 3 external markers on the chest and abdomen of healthy individuals breathing during intervals from 73s to 222s. The sampling frequency was 10Hz, and the amplitudes of the recorded trajectories range from 6mm to 40mm in the superior-inferior direction. We forecast the 3D location of each marker simultaneously with a horizon value between 0.1s and 2.0s, using an RNN trained with UORO. We compare its performance with an RNN trained with RTRL, LMS, and offline linear regression. We provide closed-form expressions for quantities involved in the gradient loss calculation in UORO, thereby making its implementation efficient. Training and cross-validation were performed during the first minute of each sequence. On average over the horizon values considered and the 9 sequences, UORO achieves the lowest root-mean-square (RMS) error and maximum error among the compared algorithms. These errors are respectively equal to 1.3mm and 8.8mm, and the prediction time per time step was lower than 2.8ms (Dell Intel core i9-9900K 3.60 GHz). Linear regression has the lowest RMS error for the horizon values 0.1s and 0.2s, followed by LMS for horizon values between 0.3s and 0.5s, and UORO for horizon values greater than 0.6s.
    Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable. (arXiv:2205.11716v1 [cs.LG])
    Recently, neural networks have been shown to perform exceptionally well in transforming two arbitrary sets into two linearly separable sets. Doing this with a randomly initialized neural network is of immense interest because the associated computation is cheaper than using fully trained networks. In this paper, we show that, with sufficient width, a randomly initialized one-layer neural network transforms two sets into two linearly separable sets with high probability. Furthermore, we provide explicit bounds on the required width of the neural network for this to occur. Our first bound is exponential in the input dimension and polynomial in all other parameters, while our second bound is independent of the input dimension, thereby overcoming the curse of dimensionality. We also perform an experimental study comparing the separation capacity of randomly initialized one-layer and two-layer neural networks. With correctly chosen biases, our study shows that for low-dimensional data the two-layer neural network outperforms the one-layer network, while the opposite is observed for higher-dimensional data.
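    The claim is cheap to try. A toy sketch under our own choices of data, width, and downstream classifier (the paper's contribution is the width bounds, not this particular setup):

        import numpy as np
        from sklearn.linear_model import Perceptron

        rng = np.random.default_rng(0)
        # Two classes on concentric circles: not linearly separable in 2D.
        theta = rng.uniform(0, 2 * np.pi, 200)
        X = np.r_[np.c_[np.cos(theta[:100]), np.sin(theta[:100])],
                  3 * np.c_[np.cos(theta[100:]), np.sin(theta[100:])]]
        y = np.r_[np.zeros(100), np.ones(100)]

        # Random one-layer network: fixed Gaussian weights and biases + ReLU.
        width = 2000
        W = rng.normal(size=(2, width))
        b = rng.normal(size=width)      # biases matter, per the paper
        features = np.maximum(X @ W + b, 0)

        clf = Perceptron(max_iter=2000).fit(features, y)
        print(clf.score(features, y))   # typically 1.0: lifted data separable

    No weight in the random layer is ever trained; only the linear classifier on top is fitted.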
    Embedding Neighborhoods Simultaneously t-SNE (ENS-t-SNE). (arXiv:2205.11720v1 [cs.LG])
    We propose an algorithm for visualizing a dataset by embedding it in 3-dimensional Euclidean space based on various given distances between the same pairs of datapoints. Its aim is to find an Embedding which preserves Neighborhoods Simultaneously for all given distances by generalizing the t-Stochastic Neighborhood Embedding approach (ENS-t-SNE). We illustrate the utility of ENS-t-SNE in three applications. First, to visualize different notions of clusters and groups within the same high-dimensional dataset with one 3-dimensional embedding, as opposed to providing different embeddings of the same data and trying to match the corresponding points. Second, to illustrate the effects of different hyper-parameters of the classical t-SNE. Third, by considering multiple different notions of clustering in data, ENS-t-SNE can generate an embedding that offers an alternative to that of classic t-SNE. We provide an extensive quantitative evaluation with real-world and synthetic datasets of different sizes and using different numbers of projections.
    High-Order Pooling for Graph Neural Networks with Tensor Decomposition. (arXiv:2205.11691v1 [cs.LG])
    Graph Neural Networks (GNNs) are attracting growing attention due to their effectiveness and flexibility in modeling a variety of graph-structured data. Existing GNN architectures usually adopt simple pooling operations (e.g., sum, average, max) when aggregating messages from a local neighborhood for updating node representation or pooling node representations from the entire graph to compute the graph representation. Though simple and effective, these linear operations do not model high-order non-linear interactions among nodes. We propose the Tensorized Graph Neural Network (tGNN), a highly expressive GNN architecture relying on tensor decomposition to model high-order non-linear node interactions. tGNN leverages the symmetric CP decomposition to efficiently parameterize permutation-invariant multilinear maps for modeling node interactions. Theoretical and empirical analysis on both node and graph classification tasks show the superiority of tGNN over competitive baselines. In particular, tGNN achieves state-of-the-art results on two OGB node classification datasets and one OGB graph classification dataset.
    Semi-Parametric Deep Neural Networks in Linear Time and Memory. (arXiv:2205.11718v1 [cs.LG])
    Recent advances in deep learning have been driven by large-scale parametric models, which can be computationally expensive and lack interpretability. Semi-parametric methods query the training set at inference time and can be more compact, although they typically have quadratic computational complexity. Here, we introduce SPIN, a general-purpose semi-parametric neural architecture whose computational cost is linear in the size and dimensionality of the data. Our architecture is inspired by inducing point methods and relies on a novel application of cross-attention between datapoints. At inference time, its computational cost is constant in the training set size as the data gets distilled into a fixed number of inducing points. We find that our method reduces the computational requirements of existing semi-parametric models by up to an order of magnitude across a range of datasets and improves state-of-the-art performance on an important practical problem, genotype imputation.
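    The inducing-point trick behind the linear cost is easy to sketch with off-the-shelf attention (a rough sketch of the idea; SPIN's actual architecture also attends across attributes and differs in detail):

        import torch
        import torch.nn as nn

        class InducingCrossAttention(nn.Module):
            def __init__(self, dim, num_inducing, heads=4):
                super().__init__()
                self.inducing = nn.Parameter(torch.randn(num_inducing, dim))
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, x):              # x: (batch, n, dim) datapoints
                q = self.inducing.unsqueeze(0).expand(x.size(0), -1, -1)
                out, _ = self.attn(q, x, x)    # inducing points attend to data
                return out                     # (batch, m, dim) summary

    Because the m learned inducing points act as queries, the attention costs O(nm) rather than O(n^2), and at inference the training set can be pre-distilled into the m summary vectors, making the cost constant in the training set size.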
    Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning. (arXiv:2205.12184v1 [cs.LG])
    Continuous-time reinforcement learning offers an appealing formalism for describing control problems in which the passage of time is not naturally divided into discrete increments. Here we consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time, stochastic environment. Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, learning state representations, multiagent coordination, and more. We begin by establishing the distributional analogue of the Hamilton-Jacobi-Bellman (HJB) equation for It\^o diffusions and the broader class of Feller-Dynkin processes. We then specialize this equation to the setting in which the return distribution is approximated by $N$ uniformly-weighted particles, a common design choice in distributional algorithms. Our derivation highlights additional terms due to statistical diffusivity which arise from the proper handling of distributions in the continuous-time setting. Based on this, we propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm. We demonstrate the effectiveness of such an algorithm in a synthetic control problem.
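    For orientation, the classical expected-value HJB equation that the paper lifts to distributions reads, for an It\^o diffusion with drift $\mu$, diffusion $\sigma$, reward $r$, and discount rate $\beta$ (our paraphrase of the standard result, not the paper's equation):

        \beta V(x) \;=\; \max_{a}\Big[\, r(x,a) \;+\; \mu(x,a)^\top \nabla V(x)
        \;+\; \tfrac{1}{2}\,\operatorname{tr}\!\big(\sigma(x,a)\sigma(x,a)^\top \nabla^2 V(x)\big) \Big].

    The paper's contribution is the analogue of this equation for the full return distribution rather than its expectation, which is where the additional statistical-diffusivity terms mentioned above arise.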
    Policy Compliance Detection via Expression Tree Inference. (arXiv:2205.12259v1 [cs.CL])
    Policy Compliance Detection (PCD) is a task we encounter when reasoning over texts, e.g. legal frameworks. Previous work to address PCD relies heavily on modeling the task as a special case of Recognizing Textual Entailment. Entailment is applicable to PCD; however, viewing the policy as a single proposition, as opposed to multiple interlinked propositions, yields poor performance and lacks explainability. To address this challenge, more recent proposals for PCD have argued for decomposing policies into expression trees consisting of questions connected with logic operators. Question answering is used to obtain answers to these questions with respect to a scenario. Finally, the expression tree is evaluated in order to arrive at an overall solution. However, this work assumes expression trees are provided by experts, thus limiting its applicability to new policies. In this work, we learn how to infer expression trees automatically from policy texts. We ensure the validity of the inferred trees by introducing constrained decoding with a finite state automaton. We determine through automatic evaluation that 63% of the expression trees generated by our constrained generation model are logically equivalent to gold trees. Human evaluation shows that 88% of trees generated by our model are correct.
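    Once a tree is inferred, evaluating it against QA answers is mechanical. A self-contained sketch with a hypothetical tree structure (the paper's operator set and QA interface may differ):

        from dataclasses import dataclass
        from typing import List, Union

        @dataclass
        class Node:
            op: str                      # "and" | "or" | "not"
            children: List["Tree"]

        Tree = Union[Node, str]          # a leaf is a yes/no question

        def evaluate(tree: Tree, qa_answers: dict) -> bool:
            if isinstance(tree, str):
                return qa_answers[tree]  # answer produced by the QA step
            vals = [evaluate(c, qa_answers) for c in tree.children]
            if tree.op == "and":
                return all(vals)
            if tree.op == "or":
                return any(vals)
            return not vals[0]           # "not" has a single child

        policy = Node("and", ["Is the sender a customer?",
                              Node("or", ["Was consent given?",
                                          "Is it legally required?"])])
        answers = {"Is the sender a customer?": True,
                   "Was consent given?": False,
                   "Is it legally required?": True}
        print(evaluate(policy, answers))  # True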
    Psychotic Relapse Prediction in Schizophrenia Patients using A Mobile Sensing-based Supervised Deep Learning Model. (arXiv:2205.12225v1 [cs.LG])
    Mobile sensing-based modeling of behavioral changes could predict an oncoming psychotic relapse in schizophrenia patients for timely interventions. Deep learning models could complement existing non-deep learning models for relapse prediction by modeling latent behavioral features relevant to the prediction. However, given the inter-individual behavioral differences, model personalization might be required for a predictive model. In this work, we propose RelapsePredNet, a Long Short-Term Memory (LSTM) neural network-based model for relapse prediction. The model is personalized for a particular patient by training on data from the patients most similar to that patient. Several demographics and baseline mental health scores were considered as personalization metrics to define patient similarity. We investigated the effect of personalization on training dataset characteristics, learned embeddings, and relapse prediction performance. We compared RelapsePredNet with a deep learning-based anomaly detection model for relapse prediction. Further, we investigated whether RelapsePredNet could complement ClusterRFModel (a random forest model leveraging clustering and template features proposed in prior work) in a fusion model, by identifying latent behavioral features relevant to relapse prediction. The CrossCheck dataset, consisting of continuous mobile sensing data obtained from 63 schizophrenia patients, each monitored for up to a year, was used for our evaluations. The proposed RelapsePredNet outperformed the deep learning-based anomaly detection model for relapse prediction. The F2 scores for prediction were 0.21 and 0.52 on the full test set and the Relapse Test Set (consisting only of data from patients who relapsed), respectively, corresponding to 29.4% and 38.8% improvements over the existing deep learning-based model for relapse prediction.
    Forecasting Multilinear Data via Transform-Based Tensor Autoregression. (arXiv:2205.12201v1 [cs.LG])
    In the era of big data, there is an increasing demand for new methods for analyzing and forecasting 2-dimensional data. The current research aims to accomplish these goals through the combination of time-series modeling and multilinear algebraic systems. We expand previous autoregressive techniques to forecast multilinear data, aptly named the L-Transform Tensor AutoRegressive model (L-TAR for short). Tensor decompositions and multilinear tensor products make this approach a feasible forecasting method. We achieve statistical independence between the columns of the observations through invertible discrete linear transforms, enabling a divide and conquer approach. We present an experimental validation of the proposed methods on datasets containing image collections, video sequences, sea surface temperature measurements, stock prices, and networks.
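    A stripped-down sketch of the transform-then-autoregress idea, using a DCT as the invertible linear transform and an AR(1) fit per transformed column (the paper supports richer transforms and higher orders; this is our simplification):

        import numpy as np
        from scipy.fft import dct, idct

        def ltar_forecast(X_ts):
            # X_ts: (T, m, n) -- a series of T observed matrices.
            Z = dct(X_ts, axis=2, norm="ortho")   # decorrelate columns
            Z_next = np.empty_like(Z[0])
            for j in range(Z.shape[2]):
                past = Z[:-1, :, j].ravel()
                future = Z[1:, :, j].ravel()
                # Least-squares AR(1) coefficient for transformed column j.
                a = past @ future / max(past @ past, 1e-12)
                Z_next[:, j] = a * Z[-1, :, j]
            return idct(Z_next, axis=1, norm="ortho")

    Because the transform renders columns approximately independent, each can be modeled and forecast separately, which is the divide-and-conquer structure the abstract describes.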
    Learning multi-scale functional representations of proteins from single-cell microscopy data. (arXiv:2205.11676v1 [q-bio.QM])
    Protein function is inherently linked to its localization within the cell, and fluorescent microscopy data is an indispensable resource for learning representations of proteins. Despite major developments in molecular representation learning, extracting functional information from biological images remains a non-trivial computational task. Current state-of-the-art approaches use autoencoder models to learn high-quality features by reconstructing images. However, such methods are prone to capturing noise and imaging artifacts. In this work, we revisit deep learning models used for classifying major subcellular localizations, and evaluate representations extracted from their final layers. We show that simple convolutional networks trained on localization classification can learn protein representations that encapsulate diverse functional information, and significantly outperform autoencoder-based models. We also propose a robust evaluation strategy to assess quality of protein representations across different scales of biological function.
    Compressing Deep Graph Neural Networks via Adversarial Knowledge Distillation. (arXiv:2205.11678v1 [cs.LG])
    Deep graph neural networks (GNNs) have been shown to be expressive for modeling graph-structured data. Nevertheless, the over-stacked architecture of deep graph models makes it difficult to deploy and rapidly test them on mobile or embedded systems. To compress over-stacked GNNs, knowledge distillation via a teacher-student architecture turns out to be an effective technique, where the key step is to measure the discrepancy between teacher and student networks with predefined distance functions. However, using the same distance for graphs of various structures may be ill-suited, and the optimal distance formulation is hard to determine. To tackle these problems, we propose a novel Adversarial Knowledge Distillation framework for graph models named GraphAKD, which adversarially trains a discriminator and a generator to adaptively detect and decrease the discrepancy. Specifically, noticing that the well-captured inter-node and inter-class correlations favor the success of deep GNNs, we propose to criticize the inherited knowledge from node-level and class-level views with a trainable discriminator. The discriminator distinguishes between teacher knowledge and what the student inherits, while the student GNN works as a generator and aims to fool the discriminator. To the best of our knowledge, GraphAKD is the first to introduce adversarial training to knowledge distillation in graph domains. Experiments on node-level and graph-level classification benchmarks demonstrate that GraphAKD improves the student performance by a large margin. The results imply that GraphAKD can precisely transfer knowledge from a complicated teacher GNN to a compact student GNN.
    FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid? (arXiv:2205.11656v1 [cs.LG])
    The existence of a plethora of language models makes the problem of selecting the best one for a custom task challenging. Most state-of-the-art methods leverage transformer-based models (e.g., BERT) or their variants. Training such models and exploring their hyperparameter space, however, is computationally expensive. Prior work proposes several neural architecture search (NAS) methods that employ performance predictors (e.g., surrogate models) to address this issue; however, analysis has been limited to homogeneous models that use fixed dimensionality throughout the network. This leads to sub-optimal architectures. To address this limitation, we propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations and different hidden dimensions. For better-posed surrogate modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization, to quickly train and use a neural surrogate model to converge to the optimal architecture. A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score. A FlexiBERT model with performance equivalent to the best homogeneous model is 2.6x smaller. FlexiBERT-Large, another proposed model, achieves state-of-the-art results, outperforming the baseline models by at least 5.7% on the GLUE benchmark.
    Improving Fairness for Data Valuation in Horizontal Federated Learning. (arXiv:2109.09046v3 [cs.LG] UPDATED)
    Federated learning is an emerging decentralized machine learning scheme that allows multiple data owners to work collaboratively while ensuring data privacy. The success of federated learning depends largely on the participation of data owners. To sustain and encourage data owners' participation, it is crucial to fairly evaluate the quality of the data provided by the data owners and reward them correspondingly. Federated Shapley value, recently proposed by Wang et al. [Federated Learning, 2020], is a measure for data value under the framework of federated learning that satisfies many desired properties for data valuation. However, there are still factors of potential unfairness in the design of federated Shapley value because two data owners with the same local data may not receive the same evaluation. We propose a new measure called completed federated Shapley value to improve the fairness of federated Shapley value. The design depends on completing a matrix consisting of all the possible contributions by different subsets of the data owners. It is shown under mild conditions that this matrix is approximately low-rank by leveraging concepts and tools from optimization. Both theoretical analysis and empirical evaluation verify that the proposed measure does improve fairness in many circumstances.
    Intelligence Primer. (arXiv:2008.07324v3 [cs.AI] UPDATED)
    Intelligence is a fundamental part of all living things, as well as the foundation for Artificial Intelligence. In this primer we explore the ideas associated with intelligence and, by doing so, understand the implications and constraints and potentially outline the capabilities of future systems. Artificial Intelligence, in the form of Machine Learning, has already had a significant impact on our lives. As an exploration, we journey into different parts of intelligence that appear essential. We hope that people find this helpful in thinking about the future. Also, during the exploration, we hope to create new thought-provoking questions. Intelligence is not a single weighable quantity but a subject that spans Biology, Physics, Philosophy, Cognitive Science, Neuroscience, Psychology, and Computer Science. The historian Yuval Noah Harari pointed out that engineers and scientists in the future will have to broaden their understanding to include disciplines such as Psychology, Philosophy, and Ethics. Fiction writers have long portrayed engineers and scientists as deficient in these areas. Today, in modern society, the emergence of Artificial Intelligence and legal requirements act as forcing functions to push these broader subjects into the foreground. We start with an introduction to intelligence and move quickly to more profound thoughts and ideas. We call this a Life, the Universe, and Everything primer, after the famous science fiction book by Douglas Adams. Forty-two may be the correct answer, but what are the questions?
    Throwing Away Data Improves Worst-Class Error in Imbalanced Classification. (arXiv:2205.11672v1 [stat.ML])
    Class imbalances pervade classification problems, yet their treatment differs in theory and practice. On the one hand, learning theory instructs us that \emph{more data is better}, as sample size relates inversely to the average test error over the entire data distribution. On the other hand, practitioners have long developed a plethora of tricks to improve the performance of learning machines over imbalanced data. These include data reweighting and subsampling, synthetic construction of additional samples from minority classes, ensembling expensive one-versus-all architectures, and tweaking classification losses and thresholds. All of these are efforts to minimize the worst-class error, which is often associated with the minority group in the training data, and finds additional motivation in the robustness, fairness, and out-of-distribution literatures. Here we take on the challenge of developing learning theory able to describe the worst-class error of classifiers over linearly-separable data when fitted either on (i) the full training set, or (ii) a subset where the majority class is subsampled to match in size the minority class. We borrow tools from extreme value theory to show that, under distributions with certain tail properties, \emph{throwing away most data from the majority class leads to better worst-class error}.
    Eigenvalue and Generalized Eigenvalue Problems: Tutorial. (arXiv:1903.11240v2 [stat.ML] UPDATED)
    This paper is a tutorial for eigenvalue and generalized eigenvalue problems. We first introduce the eigenvalue problem, eigen-decomposition (spectral decomposition), and the generalized eigenvalue problem. Then, we present the optimization problems that give rise to eigenvalue and generalized eigenvalue problems. We also provide examples from machine learning, including principal component analysis, kernel supervised principal component analysis, and Fisher discriminant analysis, which result in eigenvalue and generalized eigenvalue problems. Finally, we introduce the solutions to both eigenvalue and generalized eigenvalue problems.
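    For readers skimming the digest, the two central objects of the tutorial are

        A v = \lambda v \qquad \text{and} \qquad A v = \lambda B v,

    and a canonical optimization giving rise to the generalized form is

        \max_{v} \; v^\top A v \quad \text{s.t.} \quad v^\top B v = 1,

    whose Lagrangian stationarity condition is exactly $A v = \lambda B v$, with the multiplier $\lambda$ as the generalized eigenvalue. PCA and Fisher discriminant analysis instantiate this with particular choices of $A$ and $B$ (e.g., between-class and within-class scatter matrices for the latter).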
    Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment. (arXiv:2205.11616v1 [cs.CL])
    Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods. Recent findings have shown that the accuracy and robustness of unsupervised word translation (UWT) can be improved by making use of visual observations, which are universal representations across languages. In this work, we investigate the potential of using not only visual observations but also pretrained language-image models for enabling a more efficient and robust UWT. Specifically, we develop a novel UWT method dubbed Word Alignment using Language-Image Pretraining (WALIP), which leverages visual observations via the shared embedding space of images and texts provided by CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we retrieve word pairs with high similarity confidence, computed using our proposed image-based fingerprints, which define the initial pivot for word alignment. Second, we apply our robust Procrustes algorithm to estimate the linear mapping between the two embedding spaces, iteratively correcting and refining the estimated alignment. Our extensive experiments show that WALIP improves upon the state-of-the-art performance of bilingual word alignment for a few language pairs across different word embeddings and displays great robustness to the dissimilarity of language pairs or training corpora for the two word embeddings.
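    The (non-robust) core of step two is classical orthogonal Procrustes, which has a closed-form SVD solution; a toy sketch below (WALIP's robust variant additionally reweights and iteratively refines the pairs):

        import numpy as np

        def procrustes_map(X, Y):
            # Orthogonal W minimizing ||W X - Y||_F for paired columns.
            U, _, Vt = np.linalg.svd(Y @ X.T)
            return U @ Vt

        d, k = 300, 500                     # embedding dim, seed word pairs
        rng = np.random.default_rng(0)
        X = rng.normal(size=(d, k))         # source-language embeddings
        W_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
        Y = W_true @ X                      # toy target: exact rotation
        W = procrustes_map(X, Y)
        print(np.allclose(W, W_true))       # True in this noiseless toy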
    Deep Representations for Time-varying Brain Datasets. (arXiv:2205.11648v1 [cs.LG])
    Finding an appropriate representation of dynamic activities in the brain is crucial for many downstream applications. Due to its highly dynamic nature, temporally averaged fMRI (functional magnetic resonance imaging) can only provide a narrow view of underlying brain activities. Previous works lack the ability to learn and interpret the latent dynamics in brain architectures. This paper builds an efficient graph neural network model that incorporates both region-mapped fMRI sequences and structural connectivities obtained from DWI (diffusion-weighted imaging) as inputs. We find good representations of the latent brain dynamics through learning sample-level adaptive adjacency matrices and performing a novel multi-resolution inner cluster smoothing. These modules can be easily adapted to and are potentially useful for other applications outside the neuroscience domain. We also attribute inputs with integrated gradients, which enables us to infer (1) highly involved brain connections and subnetworks for each task, (2) temporal keyframes of imaging sequences that characterize tasks, and (3) subnetworks that discriminate between individual subjects. This ability to identify critical subnetworks that characterize signal states across heterogeneous tasks and individuals is of great importance to neuroscience and other scientific domains. Extensive experiments and ablation studies demonstrate our proposed method's superiority and efficiency in spatial-temporal graph signal modeling with insightful interpretations of brain dynamics.
    A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature. (arXiv:2205.11651v1 [cs.DL])
    Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.
    DOGE-Train: Discrete Optimization on GPU with End-to-end Training. (arXiv:2205.11638v1 [cs.LG])
    We present a fast, scalable, data-driven approach for solving linear relaxations of 0-1 integer linear programs using a graph neural network. Our solver is based on the Lagrange decomposition based algorithm FastDOG (Abbas et al. (2022)). We make the algorithm differentiable and perform backpropagation through the dual update scheme for end-to-end training of its algorithmic parameters. This allows us to preserve the algorithm's theoretical properties including feasibility and guaranteed non-decrease in the lower bound. Since FastDOG can get stuck in suboptimal fixed points, we provide additional freedom to our graph neural network to predict non-parametric update steps for escaping such points while maintaining dual feasibility. For training the graph neural network we use an unsupervised loss and perform experiments on large-scale real-world datasets. We train on smaller problems and test on larger ones, showing strong generalization performance with a graph neural network comprising only around 10k parameters. Our solver achieves significantly faster performance and better dual objectives than its non-learned version. In comparison to commercial solvers, our learned solver achieves close to optimal objective values of LP relaxations and is faster by up to an order of magnitude on very large problems from structured prediction and on selected combinatorial optimization problems.
    Generalization Gap in Amortized Inference. (arXiv:2205.11640v1 [stat.ML])
    The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic models - the Variational Auto-Encoder (VAE). We point out the two generalization gaps that can affect the generalization ability of VAEs and show that the over-fitting phenomenon is usually dominated by the amortized inference network. Based on this observation we propose a new training objective, inspired by the classic wake-sleep algorithm, to improve the generalization properties of amortized inference. We also demonstrate how it can improve generalization performance in the context of image modeling and lossless compression.
    History Compression via Language Models in Reinforcement Learning. (arXiv:2205.12258v1 [cs.LG])
    In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training the Transformer, we introduce FrozenHopfield, which automatically associates observations with original token embeddings. To form these associations, a modern Hopfield network stores the original token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
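    A minimal sketch of the FrozenHopfield association as we read it (dimensions, scaling, and the inverse temperature beta are our placeholder choices):

        import numpy as np

        def frozen_hopfield_retrieve(obs, E, P, beta=1.0):
            # P: fixed random projection into token-embedding space.
            # E: frozen pretrained token embeddings (the Hopfield memory).
            q = P @ obs                      # project the observation
            logits = beta * (E @ q)          # similarity to each token
            p = np.exp(logits - logits.max())
            p /= p.sum()
            return p @ E                     # retrieved representation

        vocab, d_tok, d_obs = 1000, 64, 12
        rng = np.random.default_rng(0)
        E = rng.normal(size=(vocab, d_tok))
        P = rng.normal(size=(d_tok, d_obs)) / np.sqrt(d_obs)
        print(frozen_hopfield_retrieve(rng.normal(size=d_obs), E, P).shape)

    Since both P and E stay frozen, no gradient ever flows into the language Transformer, which is what makes the method so sample efficient.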
    uGLAD: Sparse graph recovery by optimizing deep unrolled networks. (arXiv:2205.11610v1 [cs.LG])
    Probabilistic Graphical Models (PGMs) are generative models of complex systems. They rely on conditional independence assumptions between variables to learn sparse representations, which can be visualized in the form of a graph. Such models are used for domain exploration and structure discovery in poorly understood domains. This work introduces a novel technique to perform sparse graph recovery by optimizing deep unrolled networks. Assuming that the input data $X\in\mathbb{R}^{M\times D}$ comes from an underlying multivariate Gaussian distribution, we apply a deep model on $X$ that outputs the precision matrix $\Theta$, which can also be interpreted as the adjacency matrix. Our model, uGLAD, builds upon and extends the state-of-the-art model GLAD to the unsupervised setting. The key benefits of our model are: (1) uGLAD automatically optimizes sparsity-related regularization parameters, leading to better performance than existing algorithms; (2) we introduce a multi-task-learning-based `consensus' strategy for robust handling of missing data in an unsupervised setting. We evaluate model results on synthetic Gaussian data and non-Gaussian data generated from Gene Regulatory Networks, and present a case study in anaerobic digestion.
    PrivFairFL: Privacy-Preserving Group Fairness in Federated Learning. (arXiv:2205.11584v1 [cs.LG])
    Group fairness ensures that the outcome of machine learning (ML) based decision making systems are not biased towards a certain group of people defined by a sensitive attribute such as gender or ethnicity. Achieving group fairness in Federated Learning (FL) is challenging because mitigating bias inherently requires using the sensitive attribute values of all clients, while FL is aimed precisely at protecting privacy by not giving access to the clients' data. As we show in this paper, this conflict between fairness and privacy in FL can be resolved by combining FL with Secure Multiparty Computation (MPC) and Differential Privacy (DP). In doing so, we propose a method for training group-fair ML models in cross-device FL under complete and formal privacy guarantees, without requiring the clients to disclose their sensitive attribute values.
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v1 [stat.ML])
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions; we extensively discuss the Gaussian case, for which the updates take a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior involve neither gradients with respect to the model parameters nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details of its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.
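    The simplification such updates exploit is a standard exponential-family identity (our summary, not the paper's exact derivation): with natural parameters $\lambda$ and mean parameters $m = \mathbb{E}_q[t(z)]$, the Fisher information satisfies $F(\lambda) = \partial m / \partial \lambda$, so the natural-gradient ascent step on the variational objective $\mathcal{L}$ collapses to

        \lambda_{t+1} \;=\; \lambda_t + \rho\, F(\lambda_t)^{-1} \nabla_{\lambda} \mathcal{L}
        \;=\; \lambda_t + \rho\, \nabla_{m} \mathcal{L},

    i.e. an ordinary gradient with respect to the mean parameters, with no explicit Fisher matrix to form or invert.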
    Cardiomegaly Detection using Deep Convolutional Neural Network with U-Net. (arXiv:2205.11515v1 [eess.IV])
    Cardiomegaly is a medical condition in which the heart is enlarged. It is easier to manage when caught early, so early detection is critical. The chest X-ray, being one of the most frequently used radiography examinations, has been used to detect and visualize abnormalities of human organs for decades, and is a significant diagnostic tool for cardiomegaly. Even for domain experts, distinguishing the many types of diseases from an X-ray is a difficult and time-consuming task. Deep learning models are most effective when trained on huge datasets, yet due to privacy concerns, large datasets are rarely available in the medical industry. A customized, retrained deep-learning U-Net model for detecting cardiomegaly is presented in this research. In the training phase, chest X-ray images from the "ChestX-ray8" open source dataset are used. To reduce computing time, the pipeline performs data preprocessing, image enhancement, image compression, and classification before the training step. On this chest X-ray dataset the model achieved a diagnostic accuracy of 94%, a sensitivity of 96.2%, and a specificity of 92.5%, which beats findings from prior pre-trained models for identifying cardiomegaly.
    BolT: Fused Window Transformers for fMRI Time Series Analysis. (arXiv:2205.11578v1 [eess.SP])
    Functional magnetic resonance imaging (fMRI) enables examination of inter-regional interactions in the brain via functional connectivity (FC) analyses that measure the synchrony between the temporal activations of separate regions. Given their exceptional sensitivity, deep-learning methods have received growing interest for FC analyses of high-dimensional fMRI data. In this domain, models that operate directly on raw time series as opposed to pre-computed FC features have the potential benefit of leveraging the full scale of information present in fMRI data. However, previous models are based on architectures suboptimal for temporal integration of representations across multiple time scales. Here, we present BolT, blood-oxygen-level-dependent transformer, for analyzing multi-variate fMRI time series. BolT leverages a cascade of transformer encoders equipped with a novel fused window attention mechanism. Transformer encoding is performed on temporally-overlapped time windows within the fMRI time series to capture short time-scale representations. To integrate information across windows, cross-window attention is computed between base tokens in each time window and fringe tokens from neighboring time windows. To transition from local to global representations, the extent of window overlap and thereby number of fringe tokens is progressively increased across the cascade. Finally, a novel cross-window regularization is enforced to align the high-level representations of global $CLS$ features across time windows. Comprehensive experiments on public fMRI datasets clearly illustrate the superior performance of BolT against state-of-the-art methods. Posthoc explanatory analyses to identify landmark time points and regions that contribute most significantly to model decisions corroborate prominent neuroscientific findings from recent fMRI studies.
    Privacy-preserving Data Filtering in Federated Learning Using Influence Approximation. (arXiv:2205.11518v1 [cs.CR])
    Federated Learning by nature is susceptible to low-quality, corrupted, or even malicious data that can severely degrade the quality of the learned model. Traditional techniques for data valuation cannot be applied as the data is never revealed. We present a novel technique for filtering and scoring data based on a practical influence approximation that can be implemented in a privacy-preserving manner. Each agent uses his own data to evaluate the influence of another agent's batch, and reports to the center an obfuscated score using differential privacy. Our technique allows for almost perfect ($>92\%$ recall) filtering of corrupted data in a variety of applications using real data. Importantly, the accuracy does not degrade significantly even under very strong privacy guarantees ($\varepsilon \leq 1$), especially under realistic percentages of mislabeled data (for $15\%$ mislabeled data we only lose $10\%$ in accuracy).
    Byzantine Machine Learning Made Easy by Resilient Averaging of Momentums. (arXiv:2205.12173v1 [cs.LG])
    Byzantine resilience emerged as a prominent topic within the distributed machine learning community. Essentially, the goal is to enhance distributed optimization algorithms, such as distributed SGD, in a way that guarantees convergence despite the presence of some misbehaving (a.k.a., {\em Byzantine}) workers. Although a myriad of techniques addressing the problem have been proposed, the field arguably rests on fragile foundations. These techniques are hard to prove correct and rely on assumptions that are (a) quite unrealistic, i.e., often violated in practice, and (b) heterogeneous, i.e., making it difficult to compare approaches. We present \emph{RESAM (RESilient Averaging of Momentums)}, a unified framework that makes it simple to establish optimal Byzantine resilience, relying only on standard machine learning assumptions. Our framework is mainly composed of two operators: \emph{resilient averaging} at the server and \emph{distributed momentum} at the workers. We prove a general theorem stating the convergence of distributed SGD under RESAM. Interestingly, the convergence of many existing techniques can then be demonstrated and compared as direct corollaries of our theorem, without resorting to stringent assumptions. We also present an empirical evaluation of the practical relevance of RESAM.
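    As one concrete instance of a resilient-averaging operator (RESAM is a framework and admits several; the choice below is ours for illustration), the coordinate-wise median tolerates a minority of arbitrarily corrupted momentum vectors:

        import numpy as np

        def resilient_average(momentums):
            # Coordinate-wise median over the workers' momentum vectors.
            return np.median(np.stack(momentums), axis=0)

        rng = np.random.default_rng(0)
        honest = [np.ones(4) + 0.01 * rng.normal(size=4) for _ in range(7)]
        byzantine = [np.full(4, 1e6) for _ in range(3)]   # attackers
        print(resilient_average(honest + byzantine))      # stays near 1.0

    Per the abstract, the server applies such an operator to worker momentums rather than raw gradients, pairing it with distributed momentum at the workers.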
    Learning for Expressive Task-Related Sentence Representations. (arXiv:2205.12186v1 [cs.CL])
    NLP models learn sentence representations for downstream tasks by tuning a model which is pre-trained by masked language modeling. However, after tuning, the learned sentence representations may be skewed heavily toward label space and thus are not expressive enough to represent whole samples, which should contain task-related information of both sentence inputs and labels. In this work, we learn expressive sentence representations for supervised tasks which (1) contain task-related information in the sentence inputs, and (2) enable correct label predictions. To achieve this goal, we first propose a new objective which explicitly points out the label token space in the input, and predicts categories of labels via an added [MASK] token. This objective encourages fusing the semantic information of both the label and sentence. Then we develop a neighbor attention module, added on a frozen pre-trained model, to build connections between label/sentence tokens via their neighbors. The propagation can be further guided by the regularization on neighborhood representations to encourage expressiveness. Experimental results show that, despite tuning only 5% additional parameters over a frozen pre-trained model, our model can achieve classification results comparable to the SOTA while maintaining strong expressiveness as well.
    Compression-aware Training of Neural Networks using Frank-Wolfe. (arXiv:2205.11921v1 [cs.LG])
    Many existing Neural Network pruning approaches either rely on retraining to compensate for pruning-caused performance degradation or they induce strong biases to converge to a specific sparse solution throughout training. A third paradigm obtains a wide range of compression ratios from a single dense training run while also avoiding retraining. Recent work of Pokutta et al. (2020) and Miao et al. (2022) suggests that the Stochastic Frank-Wolfe (SFW) algorithm is particularly suited for training state-of-the-art models that are robust to compression. We propose leveraging $k$-support norm ball constraints and demonstrate significant improvements over the results of Miao et al. (2022) in the case of unstructured pruning. We also extend these ideas to the structured pruning domain and propose novel approaches to both ensure robustness to the pruning of convolutional filters as well as to low-rank tensor decompositions of convolutional layers. In the latter case, our approach performs on-par with nuclear-norm regularization baselines while requiring only half of the computational resources. Our findings also indicate that the robustness of SFW-trained models largely depends on the gradient rescaling of the learning rate and we establish a theoretical foundation for that practice.
    Neur2SP: Neural Two-Stage Stochastic Programming. (arXiv:2205.12006v1 [math.OC])
    Stochastic programming is a powerful modeling framework for decision-making under uncertainty. In this work, we tackle two-stage stochastic programs (2SPs), the most widely applied and studied class of stochastic programming models. Solving 2SPs exactly requires evaluation of an expected value function that is computationally intractable. Additionally, having a mixed-integer linear program (MIP) or a nonlinear program (NLP) in the second stage further aggravates the problem difficulty. In such cases, solving them can be prohibitively expensive even if specialized algorithms that exploit problem structure are employed. Finding high-quality (first-stage) solutions -- without leveraging problem structure -- can be crucial in such settings. We develop Neur2SP, a new method that approximates the expected value function via a neural network to obtain a surrogate model that can be solved more efficiently than the traditional extensive formulation approach. Moreover, Neur2SP makes no assumptions about the problem structure, in particular about the second-stage problem, and can be implemented using an off-the-shelf solver and open-source libraries. Our extensive computational experiments on benchmark 2SP datasets from four problem classes with different structures (containing MIP and NLP second-stage problems) show the efficiency (time) and efficacy (solution quality) of Neur2SP. Specifically, the proposed method takes less than 1.66 seconds across all problems, achieving high-quality solutions even as the number of scenarios increases, an ideal property that is difficult to achieve with traditional 2SP solution techniques. By comparison, the most generic baseline method typically requires minutes to hours to find solutions of comparable quality.
    Deep Low-Density Separation for Semi-Supervised Classification. (arXiv:2205.11995v1 [cs.LG])
    Given a small set of labeled data and a large set of unlabeled data, semi-supervised learning (SSL) attempts to leverage the location of the unlabeled datapoints in order to create a better classifier than could be obtained from supervised methods applied to the labeled training set alone. Effective SSL imposes structural assumptions on the data, e.g. that neighbors are more likely to share a classification or that the decision boundary lies in an area of low density. For complex and high-dimensional data, neural networks can learn feature embeddings to which traditional SSL methods can then be applied in what we call hybrid methods. Previously-developed hybrid methods iterate between refining a latent representation and performing graph-based SSL on this representation. In this paper, we introduce a novel hybrid method that instead applies low-density separation to the embedded features. We describe it in detail and discuss why low-density separation may be better suited for SSL on neural network-based embeddings than graph-based algorithms. We validate our method using in-house customer survey data and compare it to other state-of-the-art learning methods. Our approach effectively classifies thousands of unlabeled users from a relatively small number of hand-classified examples.
    Ensemble Multi-Relational Graph Neural Networks. (arXiv:2205.12076v1 [cs.LG])
    It is well established that graph neural networks (GNNs) can be interpreted and designed from the perspective of an optimization objective. With a clear optimization objective, the deduced GNN architecture has a sound theoretical foundation and can flexibly remedy weaknesses of GNNs. However, this optimization objective has only been proved for GNNs on single-relational graphs. Can we infer a new type of GNN for multi-relational graphs by extending this optimization objective, so as to simultaneously solve the issues of previous multi-relational GNNs, e.g., over-parameterization? In this paper, we propose a novel ensemble multi-relational GNN by designing an ensemble multi-relational (EMR) optimization objective. This EMR optimization objective yields an iterative updating rule, which can be formalized as an ensemble message passing (EnMP) layer over multiple relations. We further analyze the nice properties of the EnMP layer, e.g., its relationship with multi-relational personalized PageRank. Finally, a new multi-relational GNN is proposed that effectively alleviates the over-smoothing and over-parameterization issues. Extensive experiments conducted on four benchmark datasets demonstrate the effectiveness of the proposed model.
    Faithful Explanations for Deep Graph Models. (arXiv:2205.11850v1 [cs.LG])
    This paper studies faithful explanations for Graph Neural Networks (GNNs). First, we provide a new and general method for formally characterizing the faithfulness of explanations for GNNs. It applies to existing explanation methods, including feature attributions and subgraph explanations. Second, our analytical and empirical results demonstrate that feature attribution methods cannot capture the nonlinear effect of edge features, while existing subgraph explanation methods are not faithful. Third, we introduce \emph{k-hop Explanation with a Convolutional Core} (KEC), a new explanation method that provably maximizes faithfulness to the original GNN by leveraging information about the graph structure in its adjacency matrix and its \emph{k-th} power. Lastly, our empirical results over both synthetic and real-world datasets for classification and anomaly detection tasks with GNNs demonstrate the effectiveness of our approach.
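    The structural ingredient the method leverages is inexpensive to compute; as a generic illustration (not the authors' code), a $k$-hop reachability mask can be read off the powers of the adjacency matrix:

        import numpy as np

        def k_hop_mask(A, k):
            # Nodes reachable within k hops, from powers of the adjacency matrix.
            n = len(A)
            reach, Ak = np.eye(n), np.eye(n)
            for _ in range(k):
                Ak = Ak @ A
                reach += Ak
            return (reach > 0).astype(float)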
    Quadratic models for understanding neural network dynamics. (arXiv:2205.11787v1 [cs.LG])
    In this work, we propose using a quadratic model as a tool for understanding properties of wide neural networks in both optimization and generalization. We show analytically that certain deep learning phenomena such as the "catapult phase" from [Lewkowycz et al. 2020], which cannot be captured by linear models, are manifested in the quadratic model for shallow ReLU networks. Furthermore, our empirical results indicate that the behaviour of quadratic models parallels that of neural networks in generalization, especially in the large learning rate regime. We expect that quadratic models will serve as a useful tool for analysis of neural networks.
    Concurrent Credit Assignment for Data-efficient Reinforcement Learning. (arXiv:2205.12020v1 [cs.LG])
    The capability to widely sample the state and action spaces is a key ingredient in building effective reinforcement learning algorithms. The variational optimization principles exposed in this paper emphasize the importance of an occupancy model that synthesizes the general distribution of the agent's environmental states over which it can act (defining a virtual ``territory''). The occupancy model is updated frequently as exploration progresses and new states are uncovered during training. Under a uniform prior assumption, the resulting objective expresses a balance between two concurrent tendencies, namely the widening of the occupancy space and the maximization of the rewards, reminiscent of the classical exploration/exploitation trade-off. Implemented in an off-policy actor-critic setting on classic continuous-action benchmarks, the approach is shown to provide a significant increase in sampling efficacy, reflected in reduced training time and higher returns, in both the dense and the sparse reward cases.
    Quantum Kerr Learning. (arXiv:2205.12004v1 [quant-ph])
    Quantum machine learning is a rapidly evolving area that could facilitate important applications for quantum computing and significantly impact data science. In our work, we argue that a single Kerr mode might provide some extra quantum enhancements when using quantum kernel methods based on various reasons from complexity theory and physics. Furthermore, we establish an experimental protocol, which we call \emph{quantum Kerr learning} based on circuit QED. A detailed study using the kernel method, neural tangent kernel theory, first-order perturbation theory of the Kerr non-linearity, and non-perturbative numerical simulations, shows quantum enhancements could happen in terms of the convergence time and the generalization error, while explicit protocols are also constructed for higher-dimensional input data.
    Pynblint: a Static Analyzer for Python Jupyter Notebooks. (arXiv:2205.11934v1 [cs.SE])
    Jupyter Notebook is the tool of choice of many data scientists in the early stages of ML workflows. The notebook format, however, has been criticized for inducing bad programming practices; indeed, researchers have already shown that open-source repositories are inundated by poor-quality notebooks. Low-quality output from the prototypical stages of ML workflows constitutes a clear bottleneck towards the productization of ML models. To foster the creation of better notebooks, we developed Pynblint, a static analyzer for Jupyter notebooks written in Python. The tool checks the compliance of notebooks (and surrounding repositories) with a set of empirically validated best practices and provides targeted recommendations when violations are detected.
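    As a flavor of what such a linter checks, the sketch below (a hypothetical check written for illustration, not Pynblint's implementation) flags notebooks whose code cells were executed out of order, one commonly cited notebook smell:

        import json
        import sys

        def linear_execution_order(path):
            # Code cells should have monotonically increasing execution counts.
            with open(path) as f:
                nb = json.load(f)
            counts = [c.get("execution_count") for c in nb["cells"]
                      if c["cell_type"] == "code" and c.get("execution_count")]
            return counts == sorted(counts)

        if __name__ == "__main__":
            print("linear execution order:", linear_execution_order(sys.argv[1]))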
    Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture. (arXiv:2205.11786v1 [cs.LG])
    In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity. The width of these general networks is characterized by the minimum in-degree of their neurons, except for the input and first layers. Our results identify the mathematical structure underlying transition to linearity and generalize a number of recent works aimed at characterizing transition to linearity or constancy of the Neural Tangent Kernel for standard architectures.
    Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free. (arXiv:2205.11819v1 [cs.LG])
    Trojan attacks threaten deep neural networks (DNNs) by poisoning them to behave normally on most samples, yet to produce manipulated results for inputs attached with a particular trigger. Several works attempt to detect whether a given DNN has been injected with a specific trigger during the training. In a parallel line of research, the lottery ticket hypothesis reveals the existence of sparse subnetworks which are capable of reaching competitive performance as the dense network after independent training. Connecting these two dots, we investigate the problem of Trojan DNN detection from the brand new lens of sparsity, even when no clean training data is available. Our crucial observation is that the Trojan features are significantly more stable to network pruning than benign features. Leveraging that, we propose a novel Trojan network detection regime: first locating a "winning Trojan lottery ticket" which preserves nearly full Trojan information yet only chance-level performance on clean inputs; then recovering the trigger embedded in this already isolated subnetwork. Extensive experiments on various datasets, i.e., CIFAR-10, CIFAR-100, and ImageNet, with different network architectures, i.e., VGG-16, ResNet-18, ResNet-20s, and DenseNet-100 demonstrate the effectiveness of our proposal. Codes are available at https://github.com/VITA-Group/Backdoor-LTH.
    Functional Network: A Novel Framework for Interpretability of Deep Neural Networks. (arXiv:2205.11702v1 [cs.LG])
    The layered structure of deep neural networks hinders the use of numerous analysis tools and thus the development of its interpretability. Inspired by the success of functional brain networks, we propose a novel framework for interpretability of deep neural networks, that is, the functional network. We construct the functional network of fully connected networks and explore its small-worldness. In our experiments, the mechanisms of regularization methods, namely, batch normalization and dropout, are revealed using graph theoretical analysis and topological data analysis. Our empirical analysis shows the following: (1) Batch normalization enhances model performance by increasing the global efficiency and the number of loops but reduces adversarial robustness by lowering the fault tolerance. (2) Dropout improves generalization and robustness of models by improving the functional specialization and fault tolerance. (3) The models with different regularizations can be clustered correctly according to their functional topological differences, reflecting the great potential of the functional network and topological data analysis in interpretability.
    Soft-SVM Regression For Binary Classification. (arXiv:2205.11735v1 [stat.ML])
    The binomial deviance and the SVM hinge loss functions are two of the most widely used loss functions in machine learning. While there are many similarities between them, they also have their own strengths when dealing with different types of data. In this work, we introduce a new exponential family based on a convex relaxation of the hinge loss function using softness and class-separation parameters. This new family, denoted Soft-SVM, allows us to prescribe a generalized linear model that effectively bridges between logistic regression and SVM classification. This new model is interpretable and avoids data separability issues, attaining good fitting and predictive performance by automatically adjusting for data label separability via the softness parameter. These results are confirmed empirically through simulations and case studies as we compare regularized logistic, SVM, and Soft-SVM regressions and conclude that the proposed model performs well in terms of both classification and prediction errors.
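    One plausible form of such a softened hinge (an illustrative guess at the construction, ignoring the class-separation parameter, and not necessarily the paper's exact family) is the softplus smoothing below, which recovers the hinge as the softness $s \to 0$ and a shifted logistic loss at $s = 1$:

        import numpy as np

        def soft_hinge(z, s):
            # z = y * f(x) is the classification margin.
            # s * log(1 + exp((1 - z) / s)) tends to max(0, 1 - z) as s -> 0,
            # and equals log(1 + exp(1 - z)), a shifted logistic loss, at s = 1.
            return s * np.logaddexp(0.0, (1.0 - z) / s)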
    RCC-GAN: Regularized Compound Conditional GAN for Large-Scale Tabular Data Synthesis. (arXiv:2205.11693v1 [cs.LG])
    This paper introduces a novel generative adversarial network (GAN) for synthesizing large-scale tabular databases which contain various feature types, such as continuous, discrete, and binary. Technically, our GAN belongs to the category of class-conditioned generative models with a predefined conditional vector. However, we propose a new formulation for deriving such a vector that incorporates both binary and discrete features simultaneously. We refer to this novel definition as the compound conditional vector and employ it for training the generator network. The core architecture of this network is a three-layered deep residual neural network with skip connections. To improve the stability of such a complex architecture, we present a regularization scheme that limits abrupt variations of its weight vectors during training. This regularization approach is quite compatible with the nature of adversarial training and is not computationally prohibitive at runtime. Furthermore, we constantly monitor the variation of the weight vectors to identify any potential instabilities or irregularities and to measure the strength of our proposed regularizer. Toward this end, we also develop a new metric for tracking sudden perturbations of the weight vectors using singular value decomposition theory. Finally, we evaluate the performance of our proposed synthesis approach on six benchmark tabular databases, namely Adult, Census, HCDR, Cabs, News, and King. The achieved results corroborate that, for the majority of the cases, our proposed RccGAN outperforms other conventional and modern generative models in terms of accuracy, stability, and reliability.
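    The monitoring idea admits a compact sketch; the function below (a generic stand-in for the paper's metric, not its exact definition) tracks the change in the largest singular value of a weight matrix between training steps:

        import numpy as np

        def spectral_drift(W_prev, W_curr):
            # Change in the largest singular value; a sudden jump signals
            # a potential instability in the generator's weights.
            s_prev = np.linalg.svd(W_prev, compute_uv=False)[0]
            s_curr = np.linalg.svd(W_curr, compute_uv=False)[0]
            return abs(s_curr - s_prev)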
    Byzantine-Robust Federated Learning with Optimal Statistical Rates and Privacy Guarantees. (arXiv:2205.11765v1 [cs.LG])
    We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates. In contrast to prior work, our proposed protocols improve the dimension dependence and achieve a tight statistical rate in terms of all the parameters for strongly convex losses. We benchmark against competing protocols and show the empirical superiority of the proposed protocols. Finally, we remark that our protocols with bucketing can be naturally combined with privacy-guaranteeing procedures to introduce security against a semi-honest server. The code for evaluation is provided in https://github.com/wanglun1996/secure-robust-federated-learning.
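    The bucketing step mentioned above can be sketched in a few lines (robust aggregation via a coordinate-wise median of bucket means; the bucket size and seed are placeholders, and the paper's protocols involve more than this):

        import numpy as np

        def bucketed_robust_mean(updates, bucket_size, seed=0):
            # updates: (n_clients, dim) array of client model updates.
            rng = np.random.default_rng(seed)
            idx = rng.permutation(len(updates))
            means = [updates[idx[i:i + bucket_size]].mean(axis=0)
                     for i in range(0, len(updates), bucket_size)]
            # Coordinate-wise median across buckets resists Byzantine clients.
            return np.median(np.stack(means), axis=0)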
    Towards a Defense against Backdoor Attacks in Continual Federated Learning. (arXiv:2205.11736v1 [cs.LG])
    Backdoor attacks are a major concern in federated learning (FL) pipelines where training data is sourced from untrusted clients over long periods of time (i.e., continual learning). Preventing such attacks is difficult because defenders in FL do not have access to raw training data. Moreover, in a phenomenon we call backdoor leakage, models trained continuously eventually suffer from backdoors due to cumulative errors in backdoor defense mechanisms. We propose a novel framework for defending against backdoor attacks in the federated continual learning setting. Our framework trains two models in parallel: a backbone model and a shadow model. The backbone is trained without any defense mechanism to obtain good performance on the main task. The shadow model combines recent ideas from robust covariance estimation-based filters with early-stopping to control the attack success rate even as the data distribution changes. We provide theoretical motivation for this design and show experimentally that our framework significantly improves upon existing defenses against backdoor attacks.
    MOSPAT: AutoML based Model Selection and Parameter Tuning for Time Series Anomaly Detection. (arXiv:2205.11755v1 [cs.LG])
    Organizations leverage anomaly and changepoint detection algorithms to detect changes in user behavior or service availability and performance. Many off-the-shelf detection algorithms, though effective, cannot readily be used in large organizations where thousands of users monitor millions of use cases and metrics with varied time series characteristics and anomaly patterns. The selection of algorithm and parameters needs to be precise for each use case: manual tuning does not scale, and automated tuning requires ground truth, which is rarely available. In this paper, we explore MOSPAT, an end-to-end automated machine learning based approach for model and parameter selection, combined with a generative model to produce labeled data. Our scalable end-to-end system allows individual users in large organizations to tailor time-series monitoring to their specific use case and data characteristics, without expert knowledge of anomaly detection algorithms or laborious manual labeling. Our extensive experiments on real and synthetic data demonstrate that this method consistently outperforms using any single algorithm.
    Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold. (arXiv:2205.11677v1 [stat.ML])
    The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of extensive study on the problem have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting both from a mathematical and an applied standpoint. It states that no estimator based on the network topology can perform substantially better than chance on sparse graphs if the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, this fundamental limitation disappears completely. We prove that with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structures. Our work brings a new perspective to stochastic network models and to semidefinite programming research.
    PCA-Boosted Autoencoders for Nonlinear Dimensionality Reduction in Low Data Regimes. (arXiv:2205.11673v1 [cs.LG])
    Autoencoders (AE) provide a useful method for nonlinear dimensionality reduction but are ill-suited for low data regimes. Conversely, Principal Component Analysis (PCA) is data-efficient but limited to linear dimensionality reduction, posing a problem when data exhibit inherent nonlinearity. This presents a challenge in various scientific and engineering domains such as nanophotonic component design, where data exhibit nonlinear features while being expensive to obtain due to costly real measurements or resource-consuming solutions of partial differential equations. To address this difficulty, we propose a technique that harnesses the best of both worlds: an autoencoder that leverages PCA to perform well on scarce nonlinear data. Specifically, we outline a numerically robust PCA-based initialization of AE, which, together with the parameterized ReLU activation function, allows the training process to start from an exact PCA solution and improve upon it. A synthetic example is presented first to study the effects of data nonlinearity and size on the performance of the proposed method. We then evaluate our method on several nanophotonic component design problems where obtaining useful data is expensive. To demonstrate universality, we also apply it to tasks in other scientific domains: a benchmark breast cancer dataset and a gene expression dataset. We show that our proposed approach is substantially better than both PCA and randomly initialized AE in the majority of low-data regime cases we consider, or at least comparable to the better of the two methods.
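    A minimal NumPy sketch of the initialization idea (the paper's full scheme also sets parameterized-ReLU slopes so the network starts exactly linear; only the weight part is shown here, under assumed shapes):

        import numpy as np

        def pca_init(X, d):
            # Encoder/decoder weights from the top-d principal directions, so
            # the untrained autoencoder reproduces the PCA solution at the start.
            Xc = X - X.mean(axis=0)
            _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
            W_enc = Vt[:d]        # encode: z = (x - mean) @ W_enc.T
            W_dec = Vt[:d].T      # decode: x_hat = z @ W_dec.T + mean
            return W_enc, W_dec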
    Machine Learning for Electricity Market Clearing. (arXiv:2205.11641v1 [eess.SY])
    This paper seeks to design a machine learning twin of the optimal power flow (OPF) optimization, which is used in market-clearing procedures by wholesale electricity markets. The motivation for the proposed approach stems from the need to obtain a digital twin that is much faster than the original while also being sufficiently accurate, producing consistent generation dispatches and locational marginal prices (LMPs), which are the primal and dual solutions of the OPF optimization, respectively. Availability of market-clearing tools based on this approach will enable computationally tractable evaluation of multiple dispatch scenarios under a given unit commitment. Rather than solving OPF directly, the Karush-Kuhn-Tucker (KKT) conditions for the OPF problem in question may be written, and in parallel the LMPs of generators and loads may be expressed in terms of the OPF Lagrangian multipliers. Taking advantage of the practical fact that many of the Lagrangian multipliers associated with lines will be zero (thermal limits are not binding), we build and train an ML scheme that maps flexible resources (loads and renewables) to the binding lines, and supplement it with an efficient power-grid-aware linear map to the optimal dispatch and LMPs. The scheme is validated and illustrated on IEEE models. We also report a trade-off analysis between the quality of the reconstruction and the number of samples needed to train the model.
    FedSA: Accelerating Intrusion Detection in Collaborative Environments with Federated Simulated Annealing. (arXiv:2205.11519v1 [cs.CR])
    Fast identification of new network attack patterns is crucial for improving network security. Nevertheless, identifying an ongoing attack in a heterogeneous network is a non-trivial task. Federated learning emerges as a solution to collaborative training for an Intrusion Detection System (IDS). The federated learning-based IDS trains a global model using local machine learning models provided by federated participants without sharing local data. However, optimization challenges are intrinsic to federated learning. This paper proposes the Federated Simulated Annealing (FedSA) metaheuristic to select the hyperparameters and a subset of participants for each aggregation round in federated learning. FedSA optimizes hyperparameters linked to the global model convergence. The proposal reduces aggregation rounds and speeds up convergence. Thus, FedSA accelerates learning extraction from local models, requiring fewer IDS updates. The proposal assessment shows that the FedSA global model converges in less than ten communication rounds. Compared with the conventional aggregation approach, the proposal requires up to 50% fewer aggregation rounds to achieve approximately 97% accuracy in attack detection.
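    The metaheuristic itself is standard; a generic simulated-annealing loop like the sketch below (hypothetical scoring function and cooling schedule, not the paper's exact settings) selects a configuration, i.e., hyperparameters plus a participant subset, per aggregation round:

        import math
        import random

        def simulated_annealing(candidates, score, t0=1.0, cooling=0.95, steps=50):
            current = random.choice(candidates)
            cur_s = score(current)
            best, best_s = current, cur_s
            temp = t0
            for _ in range(steps):
                proposal = random.choice(candidates)   # neighbor proposal
                s = score(proposal)
                # Accept improvements always; accept worse moves with
                # temperature-scaled probability to escape local optima.
                if s > cur_s or random.random() < math.exp((s - cur_s) / temp):
                    current, cur_s = proposal, s
                    if cur_s > best_s:
                        best, best_s = current, cur_s
                temp *= cooling
            return best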
    Forecasting of Non-Stationary Sales Time Series Using Deep Learning. (arXiv:2205.11636v1 [cs.LG])
    The paper describes a deep learning approach for forecasting non-stationary time series using a time-trend correction in the neural network model. Along with the layers for predicting sales values, the neural network model includes a subnetwork block that predicts the weight of a time-trend term, which is added to the predicted sales value. The time-trend term is the product of the predicted weight and a normalized time value. The results show that forecasting accuracy can be substantially improved for non-stationary sales with time trends by using the trend-correction block in the deep learning model.
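    The architecture can be sketched directly from this description: one head predicts the base sales value, another predicts the weight of the time-trend term added to it (a minimal PyTorch rendering under assumed layer sizes, not the authors' code):

        import torch
        import torch.nn as nn

        class TrendCorrectedForecaster(nn.Module):
            def __init__(self, n_features):
                super().__init__()
                self.sales = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                           nn.Linear(32, 1))
                self.trend_weight = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(),
                                                  nn.Linear(16, 1))

            def forward(self, x, t_norm):
                # prediction = base sales + (predicted weight) * (normalized time)
                return self.sales(x) + self.trend_weight(x) * t_norm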
    An interpretation of the final fully connected layer. (arXiv:2205.11908v1 [cs.LG])
    In recent years neural networks have achieved state-of-the-art accuracy for various tasks, but the interpretation of the generated outputs still remains difficult. In this work we provide a method to understand the learnt weights in the final fully connected layer of image classification models. We motivate our method by drawing a connection between the policy gradient objective in RL and the supervised learning objective. We suggest that the commonly used cross-entropy-based supervised learning objective can be regarded as a special case of the policy gradient objective. Using this insight we propose a method to find the most discriminative and confusing parts of an image. Our method does not make any prior assumption about the neural network architecture and has low computational cost. We apply our method to publicly available pre-trained models and report the generated results.
    Identifying (anti-)skyrmions while they form. (arXiv:2205.11535v1 [cond-mat.str-el])
    We use a Convolutional Neural Network (CNN) to identify the relevant features in the thermodynamical phases of a simulated three-dimensional spin-lattice system with ferromagnetic and Dzyaloshinskii-Moriya (DM) interactions. Such features include (anti-)skyrmions, merons, and helical and ferromagnetic states. We use a multi-label classification framework, which is flexible enough to accommodate states that mix different features and phases. We then train the CNN to predict the features of the final state from snapshots of intermediate states of the simulation. The trained model allows identifying the different phases reliably and early in the formation process. Thus, the CNN can significantly speed up the phase diagram calculations by predicting the final phase before the spin-lattice Monte Carlo sampling has converged. We show the prowess of this approach by generating phase diagrams with significantly shorter simulation times.
    Learning Context-Aware Service Representation for Service Recommendation in Workflow Composition. (arXiv:2205.11771v1 [cs.SE])
    As increasingly more software services have been published onto the Internet, it remains a significant challenge to recommend suitable services to facilitate scientific workflow composition. This paper proposes a novel NLP-inspired approach to recommending services throughout a workflow development process, based on incrementally learning latent service representations from workflow provenance. A workflow composition process is formalized as a step-wise, context-aware service generation procedure, which is mapped to next-word prediction in a natural language sentence. Historical service dependencies are extracted from workflow provenance to build and enrich a knowledge graph. Each path in the knowledge graph reflects a scenario in a data analytics experiment, which is analogous to a sentence in a conversation. All paths are thus formalized as composable service sequences and are mined, using various patterns, from the established knowledge graph to construct a corpus. Service embeddings are then learned by applying a deep learning model from the NLP field. Extensive experiments on a real-world dataset demonstrate the effectiveness and efficiency of the approach.
    G-Rep: Gaussian Representation for Arbitrary-Oriented Object Detection. (arXiv:2205.11796v1 [cs.CV])
    Arbitrary-oriented object representations include the oriented bounding box (OBB), quadrilateral bounding box (QBB), and point set (PointSet). Each representation encounters problems that correspond to its characteristics, such as boundary discontinuity, the square-like problem, representation ambiguity, and isolated points, which lead to inaccurate detection. Although many effective strategies have been proposed for various representations, there is still no unified solution. Current detection methods based on Gaussian modeling have demonstrated the possibility of breaking this dilemma; however, they remain limited to OBB. To go further, in this paper, we propose a unified Gaussian representation called G-Rep to construct Gaussian distributions for OBB, QBB, and PointSet, which achieves a unified solution to various representations and problems. Specifically, PointSet- or QBB-based objects are converted into Gaussian distributions, and their parameters are optimized using the maximum likelihood estimation algorithm. Then, three optional Gaussian metrics are explored to optimize the regression loss of the detector because of their excellent parameter optimization mechanisms. Furthermore, we also use Gaussian metrics for sampling to align label assignment and regression loss. Experimental results on several publicly available datasets, DOTA, HRSC2016, UCAS-AOD, and ICDAR2015, show the excellent performance of the proposed method for arbitrary-oriented object detection. The code has been open sourced at https://github.com/open-mmlab/mmrotate.
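    The conversion step is essentially moment matching; as a simplified illustration (the paper optimizes the parameters by maximum likelihood inside the detector), a 2-D point set maps to a Gaussian via its empirical mean and covariance:

        import numpy as np

        def pointset_to_gaussian(points):
            # points: (m, 2) array of 2-D points describing an oriented object.
            mu = points.mean(axis=0)
            centered = points - mu
            sigma = centered.T @ centered / len(points)   # MLE covariance
            return mu, sigma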
    Identifying Patient-Specific Root Causes of Disease. (arXiv:2205.11627v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients. As a result, hypothesis tests comparing all patients to all healthy controls can detect many significant variables with inconsequential effect sizes. A few highly predictive root causes may nevertheless generate disease within each patient. In this paper, we define patient-specific root causes as variables subject to exogenous "shocks" which go on to perturb an otherwise healthy system and induce disease. In other words, the variables are associated with the exogenous errors of a structural equation model (SEM), and these errors predict a downstream diagnostic label. We quantify predictivity using sample-specific Shapley values. This derivation allows us to develop a fast algorithm called Root Causal Inference for identifying patient-specific root causes by extracting the error terms of a linear SEM and then computing the Shapley value associated with each error. Experiments highlight considerable improvements in accuracy because the method uncovers root causes that may have large effect sizes at the individual level but clinically insignificant effect sizes at the group level. An R implementation is available at github.com/ericstrobl/RCI.
  • Open

    Forecasting Multilinear Data via Transform-Based Tensor Autoregression. (arXiv:2205.12201v1 [cs.LG])
    In the era of big data, there is an increasing demand for new methods for analyzing and forecasting 2-dimensional data. The current research aims to accomplish these goals through the combination of time-series modeling and multilinear algebraic systems. We extend previous autoregressive techniques to forecast multilinear data, yielding the L-Transform Tensor AutoRegressive (L-TAR) model. Tensor decompositions and multilinear tensor products have made this a feasible approach to forecasting. We achieve statistical independence between the columns of the observations through invertible discrete linear transforms, enabling a divide-and-conquer approach. We present an experimental validation of the proposed methods on datasets containing image collections, video sequences, sea surface temperature measurements, stock prices, and networks.
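    A heavily simplified sketch of the transform-then-decouple idea (a scalar AR(1) per frequency slice after a DFT, on a placeholder tensor; the paper's model is considerably richer than this):

        import numpy as np

        X = np.random.randn(8, 16, 100)      # (rows, cols, time) placeholder
        Xf = np.fft.fft(X, axis=1)           # invertible transform decouples columns

        coeffs = []
        for j in range(Xf.shape[1]):
            past, future = Xf[:, j, :-1].ravel(), Xf[:, j, 1:].ravel()
            # Least-squares AR(1) coefficient for this frequency slice.
            a = (past.conj() @ future) / (past.conj() @ past)
            coeffs.append(a)

        # Forecast one step ahead in the transform domain, then invert.
        Xf_next = np.stack([coeffs[j] * Xf[:, j, -1] for j in range(Xf.shape[1])], axis=1)
        X_next = np.fft.ifft(Xf_next, axis=1).real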
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v2 [cs.LG] UPDATED)
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
    Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction. (arXiv:2202.11559v2 [physics.comp-ph] UPDATED)
    Parameter reconstructions are indispensable in metrology. Here, the objective is to explain $K$ experimental measurements by fitting to them a parameterized model of the measurement process. The model parameters are regularly determined by least-square methods, i.e., by minimizing the sum of the squared residuals between the $K$ model predictions and the $K$ experimental observations, $\chi^2$. The model functions often involve computationally demanding numerical simulations. Bayesian optimization methods are specifically suited for minimizing expensive model functions. However, in contrast to least-square methods such as the Levenberg-Marquardt algorithm, they only take the value of $\chi^2$ into account and neglect the $K$ individual model outputs. We present a Bayesian target-vector optimization scheme with improved performance over previous developments that considers all $K$ contributions of the model function and that is specifically suited for parameter reconstruction problems, which are often based on hundreds of observations. Its performance is compared to established methods on an optical metrology reconstruction problem and two synthetic least-squares problems. The proposed method outperforms established optimization methods. It also enables accurate uncertainty estimates to be determined with very few observations of the actual model function by using Markov chain Monte Carlo sampling on a trained surrogate model.
    SepIt Approaching a Single Channel Speech Separation Bound. (arXiv:2205.11801v1 [eess.AS])
    We present an upper bound for the Single Channel Speech Separation task, which is based on an assumption regarding the nature of short segments of speech. Using the bound, we are able to show that while recent methods have made significant progress for a few speakers, there is room for improvement for five and ten speakers. We then introduce a deep neural network, SepIt, that iteratively improves the different speakers' estimation. At test time, SepIt has a varying number of iterations per test sample, based on a mutual information criterion that arises from our analysis. In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
    One-Pixel Shortcut: on the Learning Preference of Deep Neural Networks. (arXiv:2205.12141v1 [cs.LG])
    Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs. Error-minimizing noise, which is injected into clean data, is one of the most successful methods for preventing DNNs from giving correct predictions on incoming new data. Nonetheless, under specific training strategies such as adversarial training, the unlearnability of error-minimizing noise will severely degrade. In addition, the transferability of error-minimizing noise is inherently limited by the mismatch between the generator model and the targeted learner model. In this paper, we investigate the mechanism of unlearnable examples and propose a novel model-free method, named \emph{One-Pixel Shortcut}, which only perturbs a single pixel of each image and makes the dataset unlearnable. Our method requires much less computation and obtains stronger transferability, and thus can protect data from a wide range of different models. Based on this, we further introduce the first unlearnable dataset called CIFAR-10-S, which is indistinguishable from normal CIFAR-10 by human observers and can serve as a benchmark for different models or training strategies to evaluate their abilities to extract critical features from the disturbance of non-semantic representations. The original error-minimizing ULEs lose efficiency under adversarial training, where the model can reach over 83\% clean test accuracy. Meanwhile, even if adversarial training and strong data augmentation like RandAugment are applied together, the model trained on CIFAR-10-S cannot get over 50\% clean test accuracy.
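    In spirit, the perturbation is as simple as it sounds; the sketch below (with hypothetical fixed per-class pixel positions, whereas the paper searches for effective ones) overwrites a single pixel per image:

        import numpy as np

        def one_pixel_shortcut(images, labels, class_pixel, value=1.0):
            # images: (n, H, W) array; class_pixel maps label -> (row, col).
            out = images.copy()
            for i, y in enumerate(labels):
                r, c = class_pixel[y]
                out[i, r, c] = value   # a single-pixel, class-consistent shortcut
            return out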
    Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey. (arXiv:2101.00734v2 [stat.ML] UPDATED)
    This is a tutorial and survey paper on factor analysis, probabilistic Principal Component Analysis (PCA), variational inference, and the Variational Autoencoder (VAE). These methods, which are tightly related, are dimensionality reduction and generative models. They assume that every data point is generated from or caused by a low-dimensional latent factor. By learning the parameters of the distribution of the latent space, the corresponding low-dimensional factors are found for the sake of dimensionality reduction. Owing to their stochastic and generative behaviour, these models can also be used to generate new data points in the data space. In this paper, we first start with variational inference, where we derive the Evidence Lower Bound (ELBO) and Expectation Maximization (EM) for learning the parameters. Then, we introduce factor analysis, derive its joint and marginal distributions, and work out its EM steps. Probabilistic PCA is then explained as a special case of factor analysis, and its closed-form solutions are derived. Finally, VAE is explained, where the encoder, decoder, and sampling from the latent space are introduced. Training VAE using both EM and backpropagation is explained.
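    For reference, the central object in the variational-inference part is the Evidence Lower Bound, which in standard notation (a textbook summary, not a quotation from the paper) reads:

        \log p(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
        \;-\; \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big) \;=:\; \mathrm{ELBO}(q).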
    On the identifiability of mixtures of ranking models. (arXiv:2201.13132v2 [cs.LG] UPDATED)
    Mixtures of ranking models are standard tools for ranking problems. However, even the fundamental question of parameter identifiability is not fully understood: the identifiability of a mixture model with two Bradley-Terry-Luce (BTL) components has remained open. In this work, we show that popular mixtures of ranking models with two components (BTL, multinomial logistic models with slates of size 3, or Plackett-Luce) are generically identifiable, i.e., the ground-truth parameters can be identified except when they are from a pathological subset of measure zero. We provide a framework for verifying the number of solutions in a general family of polynomial systems using algebraic geometry, and apply it to these mixtures of ranking models to establish generic identifiability. The framework can be applied more broadly to other learning models and may be of independent interest.
    Random Feature Amplification: Feature Learning and Generalization in Neural Networks. (arXiv:2202.07626v2 [cs.LG] UPDATED)
    In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.
    Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable. (arXiv:2205.11716v1 [cs.LG])
    Recently, neural networks have been shown to perform exceptionally well in transforming two arbitrary sets into two linearly separable sets. Doing this with a randomly initialized neural network is of immense interest because the associated computation is cheaper than using fully trained networks. In this paper, we show that, with sufficient width, a randomly initialized one-layer neural network transforms two sets into two linearly separable sets with high probability. Furthermore, we provide explicit bounds on the required width of the neural network for this to occur. Our first bound is exponential in the input dimension and polynomial in all other parameters, while our second bound is independent of the input dimension, thereby overcoming the curse of dimensionality. We also perform an experimental study comparing the separation capacity of randomly initialized one-layer and two-layer neural networks. With correctly chosen biases, our study shows that, for low-dimensional data, the two-layer neural network outperforms the one-layer network. However, the opposite is observed for higher-dimensional data.
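    The claim is easy to probe empirically; a small experiment in this spirit (arbitrary synthetic data and width, not the paper's setup) checks whether random one-layer ReLU features make a non-separable set separable:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 2))
        y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # not linearly separable

        width = 2000
        W = rng.normal(size=(2, width))
        b = rng.normal(size=width)
        H = np.maximum(X @ W + b, 0.0)   # random one-layer ReLU features

        clf = LogisticRegression(max_iter=5000).fit(H, y)
        print("training accuracy:", clf.score(H, y))   # approaches 1 as width grows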
    Weak Convergence of Approximate reflection coupling and its Application to Non-convex Optimization. (arXiv:2205.11970v1 [math.PR])
    In this paper, we propose a weak approximation of the reflection coupling (RC) for stochastic differential equations (SDEs), and prove it converges weakly to the desired coupling. In contrast to the RC, the proposed approximate reflection coupling (ARC) need not take into consideration the hitting time of the processes to the diagonal set and can be defined as the solution of some SDEs on the whole time interval. Therefore, ARC can work effectively for SDEs with different drift terms. As an application of ARC, an evaluation of the effectiveness of stochastic gradient descent in a non-convex setting is also described. For sample size $n$, step size $\eta$, and batch size $B$, we derive uniform-in-time evaluations of orders $n^{-1}$, $\eta^{1/2}$, and $\sqrt{(n - B) / B (n - 1)}$, respectively.
    Learning Interacting Dynamical Systems with Latent Gaussian Process ODEs. (arXiv:2205.11894v1 [cs.LG])
    We study for the first time uncertainty-aware modeling of continuous-time dynamics of interacting objects. We introduce a new model that decomposes independent dynamics of single objects accurately from their interactions. By employing latent Gaussian process ordinary differential equations, our model infers both independent dynamics and their interactions with reliable uncertainty estimates. In our formulation, each object is represented as a graph node and interactions are modeled by accumulating the messages coming from neighboring objects. We show that efficient inference of such a complex network of variables is possible with modern variational sparse Gaussian process inference techniques. We empirically demonstrate that our model improves the reliability of long-term predictions over neural network based alternatives and it successfully handles missing dynamic or static information. Furthermore, we observe that only our model can successfully encapsulate independent dynamics and interaction information in distinct functions and show the benefit from this disentanglement in extrapolation scenarios.
    History Compression via Language Models in Reinforcement Learning. (arXiv:2205.12258v1 [cs.LG])
    In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training of the Transformer, we introduce FrozenHopfield, which automatically associates observations with original token embeddings. To form these associations, a modern Hopfield network stores the original token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
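    The FrozenHopfield association can be sketched in a few lines: project the observation with a random but fixed matrix, then retrieve a softmax-weighted mixture of the stored token embeddings (dimensions and the inverse temperature are placeholder choices, not the paper's settings):

        import numpy as np

        def frozen_hopfield(obs, token_emb, proj, beta=8.0):
            # token_emb: (vocab, dim) frozen token embeddings; proj: (obs_dim, dim)
            # random but fixed projection mapping observations to queries.
            q = obs @ proj
            logits = beta * (token_emb @ q)
            w = np.exp(logits - logits.max())
            w /= w.sum()
            return w @ token_emb   # retrieved representation for the memory module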
    Bellman-consistent Pessimism for Offline Reinforcement Learning. (arXiv:2106.06926v5 [cs.LG] UPDATED)
    The use of pessimism when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging by precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness, as is standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation, where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
    MAGMA: Inference and Prediction with Multi-Task Gaussian Processes. (arXiv:2007.10731v2 [stat.CO] UPDATED)
    A novel multi-task Gaussian process (GP) framework is proposed, using a common mean process to share information across tasks. In particular, we investigate the problem of time series forecasting, with the objective of improving multiple-step-ahead predictions. The common mean process is defined as a GP for which the hyper-posterior distribution is tractable. Therefore an EM algorithm is derived for handling both hyper-parameter optimisation and hyper-posterior computation. Unlike previous approaches in the literature, the model fully accounts for uncertainty and can handle irregular grids of observations while maintaining explicit formulations, by modelling the mean process in a unified GP framework. Predictive analytical equations are provided, integrating information shared across tasks through a relevant prior mean. This approach greatly improves the predictive performance, even far from observations, and may significantly reduce the computational complexity compared to traditional multi-task GP models. Our overall algorithm is called \textsc{Magma} (standing for Multi tAsk Gaussian processes with common MeAn). The quality of the mean process estimation, predictive performance, and comparisons to alternatives are assessed in various simulated scenarios and on real datasets.
    Quasi-Equivalence of Width and Depth of Neural Networks. (arXiv:2002.02515v7 [cs.LG] UPDATED)
    While classic studies proved that wide networks allow universal approximation, recent research and successes of deep learning demonstrate the power of deep networks. Based on a symmetric consideration, we investigate if the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks in two aspects. First, we formulate two transforms for mapping an arbitrary ReLU network to a wide network and a deep network respectively for either regression or classification so that the essentially same capability of the original network can be implemented. Then, we replace the mainstream artificial neuron type with a quadratic counterpart, and utilize the factorization and continued fraction representations of the same polynomial function to construct a wide network and a deep network, respectively. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.
    Nonnegative Tensor Completion via Integer Optimization. (arXiv:2111.04580v2 [cs.LG] UPDATED)
    Unlike matrix completion, tensor completion does not have an algorithm that is known to achieve the information-theoretic sample complexity rate. This paper develops a new algorithm for the special case of completion for nonnegative tensors. We prove that our algorithm converges in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Our approach is to define a new norm for nonnegative tensors using the gauge of a particular 0-1 polytope; integer linear programming can, in turn, be used to solve linear separation problems over this polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through computational experiments using a laptop on tensors with up to one-hundred million entries.
    Stochastic Neural Networks with Infinite Width are Deterministic. (arXiv:2201.12724v2 [cs.LG] UPDATED)
    This work theoretically studies stochastic neural networks, a main type of neural network in use. We prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Our theory justifies the common intuition that adding stochasticity to the model can help regularize the model by introducing an averaging effect. Two common examples that our theory can be relevant to are neural networks with dropout and Bayesian latent variable models in a special limit. Our result thus helps better understand how stochasticity affects the learning of neural networks and potentially design better architectures for practical problems.
    From Predictions to Decisions: The Importance of Joint Predictive Distributions. (arXiv:2107.09224v3 [cs.LG] UPDATED)
    A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.
    Logarithmic regret bounds for continuous-time average-reward Markov decision processes. (arXiv:2205.11168v2 [cs.LG] UPDATED)
    We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
    Efficient and Robust Algorithms for Adversarial Linear Contextual Bandits. (arXiv:2002.00287v3 [cs.LG] UPDATED)
    We consider an adversarial variant of the classic $K$-armed linear contextual bandit problem where the sequence of loss functions associated with each arm is allowed to change without restriction over time. Under the assumption that the $d$-dimensional contexts are generated i.i.d.~at random from a known distribution, we develop computationally efficient algorithms based on the classic Exp3 algorithm. Our first algorithm, RealLinExp3, is shown to achieve a regret guarantee of $\widetilde{O}(\sqrt{KdT})$ over $T$ rounds, which matches the best available bound for this problem. Our second algorithm, RobustLinExp3, is shown to be robust to misspecification, in that it achieves a regret bound of $\widetilde{O}((Kd)^{1/3}T^{2/3}) + \varepsilon \sqrt{d} T$ if the true reward function is linear up to an additive nonlinear error uniformly bounded in absolute value by $\varepsilon$. To our knowledge, our performance guarantees constitute the very first results on this problem setting.
    Not too little, not too much: a theoretical analysis of graph (over)smoothing. (arXiv:2205.12156v1 [stat.ML])
    We analyze graph smoothing with \emph{mean aggregation}, where each node successively receives the average of the features of its neighbors. Indeed, it has quickly been observed that Graph Neural Networks (GNNs), which generally follow some variant of Message-Passing (MP) with repeated aggregation, may be subject to the \emph{oversmoothing} phenomenon: by performing too many rounds of MP, the node features tend to converge to a non-informative limit. In the case of mean aggregation, for connected graphs, the node features become constant across the whole graph. At the other end of the spectrum, it is intuitively obvious that \emph{some} MP rounds are necessary, but existing analyses do not exhibit both phenomena at once: beneficial ``finite'' smoothing and oversmoothing in the limit. In this paper, we consider simplified linear GNNs and rigorously analyze two examples for which a finite number of mean aggregation steps provably improves the learning performance, before oversmoothing kicks in. We consider a latent space random graph model, where node features are partial observations of the latent variables and the graph contains pairwise relationships between them. We show that graph smoothing restores some of the lost information, up to a certain point, via two phenomena: graph smoothing shrinks non-principal directions in the data faster than principal ones, which is useful for regression, and it shrinks nodes within communities faster than it collapses the communities together, which improves classification.
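    The smoothing operator in question is just repeated neighbor averaging; a NumPy sketch makes the limiting behavior visible (generic adjacency matrix, assumed to have no isolated nodes):

        import numpy as np

        def mean_aggregate(A, X, rounds):
            # Each round replaces node features by the mean of their neighbors.
            deg = A.sum(axis=1, keepdims=True)   # assumes every node has a neighbor
            for _ in range(rounds):
                X = (A @ X) / deg
            return X
        # A few rounds can denoise useful directions; many rounds oversmooth,
        # driving all rows of X toward a common value on connected graphs.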
    Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning. (arXiv:2205.12184v1 [cs.LG])
    Continuous-time reinforcement learning offers an appealing formalism for describing control problems in which the passage of time is not naturally divided into discrete increments. Here we consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time, stochastic environment. Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, learning state representations, multiagent coordination, and more. We begin by establishing the distributional analogue of the Hamilton-Jacobi-Bellman (HJB) equation for It\^o diffusions and the broader class of Feller-Dynkin processes. We then specialize this equation to the setting in which the return distribution is approximated by $N$ uniformly-weighted particles, a common design choice in distributional algorithms. Our derivation highlights additional terms due to statistical diffusivity which arise from the proper handling of distributions in the continuous-time setting. Based on this, we propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm. We demonstrate the effectiveness of such an algorithm in a synthetic control problem.
    Stereographic Markov Chain Monte Carlo. (arXiv:2205.12112v1 [stat.CO])
    High dimensional distributions, especially those with heavy tails, are notoriously difficult for off-the-shelf MCMC samplers: the combination of unbounded state spaces, diminishing gradient information, and local moves, results in empirically observed "stickiness" and poor theoretical mixing properties -- lack of geometric ergodicity. In this paper, we introduce a new class of MCMC samplers that map the original high dimensional problem in Euclidean space onto a sphere and remedy these notorious mixing problems. In particular, we develop random-walk Metropolis type algorithms as well as versions of Bouncy Particle Sampler that are uniformly ergodic for a large class of light and heavy-tailed distributions and also empirically exhibit rapid convergence in high dimensions. In the best scenario, the proposed samplers can enjoy the ``blessings of dimensionality'' that the mixing time decreases with dimension.
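    The underlying change of variables is the classical stereographic map; a NumPy sketch of the bijection used to move the problem onto the sphere (the samplers themselves involve considerably more machinery):

        import numpy as np

        def to_sphere(x):
            # Inverse stereographic projection: R^d -> unit sphere in R^(d+1).
            n2 = float(x @ x)
            return np.append(2.0 * x, n2 - 1.0) / (n2 + 1.0)

        def from_sphere(z):
            # Stereographic projection back to R^d; undefined at the north pole.
            return z[:-1] / (1.0 - z[-1])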
    Dimension-agnostic inference using cross U-statistics. (arXiv:2011.05068v4 [math.ST] UPDATED)
    Classical asymptotic theory for statistical inference usually involves calibrating a statistic by fixing the dimension $d$ while letting the sample size $n$ increase to infinity. Recently, much effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where $d$ and $n$ both increase to infinity together. This often leads to different inference procedures, depending on the assumptions about the dimensionality, leaving the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming $n \gg d$, or $d/n \approx 0.2$? This paper considers the goal of dimension-agnostic inference; developing methods whose validity does not depend on any assumption on $d$ versus $n$. We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization to produce a new test statistic with a Gaussian limiting distribution. The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks. We exemplify our technique for a handful of classical problems including one-sample mean and covariance testing. Our tests are shown to have minimax rate-optimal power against appropriate local alternatives, and their power is optimal up to a $\sqrt 2$ factor. We end by suggesting some next steps for extending dimension-agnostic inference to other problems.
    Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold. (arXiv:2205.11677v1 [stat.ML])
    The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of extensive study on the problem have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting both from a mathematical and an applied standpoint. It states that no estimator based on the network topology can perform substantially better than chance on sparse graphs if the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, such a fundamental limitation disappears completely. We prove that with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structures. Our work brings a new perspective to stochastic models of networks and to semidefinite programming research.  ( 2 min )
    Eigenvalue and Generalized Eigenvalue Problems: Tutorial. (arXiv:1903.11240v2 [stat.ML] UPDATED)
    This paper is a tutorial for eigenvalue and generalized eigenvalue problems. We first introduce the eigenvalue problem, eigen-decomposition (spectral decomposition), and the generalized eigenvalue problem. Then, we mention the optimization problems which give rise to eigenvalue and generalized eigenvalue problems. We also provide examples from machine learning, including principal component analysis, kernel supervised principal component analysis, and Fisher discriminant analysis, which result in eigenvalue and generalized eigenvalue problems. Finally, we introduce the solutions to both eigenvalue and generalized eigenvalue problems.  ( 2 min )
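    As a quick illustration of the generalized problem $A v = \lambda B v$ (the form that the machine learning examples above reduce to), SciPy solves it directly:

        import numpy as np
        from scipy.linalg import eigh

        rng = np.random.default_rng(0)
        A = rng.normal(size=(5, 5)); A = A + A.T                   # symmetric
        M = rng.normal(size=(5, 5)); B = M @ M.T + 5 * np.eye(5)   # positive definite

        w, V = eigh(A, B)        # solves the generalized problem A v = w B v
        v = V[:, -1]             # eigenvector of the largest eigenvalue
        assert np.allclose(A @ v, w[-1] * B @ v)

    Setting B to the identity recovers the ordinary eigenvalue problem as a special case.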
    Bayesian Calibration of imperfect computer models using Physics-informed priors. (arXiv:2201.06463v2 [stat.ML] UPDATED)
    We introduce a computationally efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models are often imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. To obtain the posterior distributions, we use Hamiltonian Monte Carlo sampling. We demonstrate our approach in a simulation study with hemodynamical models, which are time-dependent differential equations. Data are simulated from a more complex model than our modelling choice, and the aim is to learn physical parameters according to known mathematical connections. To demonstrate the flexibility of our approach, we also include an example using the heat equation, a space-time-dependent differential equation, where we consider a biased data-acquisition process. Finally, we fit the hemodynamic model using real data obtained in a medical trial.  ( 2 min )
    Risk-Sensitive Reinforcement Learning via Policy Gradient Search. (arXiv:1810.09126v3 [cs.LG] UPDATED)
    The objective in a traditional reinforcement learning (RL) problem is to find a policy that optimizes the expected value of a performance metric such as the infinite-horizon cumulative discounted or long-run average cost/reward. In practice, optimizing the expected value alone may not be satisfactory, in that it may be desirable to incorporate the notion of risk into the optimization problem formulation, either in the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., exponential utility, variance, percentile performance, chance constraints, value at risk (quantile), conditional value-at-risk, prospect theory and its later enhancement, cumulative prospect theory. In this book, we consider risk-sensitive RL in two settings: one where the goal is to find a policy that optimizes the usual expected value objective while ensuring that a risk constraint is satisfied, and the other where the risk measure is the objective. We survey some of the recent work in this area specifically where policy gradient search is the solution approach. In the first risk-sensitive RL setting, we cover popular risk measures based on variance, conditional value-at-risk, and chance constraints, and present a template for policy gradient-based risk-sensitive RL algorithms using a Lagrangian formulation. For the setting where risk is incorporated directly into the objective function, we consider an exponential utility formulation, cumulative prospect theory, and coherent risk measures. This non-exhaustive survey aims to give a flavor of the challenges involved in solving risk-sensitive RL problems using policy gradient methods, as well as outlining some potential future research directions.  ( 2 min )
    Optimality Conditions and Algorithms for Top-K Arm Identification. (arXiv:2205.12086v1 [stat.ML])
    We consider the top-k arm identification problem for multi-armed bandits with rewards belonging to a one-parameter canonical exponential family. The objective is to select the set of k arms with the highest mean rewards by sequential allocation of sampling efforts. We propose a unified optimal allocation problem that identifies the complexity measures of this problem under the fixed-confidence and fixed-budget settings, as well as the posterior convergence rate from the Bayesian perspective. We provide the first characterization of its optimality. We provide the first provably optimal algorithm in the fixed-confidence setting for k>1. We also propose an efficient heuristic algorithm for the top-k arm identification problem. Extensive numerical experiments demonstrate superior performance compared to existing methods in all three settings.  ( 2 min )
    DeepKriging: Spatially Dependent Deep Neural Networks for Spatial Prediction. (arXiv:2007.11972v4 [stat.ML] UPDATED)
    In spatial statistics, a common objective is to predict values of a spatial process at unobserved locations by exploiting spatial dependence. Kriging provides the best linear unbiased predictor using covariance functions and is often associated with Gaussian processes. However, when considering non-linear prediction for non-Gaussian and categorical data, the Kriging prediction is no longer optimal, and the associated variance is often overly optimistic. Although deep neural networks (DNNs) are widely used for general classification and prediction, they have not been studied thoroughly for data with spatial dependence. In this work, we propose a novel DNN structure for spatial prediction, where the spatial dependence is captured by adding an embedding layer of spatial coordinates with basis functions. We show in theory and simulation studies that the proposed DeepKriging method has a direct link to Kriging in the Gaussian case, and it has multiple advantages over Kriging for non-Gaussian and non-stationary data, i.e., it provides non-linear predictions and thus has smaller approximation errors, it does not require operations on covariance matrices and thus is scalable for large datasets, and with sufficiently many hidden neurons, it provides the optimal prediction in terms of model capacity. We further explore the possibility of quantifying prediction uncertainties based on density prediction without assuming any data distribution. Finally, we apply the method to predicting PM2.5 concentrations across the continental United States.  ( 2 min )
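    A minimal sketch of the coordinate-embedding idea as described, assuming Keras and Gaussian radial basis functions on a grid of knots (the paper's exact basis construction and network architecture may differ):

        import numpy as np
        import tensorflow as tf

        def basis_embed(coords, knots, bandwidth=0.2):
            # Map each 2-D location to K radial-basis features; the network
            # then learns a (possibly non-linear) function of these features.
            d2 = ((coords[:, None, :] - knots[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * bandwidth**2))

        g = np.linspace(0, 1, 10)
        knots = np.array(np.meshgrid(g, g)).reshape(2, -1).T   # 100 knots

        model = tf.keras.Sequential([
            tf.keras.layers.Dense(100, activation="relu"),
            tf.keras.layers.Dense(100, activation="relu"),
            tf.keras.layers.Dense(1),                          # predicted field
        ])
        model.compile(optimizer="adam", loss="mse")
        # model.fit(basis_embed(train_coords, knots), train_y, ...)

    Because the network sees a location only through smooth basis features, nearby sites share features, which is how spatial dependence enters the fit without any covariance-matrix operations.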
    Quantum Kerr Learning. (arXiv:2205.12004v1 [quant-ph])
    Quantum machine learning is a rapidly evolving area that could facilitate important applications for quantum computing and significantly impact data science. In our work, we argue that a single Kerr mode might provide extra quantum enhancements when using quantum kernel methods, for reasons drawn from complexity theory and physics. Furthermore, we establish an experimental protocol, which we call \emph{quantum Kerr learning}, based on circuit QED. A detailed study using the kernel method, neural tangent kernel theory, first-order perturbation theory of the Kerr non-linearity, and non-perturbative numerical simulations shows that quantum enhancements could happen in terms of the convergence time and the generalization error, while explicit protocols are also constructed for higher-dimensional input data.  ( 2 min )
    Byzantine-Robust Federated Learning with Optimal Statistical Rates and Privacy Guarantees. (arXiv:2205.11765v1 [cs.LG])
    We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates. In contrast to prior work, our proposed protocols improve the dimension dependence and achieve a tight statistical rate in terms of all the parameters for strongly convex losses. We benchmark against competing protocols and show the empirical superiority of the proposed protocols. Finally, we remark that our protocols with bucketing can be naturally combined with privacy-guaranteeing procedures to introduce security against a semi-honest server. The code for evaluation is provided in https://github.com/wanglun1996/secure-robust-federated-learning.  ( 2 min )
    Throwing Away Data Improves Worst-Class Error in Imbalanced Classification. (arXiv:2205.11672v1 [stat.ML])
    Class imbalances pervade classification problems, yet their treatment differs in theory and practice. On the one hand, learning theory instructs us that \emph{more data is better}, as sample size relates inversely to the average test error over the entire data distribution. On the other hand, practitioners have long developed a plethora of tricks to improve the performance of learning machines over imbalanced data. These include data reweighting and subsampling, synthetic construction of additional samples from minority classes, ensembling expensive one-versus-all architectures, and tweaking classification losses and thresholds. All of these are efforts to minimize the worst-class error, which is often associated with the minority group in the training data, and finds additional motivation in the robustness, fairness, and out-of-distribution literature. Here we take on the challenge of developing learning theory able to describe the worst-class error of classifiers over linearly-separable data when fitted either on (i) the full training set, or (ii) a subset where the majority class is subsampled to match in size the minority class. We borrow tools from extreme value theory to show that, under distributions with certain tail properties, \emph{throwing away most data from the majority class leads to better worst-class error}.  ( 2 min )
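    The subsampling scheme in (ii) is simple to state in code; a minimal sketch assuming binary labels:

        import numpy as np

        def subsample_majority(X, y, rng):
            # Keep the minority class intact and downsample the majority
            # class to the same size, discarding the remaining data.
            idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
            if len(idx0) < len(idx1):
                idx0, idx1 = idx1, idx0          # make idx0 the majority
            keep = np.concatenate(
                [rng.choice(idx0, size=len(idx1), replace=False), idx1])
            return X[keep], y[keep]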
    EBM Life Cycle: MCMC Strategies for Synthesis, Defense, and Density Modeling. (arXiv:2205.12243v1 [stat.ML])
    This work presents strategies to learn an Energy-Based Model (EBM) according to the desired length of its MCMC sampling trajectories. MCMC trajectories of different lengths correspond to models with different purposes. Our experiments cover three different trajectory magnitudes and learning outcomes: 1) shortrun sampling for image generation; 2) midrun sampling for classifier-agnostic adversarial defense; and 3) longrun sampling for principled modeling of image probability densities. To achieve these outcomes, we introduce three novel methods of MCMC initialization for negative samples used in Maximum Likelihood (ML) learning. With standard network architectures and an unaltered ML objective, our MCMC initialization methods alone enable significant performance gains across the three applications that we investigate. Our results include state-of-the-art FID scores for unnormalized image densities on the CIFAR-10 and ImageNet datasets; state-of-the-art adversarial defense on CIFAR-10 among purification methods and the first EBM defense on ImageNet; and scalable techniques for learning valid probability densities. Code for this project can be found at https://github.com/point0bar1/ebm-life-cycle.  ( 2 min )
    Generalization Gap in Amortized Inference. (arXiv:2205.11640v1 [stat.ML])
    The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic models - the Variational Auto-Encoder (VAE). We point out the two generalization gaps that can affect the generalization ability of VAEs and show that the over-fitting phenomenon is usually dominated by the amortized inference network. Based on this observation we propose a new training objective, inspired by the classic wake-sleep algorithm, to improve the generalization properties of amortized inference. We also demonstrate how it can improve generalization performance in the context of image modeling and lossless compression.  ( 2 min )
    Identifying Patient-Specific Root Causes of Disease. (arXiv:2205.11627v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients. As a result, hypothesis tests comparing all patients to all healthy controls can detect many significant variables with inconsequential effect sizes. A few highly predictive root causes may nevertheless generate disease within each patient. In this paper, we define patient-specific root causes as variables subject to exogenous "shocks" which go on to perturb an otherwise healthy system and induce disease. In other words, the variables are associated with the exogenous errors of a structural equation model (SEM), and these errors predict a downstream diagnostic label. We quantify predictivity using sample-specific Shapley values. This derivation allows us to develop a fast algorithm called Root Causal Inference for identifying patient-specific root causes by extracting the error terms of a linear SEM and then computing the Shapley value associated with each error. Experiments highlight considerable improvements in accuracy because the method uncovers root causes that may have large effect sizes at the individual level but clinically insignificant effect sizes at the group level. An R implementation is available at github.com/ericstrobl/RCI.  ( 2 min )
    Soft-SVM Regression For Binary Classification. (arXiv:2205.11735v1 [stat.ML])
    The binomial deviance and the SVM hinge loss functions are two of the most widely used loss functions in machine learning. While there are many similarities between them, they also have their own strengths when dealing with different types of data. In this work, we introduce a new exponential family based on a convex relaxation of the hinge loss function using softness and class-separation parameters. This new family, denoted Soft-SVM, allows us to prescribe a generalized linear model that effectively bridges between logistic regression and SVM classification. This new model is interpretable and avoids data separability issues, attaining good fitting and predictive performance by automatically adjusting for data label separability via the softness parameter. These results are confirmed empirically through simulations and case studies as we compare regularized logistic, SVM, and Soft-SVM regressions and conclude that the proposed model performs well in terms of both classification and prediction errors.  ( 2 min )
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v6 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose directed flow imbalance measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing graph neural network methods, and can naturally incorporate node features, unlike existing spectral methods. Extensive experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 state-of-the-art methods from the literature, for a wide range of noise and sparsity levels, graph structures and topologies, and even outperforms supervised methods.  ( 2 min )
    A Quadrature Rule combining Control Variates and Adaptive Importance Sampling. (arXiv:2205.11890v1 [stat.ML])
    Driven by several successful applications such as in stochastic gradient descent or in Bayesian computation, control variates have become a major tool for Monte Carlo integration. However, standard methods do not allow the distribution of the particles to evolve during the algorithm, as is the case in sequential simulation methods. Within the standard adaptive importance sampling framework, a simple weighted least squares approach is proposed to improve the procedure with control variates. The procedure takes the form of a quadrature rule with adapted quadrature weights to reflect the information brought in by the control variates. The quadrature points and weights do not depend on the integrand, a computational advantage in case of multiple integrands. Moreover, the target density needs to be known only up to a multiplicative constant. Our main result is a non-asymptotic bound on the probabilistic error of the procedure. The bound proves that for improving the estimate's accuracy, the benefits from adaptive importance sampling and control variates can be combined. The good behavior of the method is illustrated empirically on synthetic examples and real-world data for Bayesian linear regression.  ( 2 min )
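    A sketch of the weighted-least-squares form of the estimator, in my own notation rather than the paper's: given integrand values fx, a matrix hx of control variates with known zero mean, and importance weights w, the fitted intercept is the integral estimate, and the implied quadrature weights depend only on (hx, w), not on the integrand:

        import numpy as np

        def cv_quadrature(fx, hx, w):
            # Weighted least squares of f on [1, controls]; the fitted
            # intercept beta[0] is the integral estimate.
            n, m = hx.shape
            Z = np.hstack([np.ones((n, 1)), hx])   # design matrix
            W = w / w.sum()                        # normalized IS weights
            beta, *_ = np.linalg.lstsq(Z * np.sqrt(W)[:, None],
                                       fx * np.sqrt(W), rcond=None)
            return beta[0]

    Because the regression design (Z, W) is fixed once the particles are drawn, multiple integrands can reuse the same quadrature rule, which matches the computational advantage noted in the abstract.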
    Weakly-supervised Multi-output Regression via Correlated Gaussian Processes. (arXiv:2002.08412v2 [stat.ML] UPDATED)
    Multi-output regression seeks to borrow strength and leverage commonalities across different but related outputs in order to enhance learning and prediction accuracy. A fundamental assumption is that the output/group membership labels for all observations are known. This assumption is often violated in real applications. For instance, in healthcare datasets, sensitive attributes such as ethnicity are often missing or unreported. To this end, we introduce a weakly-supervised multi-output model based on dependent Gaussian processes. Our approach is able to leverage data without complete group labels or possibly only prior belief on group memberships to enhance accuracy across all outputs. Through intensive simulations and case studies on an Insulin, Testosterone and Bodyfat dataset, we show that our model excels in multi-output settings with missing labels, while being competitive in traditional fully labeled settings. We end by highlighting the possible use of our approach in fair inference and sequential decision-making.  ( 2 min )
    Advanced Manufacturing Configuration by Sample-efficient Batch Bayesian Optimization. (arXiv:2205.11827v1 [cs.LG])
    We propose a framework for the configuration and operation of expensive-to-evaluate advanced manufacturing methods, based on Bayesian optimization. The framework unifies a tailored acquisition function, a parallel acquisition procedure, and the integration of process information providing context to the optimization procedure. The novel acquisition function is demonstrated and analyzed on benchmark illustrative problems. We apply the optimization approach to atmospheric plasma spraying in simulation and experiments. Our results demonstrate that the proposed framework can efficiently find input parameters that produce the desired outcome and minimize the process cost.  ( 2 min )
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v1 [stat.ML])
    Most machine learning methods depend on the tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation, which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection method based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.  ( 2 min )
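    The role of the bandwidth is easy to see in code; a sketch of Gaussian-kernel KRR with an explicit bandwidth (the paper's closed-form, Jacobian-based selection rule would replace the cross-validation loop one normally wraps around this):

        import numpy as np

        def krr_fit_predict(Xtr, ytr, Xte, bandwidth, lam=1e-3):
            # Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 s^2));
            # the bandwidth s sets the length-scale of the inferred function.
            def K(A, B):
                d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                return np.exp(-d2 / (2 * bandwidth**2))
            alpha = np.linalg.solve(K(Xtr, Xtr) + lam * np.eye(len(Xtr)), ytr)
            return K(Xte, Xtr) @ alpha

    A very small bandwidth makes K(Xtr, Xtr) close to the identity (interpolation, ill-conditioned), while a very large one makes all entries close to 1 (over-smoothing), which is the trade-off the Jacobian expression is said to illuminate.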
    uGLAD: Sparse graph recovery by optimizing deep unrolled networks. (arXiv:2205.11610v1 [cs.LG])
    Probabilistic Graphical Models (PGMs) are generative models of complex systems. They rely on conditional independence assumptions between variables to learn sparse representations which can be visualized in the form of a graph. Such models are used for domain exploration and structure discovery in poorly understood domains. This work introduces a novel technique to perform sparse graph recovery by optimizing deep unrolled networks. Assuming that the input data $X\in\mathbb{R}^{M\times D}$ comes from an underlying multivariate Gaussian distribution, we apply a deep model on $X$ that outputs the precision matrix $\Theta$, which can also be interpreted as the adjacency matrix. Our model, uGLAD, builds upon and extends the state-of-the-art model GLAD to the unsupervised setting. The key benefits of our model are (1) uGLAD automatically optimizes sparsity-related regularization parameters, leading to better performance than existing algorithms; (2) we introduce a multi-task-learning-based `consensus' strategy for robust handling of missing data in an unsupervised setting. We evaluate model results on synthetic Gaussian data, non-Gaussian data generated from Gene Regulatory Networks, and present a case study in anaerobic digestion.  ( 2 min )
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v1 [stat.ML])
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions; we extensively discuss the Gaussian case, for which the updates have a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior involve neither gradients with respect to the model parameters nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details about its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.  ( 2 min )

  • Open

    [D] What kind of motherboard do I need to run 2x 3090?
    Hey guys, I'm a designer wanting to dip my toes into machine learning/AI. 1) Is a mATX motherboard with 1x PCIe x16 and 1x PCIe x4 good enough for dual 3090s? 2) Do I absolutely need to use NVLink between these 3090s? What are the pros and cons? submitted by /u/Aeonbreak [link] [comments]  ( 1 min )
    [D] Google Speech to Text vs building a similar capability in-house
    Hey All. I hope you're well. TL;DR I'm trying to compare the costs of building vs using existing technology, and I'd like to hear from those of you who have experience building speech or signal-processing ML/AI. I'm looking into building a SaaS feature that requires speech-to-text. As an MVP, we've been using Google's Speech-to-Text API, which is great - with a few problems: its cost and accuracy. While its accuracy is quite high, its cost, if I'm calculating correctly, would be quite high relative to our pricing. There would also be benefits to building accurate models for the specific industry we'd be using it in. Does anyone have any examples of what it would take to build something like that (costs / number of engineers) and, more importantly, to operate it (say, cost per minute in computing power / storage)? submitted by /u/Mobile_Jacket_894 [link] [comments]  ( 2 min )
    [D] Seeking advice for MLDS beginners
    I have an advanced degree in math and am familiar with Python (from my college days). When I try tutorials for SageMaker or Colab, it feels like I make progress, but I spend 90% of my time researching why something is broken or not working. So, my question: is this the nature of the industry and its software platforms at the moment? Are there any other tools/platforms that don't require constant troubleshooting for beginners? Thanks! submitted by /u/evergrowbro [link] [comments]  ( 1 min )
    [D] Training a denoising diffusion model when the forward process is unknown but we have multiple examples for each step?
    I hope the title at least kinda makes sense? More details below. The way I understand denoising diffusion models at a high level is that we simulate the forward process with a known distribution like Gaussian noise and then learn the reverse process. I am wondering if it is reasonable to try training a diffusion model when we have a dataset from which we can sample any step of the forward process, but the actual forward process itself is unknown. Basically, we have a training target for each "latent" in the diffusion process. For the sake of simplicity we can think of our dataset as "images", and we can sample any image from the forward process. E.g., we can sample the lowest-quality images, which are just random noise; medium-quality images, which contain the information we are interested in but obscured by noise or missing signal; or the highest-quality images, which are the final target of the model. We want a model which can map low/medium-quality data to denoised high-quality data; basically we are doing a super-resolution or image denoising task. We can generate any level of lower-quality data we want, but the actual pixel-level difference between steps in the forward process is much more complex than just additive Gaussian noise. I already know it is *technically* feasible to train a diffusion model on this dataset, since I am already testing it with varying levels of success. What I am really asking is whether this makes any theoretical sense and where it might fit in with the current literature on these models. Is it sufficient to just train an image-to-image model like a UNet to map each sample to the next in the Markov chain, since it might not be possible to evaluate the KL divergence w.r.t. the forward process? Or is there some way to approximate this as well? How is this any different from just training a stacked denoising autoencoder? Thanks for any insight and/or discussion! submitted by /u/dimsycamore [link] [comments]  ( 2 min )
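    For reference, the standard training step the question departs from looks roughly like this (a DDPM-style sketch in PyTorch; in the poster's setting, the q_sample line would be replaced by drawing a step-t example directly from the dataset):

        import torch
        import torch.nn.functional as F

        def ddpm_training_step(model, x0, alphas_cumprod):
            # Known forward process: q(x_t | x_0) = N(sqrt(a_t) x_0, (1 - a_t) I).
            t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
            a = alphas_cumprod[t].view(-1, 1, 1, 1)
            eps = torch.randn_like(x0)
            xt = a.sqrt() * x0 + (1 - a).sqrt() * eps   # the q_sample line
            # The reverse process is learned by predicting the injected noise.
            return F.mse_loss(model(xt, t), eps)

    Note that this simplified loss is just a regression on (x_t, target) pairs, which is part of why the poster's setup can still train even though the exact forward-process KL term is unavailable.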
    [P] Made a community for freelancing in the ML field
    Currently, freelancing in AI is still a niche and has a cold-start problem, whether you freelance on your own or go through a platform such as Upwork, Fiverr, etc. Join the community and let's explore this alternative career path. https://www.reddit.com/r/ai_freelancing/ submitted by /u/meame2010 [link] [comments]  ( 1 min )
    [D] How does the accuracy of large language models compare to average humans? How does GPT-NeoX compare?
    I saw a graph comparing GPT-3, UnifiedQA and Gopher to human experts in accuracy across various fields. https://piped.kavin.rocks/watch?v=aPiHhJjN3hI&t=3m55s I was curious to know how GPT-NeoX compares. And how do average humans perform? submitted by /u/ConsistentSense4760 [link] [comments]  ( 1 min )
    [P] Official Imagen Website by Google Brain
    https://imagen.research.google/ submitted by /u/margilly_ai [link] [comments]  ( 1 min )
    [Discussion] Is missing data still a problem?
    I'm curious to hear people's perspectives on this. As I understand it, the easiest and one of the most popular ways of "handling" missing data is to just throw away incomplete observations. Would spending time developing methods that can learn from incomplete observations still be useful? Or would you argue that, because there are so many settings in which data is so plentiful, it's perfectly fine at this point to just throw away incomplete observations? submitted by /u/vandelay_inds [link] [comments]  ( 1 min )
    [D] Visualizing loss surface in input space
    I am trying to recreate the surfaces from this paper "Interpreting Adversarial Robustness: A View from Decision Surface in Input Space" https://arxiv.org/abs/1810.00144 They briefly mention the reasoning for using input space instead of parameter space in section 2.2 but don't describe how they produce such surfaces in detail. I understand the conventional approach (http://arxiv.org/abs/1712.09913) of projecting into a 2D hyperplane with 2 random (normalized) vectors in parameter space, but struggle to think of a way to do this with input images (e.g. CIFAR10 dataset). I have also failed to find any implementations of this approach. Any advice? submitted by /u/sarfins [link] [comments]  ( 1 min )
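    One natural adaptation of the parameter-space recipe, assuming PyTorch: draw two random directions with the same shape as the input image, normalize them against the image norm, and sweep the loss over the spanned plane (whether this matches the paper's exact construction is not clear from the text):

        import torch
        import torch.nn.functional as F

        def input_loss_surface(model, x, y, span=1.0, steps=25):
            # Two random directions in input space, scaled to the image norm.
            d1, d2 = torch.randn_like(x), torch.randn_like(x)
            d1 = d1 / d1.norm() * x.norm()
            d2 = d2 / d2.norm() * x.norm()
            grid = torch.linspace(-span, span, steps)
            surface = torch.zeros(steps, steps)
            with torch.no_grad():
                for i, a in enumerate(grid):
                    for j, b in enumerate(grid):
                        logits = model(x + a * d1 + b * d2)
                        surface[i, j] = F.cross_entropy(logits, y)
            return surface    # plot with e.g. matplotlib's plot_surface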
    [P] Introducing BlindAI, an Open-source, fast and privacy-friendly AI deployment solution. Benefit from state-of-the-art AI without ever revealing your data!
    Hello everyone, we are pleased to introduce BlindAI to the AI community. BlindAI is an AI deployment solution that leverages secure enclaves to make remotely hosted AI models privacy-friendly. Please have a look at our GitHub (https://github.com/mithril-security/blindai) to find out more! Motivation: Today, most AI tools offer no privacy-by-design mechanisms, so when data is sent to be analysed by third parties, it is exposed to malicious usage or potential leakage. Consider AI voice assistants: audio recordings are often sent to the cloud to be analysed, leaving conversations exposed to leaks and uncontrolled usage without users' knowledge or consent. With BlindAI, data always remains protected, as it is only decrypted inside a Trusted Execution Environment, called an enclave, whose contents are protected by hardware. While data is in the clear inside the enclave, it is inaccessible to the outside thanks to isolation and memory encryption. This way, data can be processed, enriched, and analysed by AI without exposing it to external parties. What you can do: We have been able to run several state-of-the-art models with privacy guarantees, enabling us to tackle complex scenarios, from a privacy-friendly voice assistant with Wav2vec2, to confidential chest X-ray analysis with ResNet, to document analysis with BERT. All of these models have been tested and run with end-to-end protection in under a second on an Intel(R) Xeon(R) Platinum 8370C. Example inference times: DistilBERT (sentiment analysis) 28.435 ms; Wav2vec2 (speech to text) 617.04 ms; Facenet (facial recognition) 47.135 ms. A more detailed list of models we can deploy with privacy, with their run times, can be found here. If you like it, drop a ⭐ on our GitHub (https://github.com/mithril-security/blindai)! submitted by /u/Separate-Still3770 [link] [comments]  ( 1 min )
    [D] Best GP book
    Hi! As part of my research I'm using Gaussian process regression, but I want to understand GPs more fully. Does anyone have any book recommendations that explain them well, either whole books on them or parts of wider machine learning textbooks? Thanks submitted by /u/Ikeyt [link] [comments]  ( 1 min )
    [D] Deep Sets and Attention, what's the difference?
    I have stumbled across the relevant literature on neural networks for sets, e.g. Deep Sets, PointerNet, and subsequent works. These architectures look similar to attention mechanisms and related models (e.g. Transformers), but they are treated as a very separate thing in the literature (e.g. in the Deep Sets paper the word "attention" never appears). So, I wonder, what is the main conceptual difference between Deep Sets and attention? I'm not very experienced with attention mechanisms, so I might be missing something in the big picture here. submitted by /u/fedetask [link] [comments]  ( 1 min )
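    The contrast is easiest to see side by side; a sketch in PyTorch of the two pooling mechanisms over a set of n element embeddings (my own minimal versions, not the papers' full architectures):

        import torch
        import torch.nn as nn

        class DeepSets(nn.Module):
            # rho(sum_i phi(x_i)): every element is encoded independently,
            # then aggregated by a fixed, content-blind sum.
            def __init__(self, d, h):
                super().__init__()
                self.phi = nn.Sequential(nn.Linear(d, h), nn.ReLU())
                self.rho = nn.Linear(h, h)

            def forward(self, x):            # x: (batch, n, d)
                return self.rho(self.phi(x).sum(dim=1))

        class SelfAttentionPool(nn.Module):
            # Elements first exchange information via attention, so each
            # element's encoding depends on the rest of the set
            # (d must be divisible by heads).
            def __init__(self, d, heads=4):
                super().__init__()
                self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

            def forward(self, x):            # x: (batch, n, d)
                out, _ = self.attn(x, x, x)
                return out.mean(dim=1)

    Both are permutation-invariant, which is why the literatures overlap: the difference is whether the aggregation weights are fixed (Deep Sets) or computed from the data itself (attention).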
    [D] Recent research and methods for time series forecasting
    Recent advances in Vision and NLP are dominating the AI community at the moment. I have been trying to find out if something exciting has been done for time series forecasting recently (last five years or so). Looking for some good starting points to keep up with the latest research. Any pointers would be highly appreciated. submitted by /u/ndalal01 [link] [comments]  ( 1 min )
    [D] What are some of the better one shot learning techniques for time series data?
    I am working on a project involving one-shot learning for time series data from dynamical systems. Which ML models would be the better ones to look at for this purpose? submitted by /u/_hereforthecomments [link] [comments]  ( 1 min )
    [P] How to quantify the similarity/dissimilarity between two time series datasets?
    Hey guys, I was working with time series datasets for dynamical systems. I need to quantify how similar the datasets from two different dynamical systems are. What would be a good metric to do so? submitted by /u/_hereforthecomments [link] [comments]  ( 1 min )
    [D] Finding the papers I want in CVPR 2022
    Hello guys, when I was searching the conference for papers on object detection, I was surprised to find that there were over 2000 papers. I usually screen papers by reading the abstract, but this time is the only option to open each paper and check whether my topics of interest are there? Please let me know. If I have to do that, I will. Thanks. :) submitted by /u/Mundane_Definition_8 [link] [comments]  ( 1 min )
    [P] What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
    TL;DR We made autoregressive transformer-based models like T5-large 2X faster than 🤗 Hugging Face Pytorch with 3 simple tricks: storing 2 computation graphs in a single Onnx file 👯: this lets us have both cache and no-cache support without any duplicated weights. When the cache is used, attention switches from quadratic to linear complexity (less GPU computation) and Onnx Runtime brings us kernel fusion (fewer memory-bound ops); zero copy 💥 to retrieve output from Onnx Runtime: we leverage the Cupy API to access Onnx Runtime's internal CUDA arrays and expose them through Dlpack to Pytorch. It may sound a bit complex, but it lets us avoid output tensor copies, which limits our memory footprint and makes us much faster (check the notebook for other benefits of this approach); a generic tool to conv…  ( 4 min )
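    The zero-copy trick in the second bullet can be sketched in a few lines, assuming CuPy and PyTorch on the same CUDA device (the actual project wires this into Onnx Runtime's output buffers):

        import cupy as cp
        import torch.utils.dlpack

        # A CUDA array produced on the GPU (stand-in for an ORT output buffer).
        cp_array = cp.arange(10, dtype=cp.float32)

        # Hand the same memory to PyTorch through DLPack: no device copy is
        # made, so both objects view one CUDA allocation.
        tensor = torch.utils.dlpack.from_dlpack(cp_array.toDlpack())

        tensor += 1   # visible on the CuPy side too, proving shared memory
        assert cp.allclose(cp_array, cp.arange(1, 11, dtype=cp.float32))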
    [D] Calculating Shannon Information of Data Augmentation Strategies
    I recently caught Andrew Ng's 2021 talk on MLOps (MLOps: From Model-centric to Data-centric AI). At 26:40, he talks about calculating the effectiveness of cleaning your data (training examples) vs. collecting new examples. Apparently this is possible to calculate using Shannon Information. As an example, Ng notes that "[if you have] 500 pictures of iguanas and if 12% of the labels are noisy (incorrectly labelled), then from a Shannon Information calculation, you can show that [cleaning up the noisy data and collecting another 500 new examples] are about equally effective.". Does anyone know how exactly this might be calculated? I've tried to find more information on this but without much luck. Intuitively, it seems related to Shannon's Limit but the exact application is unclear to me. Thanks. Edit: Cross-posted to Cross Validated submitted by /u/Academy- [link] [comments]  ( 3 min )
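    One plausible reading of the calculation (an assumption on my part, not something Ng spells out) treats each noisy label as a binary symmetric channel, so a label flipped with probability p carries 1 - H(p) bits; under that reading, the two interventions come out roughly equal:

        from math import log2

        def H(p):
            # Binary entropy in bits.
            return -p * log2(p) - (1 - p) * log2(1 - p)

        p = 0.12
        # Cleaning 500 noisy labels recovers the entropy lost to noise...
        gain_from_cleaning = 500 * H(p)          # ~265 bits
        # ...while 500 new, equally noisy labels add their channel capacity.
        gain_from_new_data = 500 * (1 - H(p))    # ~235 bits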
    [D] What are the environmental effects of using a pre-trained ML model powered by a GPU?
    The Internet has a lot of information concerning the environmental effects of training an ML model with GPUs. But less is available in terms of the emissions generated by running a trained model on a GPU. For example, Google has a language model called BERT. If I was to deploy it on a web server powered by a GPU, would the emissions be close to nil or would it actually harm the environment? submitted by /u/Alternative-Pause-14 [link] [comments]  ( 1 min )
    [D] ANN architecture for decoupling Signals
    I have read multiple papers, and they never attach their source code, like here: https://www.science.org/doi/10.1126/scirobotics.abc6878 My problem is that I have no idea how to apply ANNs to decouple different measured signals. I don't think I am really able to digest what these papers did; I would appreciate it if someone could help me understand how to implement these algorithms from scratch. Thanks submitted by /u/meldiwin [link] [comments]  ( 1 min )
  • Open

    DSC Weekly 24 May 2022: Is an AI Autumn Around the Corner?
    It is hard, looking at the current technological landscape, to believe that artificial intelligence may actually be facing a reckoning that could cause investment in the field to dry up. If you go by the press releases and even the many products that supposedly incorporate artificial intelligence, it would seem that computers capable of thinking… Read More »  ( 6 min )
    Top Ways in Which Data Science Improves E-Commerce Sales
    Data science refers to the use of algorithms, systems, technology, etc. to glean insights from data of all kinds. Machine learning (ML) and artificial intelligence (AI) make it possible to present shoppers with predictions based on what they like, even before they decide to look for a specific product offering. Data science has been… Read More »  ( 3 min )
    Why is the Gig Economy a New Future?
    The term “gig economy” refers to a free-market system in which project work, fixed-term contracts, and temporary positions are prevalent. Organizations hire independent freelancers or specialists for short-term engagements. “Gig” is a slang word for a job that lasts a specified period, first popularized by musicians or… Read More »  ( 4 min )
    Lessons Learned from Writing My First Python Script
    After 25 Years of Coding in C And Perl. As an independent author/researcher, there is of course nothing in my “job description” that says I should code in Python (or any other language). Yet for a long time, I thought coding in Python would help me a lot. It would mean more readers, and thus… Read More »  ( 6 min )
    Russian Troll Detection By Their Tweets
    This project for me was personal. I experienced the propaganda machine of the Soviet Union and am horrified to see it used on Americans. As a young adult in Soviet Russia, I succumbed to brainwashing and had no idea what was really going on. “Everybody always lies” had been the norm. I came to the… Read More »  ( 4 min )
    Importance of Fearlessness to Exploit the Potential of AI
    “Courage is not the absence of fear, but rather the judgment that something else is more important than fear. The brave may not live forever, but the cautious do not live at all” – Philippe Renaldi, The Princess Diaries (2001) Optimizing versus innovating…the simple difference between surviving versus thriving. Yes, it is always easier to apply… Read More »  ( 7 min )
  • Open

    What is the current SOTA for Online RL?
    Hi everyone! I've been reading a lot about RL and was wondering what the current SOTA models are for online RL. I have noticed that more and more models are starting to use transformers. Is this also where SOTA online RL models are going? Any ideas / opinions about the future direction of this field would also be highly appreciated. Cheers submitted by /u/Idonai [link] [comments]  ( 1 min )
    Do you know any environments in which both the direction and the position of the agent are tracked?
    In the simple spread env, only position is tracked (https://github.com/openai/multiagent-particle-envs/blob/47e9ee38e605f8a563370b3c7e52a349eca3f6b1/multiagent/scenarios/simple_spread.py#L40) submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    How to make game-based RL research paper-worthy?
    I am completely new to RL, though I am experienced with NNs. I decided to do my thesis on RL to take it as a learning opportunity, but I'm beginning to find it quite intimidating to create a sim environment and apply RL, given that I'm inexperienced with RL and have little time for the thesis. Alternatively, games were suggested to me, which are also fun to work with and learn from, but I'm struggling to come up with a thesis statement that I could also use for publication. Some games I have considered are Go, Bomberland (a Bomberman knock-off from Coder One), and Kore (https://www.kaggle.com/competitions/kore-2022). I am also open to other suggestions. submitted by /u/npc1111 [link] [comments]  ( 1 min )
    Exploration strategy for aerial robotics.
    In continuous-action tasks where slight exploration can impact the environment drastically (like aerial robots), what exploration strategy can be used to safely explore the action space? I am currently using the DDPG algorithm for position tracking in drones, with Gaussian noise for exploration. The rewards for this tracking task are sparse, as large negative rewards can make training unstable. So what can I use as the exploration strategy in this scenario? submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
    Is DQN capable of 'solving' random dungeon traversal of unknown length and start/end positions?
    I'm interested in implementing DQN for a dungeon crawler I play. You are given a 2D map with your position as the central point, and you need to traverse to the next zone; the map is limited in scope and is slowly revealed as you move along. There is a map-based marker for the entrance to the next zone. Since it is a dungeon of random size and random start/end positions, with no way to generate a reward until the agent reaches the next zone (i.e., the max overall reward is 1), is it possible for the agent to learn a policy in this scenario? submitted by /u/IFartedAndMyDickHurt [link] [comments]  ( 2 min )
    More threads, training crash
    On one computer, using PPO on Swimmer: with 16 threads, training crashes after 1 hour; with fewer than 16 threads, the agent trains smoothly and well. submitted by /u/OkSkirt5714 [link] [comments]  ( 1 min )
  • Open

    What is Real in a World of Social Constructs
    In today’s episode of Future Tech, I ask the question: what is real? Because in this world of social constructs, everything is created by… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )
    Drupal Commerce: Why is it Ideal for Your E-Commerce Business
    The world of e-commerce is undeniably among the most lucrative sectors in the world at the moment and is expected to remain so for the…  ( 2 min )
  • Open

    The hype around DeepMind’s new AI model misses what’s actually cool about it
    submitted by /u/KelliaMcclure [link] [comments]  ( 1 min )
    In this tutorial, we show how to automate job search using a fine-tuned NER model. Check out the article to learn more!
    submitted by /u/UBIAI [link] [comments]
    Questionnaire for my dissertation on the subject of A.I
    Hey guys! I'm a student and I'm currently working on my dissertation for University. I'm using this as a way of collecting data on the representation of AI in movies and pop culture and I'd appreciate the responses! Here's the link: https://forms.gle/1jrzrfuSd3rFD6A17 submitted by /u/ZakUllah [link] [comments]
    Working on cognitive control in an artificial cognitive entity - when a machine can think about anything, how do you control what it thinks about and why?
    submitted by /u/DavidKShapiro [link] [comments]  ( 1 min )
    Saiyan transformation
    submitted by /u/Due-Ad9795 [link] [comments]
    MELODIES POSITIVE: Flowers from coconut cap.
    submitted by /u/cookingandcraft [link] [comments]
    6 Best Artificial Intelligence Courses for Healthcare You Should Learn in 2022
    submitted by /u/maneesh123456 [link] [comments]
    Unsupervised Tokenization Learning
    submitted by /u/akolonin [link] [comments]  ( 1 min )
    I can't wait for Dall E 2, but this will do for now!
    submitted by /u/BeginningRealistic49 [link] [comments]
    Google Brain's new model Imagen is incredible!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Fast.ai vs statistics.com?
    I'm currently doing a math and CS bachelor's. I want to move into AI research in the near future (something that is applicable now but ideally also contributes to the AGI field). Which sites/resources should I use to learn the relevant knowledge? Are there better ones? submitted by /u/BackgroundSense351 [link] [comments]  ( 1 min )
  • Open

    Image-Text Pre-training with Contrastive Captioners
    Posted by Zirui Wang and Jiahui Yu, Research Scientists, Google Research, Brain Team Oftentimes, machine learning (ML) model developers begin their design using a generic backbone model that is trained at scale and with capabilities transferable to a wide range of downstream tasks. In natural language processing, a number of popular backbone models, including BERT, T5, GPT-3 (sometimes also referred to as “foundation models”), are pre-trained on web-scale data and have demonstrated generic multi-tasking capabilities through zero-shot, few-shot or transfer learning. Compared with training over-specialized individual models, pre-training backbone models for a large number of downstream tasks can amortize the training costs, allowing one to overcome resource limitations when building large s…  ( 7 min )
  • Open

    Powering Next Generation Applications with OpenAI Codex
    Codex is now powering 70 different applications across a variety of use cases through the OpenAI API.  ( 3 min )
  • Open

    Is there any open public API like thispersondoesnotexist.com where I can specify gender and age?
    submitted by /u/AxelTheRabbit [link] [comments]
    How to fix cardinality error in my CNN model
    I am currently working on a CNN model where my inputs are CSVs with 1080 rows and 4 attributes (columns). I have all the CSVs under their own category directories, for example /Category A/..., /Category B/..., etc. At the start I create two lists, `X = []` and `Y = []`, and then in a for loop going through all the directories I read the contents of each CSV in a `1080x4` shape and put it in my X list like this (I have confirmed the values are correctly read and in the correct shape): `X.append(pandas.read_csv(itemPath).values)`. Right after, I add the category of the item to my Y list, so the order of `X` items is aligned with the order of `Y` items (the categories of the X values): `Y.append(cat)`. Then here is my model, albeit I got it mostly from a Kaggle example; I just wanted to see if it works …  ( 4 min )
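    A sketch of the usual fix for this kind of cardinality error, assuming Keras: stack the lists into arrays of shape (N, 1080, 4) and (N,), and make the label encoding match the loss (the variable names mirror the post; the toy data is illustrative):

        import numpy as np
        from tensorflow.keras.utils import to_categorical

        # Stand-ins for the lists built in the loop described above.
        X = [np.random.rand(1080, 4) for _ in range(6)]
        Y = ["Category A", "Category B", "Category A",
             "Category C", "Category B", "Category A"]

        X = np.stack(X)             # shape (N, 1080, 4) instead of a list
        assert len(X) == len(Y)     # a length mismatch here is the usual
                                    # source of Keras cardinality errors

        # Integer-encode, then one-hot for categorical_crossentropy
        # (keep the integers if using sparse_categorical_crossentropy).
        classes = sorted(set(Y))
        Y = to_categorical([classes.index(c) for c in Y],
                           num_classes=len(classes))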
  • Open

    NVIDIA Brings Data Center, Robotics, Gaming, Content Creation Innovations to COMPUTEX
    Digital twins that revolutionize the way the most complex products are produced. Silicon and software that transforms data centers into AI factories. Gaming advances that bring the world’s most popular games to life. Taiwan has become the engine that brings the latest innovations to the world. So it only makes sense that NVIDIA leaders brought… Read article >  ( 7 min )
    NVIDIA Adds Liquid-Cooled GPUs for Sustainable, Efficient Computing
    In the worldwide effort to halt climate change, Zac Smith is part of a growing movement to build data centers that deliver both high performance and energy efficiency. He’s head of edge infrastructure at Equinix, a global service provider that manages more than 240 data centers and is committed to becoming the first in its… Read article >  ( 4 min )
    NVIDIA Partners Announce Wave of New Jetson AGX Orin Servers and Appliances at COMPUTEX
    More than 30 leading technology partners worldwide announced this week the first wave of NVIDIA Jetson AGX Orin-powered production systems at COMPUTEX in Taipei. New products are coming from a dozen Taiwan-based camera, sensor and hardware providers for use in edge AI, AIoT, robotics and embedded applications. Available worldwide since GTC in March, the NVIDIA… Read article >  ( 3 min )
    Master of Arts: NVIDIA RTX GPUs Accelerate Creative Ecosystems, Delivering Unmatched AI and Ray-Tracing Performance
    The future of content creation was on full display during the virtual NVIDIA keynote at COMPUTEX 2022, as the NVIDIA Studio platform expands with new Studio laptops and RTX-powered AI apps — all backed by the May Studio Driver released today.  ( 6 min )
  • Open

    President Guðni Thorlacius Jóhannesson of Iceland visits MIT
    Delegation meets campus leaders, with an eye toward AI applications and the Icelandic language.  ( 6 min )

  • Open

    Our own stupidity might be an advantage for AI
    We are just smart enough to deal with problems logically, but too stupid to understand certain information hazards. Our "just good enough" understanding could be used where AI can't go submitted by /u/burner557799 [link] [comments]
    What are the operating systems with the most AI technology built in right now?
    submitted by /u/Scared_Assistance_28 [link] [comments]
    New Qualcomm RB6 AI Robot Cloud Accelerator Has Quad Core 2.85Ghz @ 200 Trillion Operations Per Second | Breakthrough Neural Network TPU Architecture
    submitted by /u/SlightSituation [link] [comments]  ( 1 min )
    When AI Meets the Transgender Community
    submitted by /u/punkthesystem [link] [comments]  ( 1 min )
    Meta's MyoSuite — An embodied AI platform that unifies neural and motor intelligence
    submitted by /u/SpatialComputing [link] [comments]  ( 1 min )
    Stanford And Oxford Researchers Propose An Approach To Relate Transformers To Models And Neural Representations Of The Hippocampal Formation
    In recent years, a significant part of neuroscience research has focused on relating deep learning architectures to the human brain, and many deep learning (DL) techniques have recently been shown to replicate neural firing patterns observed in the brain. For example, representations of convolutional neural networks have been shown to predict neurons in the visual cortex and inferior temporal cortex, while recurrent neural networks have been shown to recapitulate grid cells in the medial entorhinal cortex. The ability to use machine learning models to predict brain representations allows for a deeper understanding of the mechanistic computations of the respective brain areas and a deeper understanding of the nature of the models. However, one of the most exciting and promising new architec…  ( 2 min )
    Spring (A. I. animation + sound design)
    submitted by /u/nenomancer [link] [comments]  ( 1 min )
    Frame interpolation that allows for multiple sources?
    I'm working with material that was originally mastered in 29.97fps, then converted to 25fps for PAL territories. The PAL source is the best looking copy, in terms of detail, fewer digital artifacts, resolution, and colour - but it's obviously missing frames. Rather than just using AI to create new frames, is there a tool that will allow me to introduce multiple sources so the AI knows what those missing frames should best look like? submitted by /u/Plebsolute [link] [comments]  ( 1 min )
    Interesting AI
    Hello, I plan on giving a TEDx talk on AI because I love AI and have spent a lot of time learning AI, machine learning, and deep learning (with all the math). I thought of talking about interesting AI systems that currently exist, but there are many, so I can't decide. I would love to hear your ideas on the most interesting AI you have seen, which I could talk about. All responses are appreciated. Thanks submitted by /u/No-Conversation8169 [link] [comments]  ( 1 min )
    Superhuman AI - Artificial intelligence predicts patients’ race from their medical images
    submitted by /u/qptbook [link] [comments]
    Machine learning has a backdoor problem
    submitted by /u/bendee983 [link] [comments]
    The StatQuest Illustrated Guide To Machine Learning eBook
    The StatQuest Illustrated Guide To Machine Learning submitted by /u/Futureisnotsecure [link] [comments]
    In The Latest AI Research, CMU And Adobe Researchers Propose An Elegant Emsembling Mechanism For GAN Training That Improves FID by 1.5x to 2x On The Given Dataset
    Image generation necessitates the ability to capture and model complicated statistics in real-world visual events. When trained on large-scale data, computer vision models have proven adept at capturing valuable representations, thanks to the effectiveness of supervised and self-supervised learning techniques. Surprisingly, despite the aforementioned link between synthesis and analysis, state-of-the-art generative adversarial networks (GANs) are trained without the use of such pre-trained networks, in an unsupervised way. This is a squandered opportunity, given the abundance of relevant models readily available in the research ecosystem. Adobe researchers investigated the use of a collection of pretrained deep feature extractors to aid generative model training in a recent publication. Specifically, GANs are trained with a discriminator and a generator, both aimed at continuously learning the relevant statistics that distinguish real from generated data. Continue Reading Paper: https://arxiv.org/pdf/2112.09130.pdf Github: https://github.com/nupurkmr9/vision-aided-gan Project: https://www.cs.cmu.edu/~vision-aided-gan/ submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Flower pot.
    submitted by /u/cookingandcraft [link] [comments]
    15 Machine Learning Project (End to End)
    Hi Guys, here is a free tutorial on machine learning projects (end to end) in Apache Spark and Scala, with code and explanations: a machine learning pipeline application on power-plant data; a movie recommendation engine; sales prediction / forecasting; mushroom classification (edible or poisonous); forest cover prediction; predicting whether it will rain tomorrow in Australia; customer segmentation using machine learning; ad-click prediction (93% accuracy); predicting whether a person makes over 50K a year; classifying gender based on personal preferences; mobile price classification; predicting the cellular localization sites of proteins in yeast; YouTube spam comment prediction; identifying the type of animal (7 types) from the available attributes; glass identification; and predicting the age of abalone from physical measurements. I hope you'll enjoy these tutorials. submitted by /u/bigdataengineer4life [link] [comments]  ( 1 min )
    You are water-based, not silicon-based, life. Be like water.
    The late Terence McKenna, in one of his many talks, pointed out the two choices on the menu when it comes to why we are here. One is the Christian belief that God created the world in six days. The other is that a Big Bang happened some billions of years ago and the universe sprang out of nothingness - as a random happening. His point was this: both of these explanations are utterly improbable. Still, if you hold the Christian belief, then you do not question it and hold it as "the truth". Likewise, if you believe in the Big Bang explanation, you hold that as truth. From other cultures we learn that the world actually resides on the back of a giant turtle. Sounds crazy, right? .. except if you are brought up with that belief. Could the important thing be WHO YOU BECOME by believing a given…  ( 2 min )
  • Open

    [P] Imagen: Latest text-to-image generation model from Google Brain!
    Imagen - unprecedented photorealism × deep level of language understanding Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Human raters prefer Imagen over other models (such as DALL-E 2) in side-by-side comparisons, both in terms of sample quality and image-text alignment. https://gweb-research-imagen.appspot.com/ https://gweb-research-imagen.appspot.com/paper.pdf submitted by /u/aifordummies [link] [comments]  ( 1 min )
    [D] ECCV 2022 Reviews
    Now that ECCV 2022 reviews are out, what is your general feeling about the quality of the reviews? submitted by /u/margilly_ai [link] [comments]  ( 1 min )
    [Discussion] Natural Language as an intermediate link for solving not-NLP machine learning problems
    Hello. Recently I stumbled upon the following RL paper: https://arxiv.org/abs/2202.08938 , and it made me super curious. In their experiments, an agent acts in a simple video-game setup, and textual descriptions of the environment are used as additional information for the agent, to help it perform novelty search more efficiently, because textual descriptions express the real novelty of a new environment more meaningfully than raw data. As they write: "we explore natural language as a general medium for highlighting relevant abstractions in an environment". After reading it, I became curious whether there are any other papers where natural language descriptions are used as an intermediate tool to help a machine learning model perform some non-NLP task. Maybe someone has even tried to teach an NLP model to generate the most useful hints for a non-NLP model solving non-NLP tasks? I would be grateful to learn about any research in this direction. Thanks! submitted by /u/KushnarevaL [link] [comments]  ( 1 min )
    [D] I want to compare training a CNN on dataset A vs. pretraining on another, bigger dataset B and then finetuning on dataset A. Should I use the same learning rate for both experiments, or a reduced learning rate for finetuning?
    I have read that it is commonplace to use a reduced learning rate when finetuning on a new dataset for transfer learning. On the other hand, it does not seem like good practice to do that and then compare against the simple training run, which was performed with a larger learning rate. submitted by /u/TheManveru [link] [comments]  ( 1 min )
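    For concreteness, here is a minimal sketch of the two setups in PyTorch; the checkpoint path, the architecture, and the 10x learning-rate reduction are illustrative assumptions, not a recommendation.

        import torch
        import torchvision.models as models

        # Experiment 1: train from scratch on dataset A at the base learning rate.
        model_scratch = models.resnet18(num_classes=10)
        opt_scratch = torch.optim.SGD(model_scratch.parameters(), lr=0.1)

        # Experiment 2: load weights pretrained on dataset B (hypothetical file),
        # drop the classification head, and finetune on A at a reduced rate.
        model_ft = models.resnet18(num_classes=10)
        state = torch.load("pretrained_on_B.pt")
        state = {k: v for k, v in state.items() if not k.startswith("fc.")}
        model_ft.load_state_dict(state, strict=False)   # fresh head for dataset A
        opt_ft = torch.optim.SGD(model_ft.parameters(), lr=0.01)

    One arguably fairer protocol is to tune the learning rate for each setup independently over the same grid, then compare the best run of each.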
    [D] Scaling sentence embeddings for PCA
    I’m doing PCA on different sentence embeddings (word2vec, BERTweet, InferSent…) of my data. My question is: should I scale these embeddings before putting them into PCA? I know it’s standard practice in ML when using PCA, but I don't know if it still holds for sentence embeddings. Also, if I should, will a standard scaler be OK? submitted by /u/No_Technology1455 [link] [comments]  ( 1 min )
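    Mechanically, the standard recipe is straightforward with scikit-learn; a minimal sketch follows, with random vectors standing in for real embeddings.

        import numpy as np
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA
        from sklearn.pipeline import make_pipeline

        # One row per sentence; random stand-ins for real embedding vectors.
        embeddings = np.random.randn(1000, 768)

        # Standardize each dimension, then project. Chaining the steps in a
        # pipeline ensures the scaler is fit on the same data PCA sees.
        pipeline = make_pipeline(StandardScaler(), PCA(n_components=50))
        reduced = pipeline.fit_transform(embeddings)
        print(reduced.shape)  # (1000, 50)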
    [D] What's the intuition behind certain CNN architectures?
    My knowledge about CNNs is relatively basic. I know what they do, but I don't have enough experience to understand the different architectural choices (besides the obvious improvements, like a residual block). For what I do I never needed to rely on CNNs, so whenever I worked with them I just used some existing CNN structures, but now I would like to learn more about it, so that I can optimize my current models. For example, looking at some CNNs used in reinforcement learning, here's one in particular (the CNN from the Atari self-learning Nature article), where init_ and Flatten are presumably user-defined helpers and nn is torch.nn:

        self.base = nn.Sequential(
            init_(nn.Conv2d(4, 32, kernel_size=8, stride=4, padding=0)),
            nn.ReLU(),
            init_(nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0)),
            nn.ReLU(),
            init_(nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=0)),
            nn.ReLU(),
            Flatten(),
            init_(nn.Linear(32 * 7 * 7, outputs)),
            nn.ReLU()
        )

    The input is a stack of 4 frames with 84x84 grayscale pixels. I understand that you'd want to divide this image into many smaller fields, but what's the intuition for doing it in such a way, e.g. 3 layers of CNNs? Why not 5? Why not 2 or 1, but with more outputs? In fact, it seems to me that a parallel approach with different parameters, instead of a sequential approach, would be superior, since information would get lost after the first layer that uses a kernel size of 8. Are there any resources you could recommend that delve deeply into the nuances of creating CNN architectures? Thank you submitted by /u/NikEy [link] [comments]  ( 7 min )
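    As a side note, the 32 * 7 * 7 in the linear layer follows directly from standard convolution arithmetic; a quick check (the helper below is just the no-padding output-size formula):

        def conv_out(size, kernel, stride):
            # Output width of a conv layer with no padding.
            return (size - kernel) // stride + 1

        s = 84                     # input frames are 84x84
        s = conv_out(s, 8, 4)      # -> 20
        s = conv_out(s, 4, 2)      # -> 9
        s = conv_out(s, 3, 1)      # -> 7, so the flattened size is 32 * 7 * 7
        print(s)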
    AI book reading platform using machine learning [P]
    Greetings Folks, In the past year we released a book-reading AI tool to search for content within files using natural-language search, and we received constructive feedback from the machine learning community. We are releasing a new, updated version with a fresh UI overhaul (desktop support): https://rastero.io/books/explore This project utilizes the sentence transformers library from UKP labs found here. Glad to share it with you all. Here's an account to try it out! username: reddit password: reddit2022 There is a demo of the interface showing semantic-similarity-based search. submitted by /u/deep_ak [link] [comments]  ( 1 min )
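    For anyone curious what the underlying retrieval looks like, here is a minimal semantic-search sketch with the sentence-transformers library; the model name is a common default, not necessarily what this site uses.

        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")

        passages = [
            "The mitochondria is the powerhouse of the cell.",
            "Gradient descent minimizes a loss function iteratively.",
            "Paris is the capital of France.",
        ]
        corpus_emb = model.encode(passages, convert_to_tensor=True)
        query_emb = model.encode("How do cells produce energy?", convert_to_tensor=True)

        # Rank passages by cosine similarity to the query.
        hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
        for hit in hits:
            print(passages[hit["corpus_id"]], round(hit["score"], 3))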
    [Discussion] What are frameworks used for Human-In-The-Loop (Active) learning?
    We have a new request from our product team: "Develop a human-in-the-loop binary classifier for a fashion application. The classifier should learn the themes of photos and classify the future data with a high degree of confidence." The human in the loop will be an annotator (or expert) whom we can query to label examples (in mini-batches). The number of queries is limited by confidence score and budget. Are there any tools/frameworks for building such (active learning) applications? submitted by /u/UncertainLangur [link] [comments]  ( 1 min )
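    One commonly cited option is modAL, a small active-learning framework built on scikit-learn. A minimal uncertainty-sampling loop might look like the sketch below; the data, budget, and oracle are placeholders.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from modAL.models import ActiveLearner
        from modAL.uncertainty import uncertainty_sampling

        # Toy features standing in for image embeddings from a vision model.
        X_pool = np.random.randn(500, 64)
        y_pool = (X_pool[:, 0] > 0).astype(int)    # labels only the oracle knows

        learner = ActiveLearner(
            estimator=RandomForestClassifier(),
            query_strategy=uncertainty_sampling,
            X_training=X_pool[:10], y_training=y_pool[:10],
        )

        budget = 20                                 # annotation budget
        for _ in range(budget):
            idx, X_query = learner.query(X_pool)    # most uncertain example
            y_query = y_pool[idx]                   # stand-in for the human annotator
            learner.teach(X_query, y_query)         # update the model and repeat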
    [D] Anyone waiting for ECCV reviews?
    They should come out today, but I guess they are delayed. submitted by /u/SeucheAchat9115 [link] [comments]
    [P] CTranslate2: an efficient inference engine for Transformer models
    Hi! I'd like to share this project I've been working on for almost 4 years: https://github.com/OpenNMT/CTranslate2 CTranslate2 is a C++ and Python library for efficient inference with Transformer models. While the project initially focused on translation models (hence the name), it also supports autoregressive language models such as GPT-2 and the recent OPT models from Meta. The library comes with a highly optimized runtime that implements various performance optimization techniques such as weight quantization, layer fusion, batch reordering, padding removal, etc. (Check out these benchmark tables for a comparison with other frameworks on a translation task.) We provide model converters for multiple frameworks: OpenNMT, Fairseq, Marian, and Hugging Face's Transformers. You can also add your own converter if the model architecture is supported. We currently support selected variants of encoder-decoder and decoder-only Transformer models (including pre-norm and post-norm). If you'd like to learn more, please visit the GitHub repository and feel free to post questions or suggestions about the project! Thanks! submitted by /u/guillaumekln [link] [comments]  ( 1 min )
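    As a flavour of the Python API, loading a converted model and translating one batch looks roughly like this; the model path and tokens are illustrative, and inputs must already be tokenized (e.g. with SentencePiece).

        import ctranslate2

        # "ende_ct2" is a placeholder path to a model produced by one of the
        # provided converters; CTranslate2 consumes pre-tokenized input.
        translator = ctranslate2.Translator("ende_ct2", device="cpu")

        batch = [["▁Hello", "▁world", "!"]]     # SentencePiece-style tokens
        results = translator.translate_batch(batch, beam_size=4)
        print(results[0].hypotheses[0])         # best hypothesis, as tokens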
    [R] Self-Net: Lifelong Learning Via Continual Self-Modeling
    submitted by /u/EducationalCicada [link] [comments]
    [N] ShiftHappens Workshop @ICML 2022 welcoming submissions & AMA
    I'm one of the organizers of the ShiftHappens Workshop hosted at ICML 2022. The goal of the workshop is to create a community-built robustness benchmark, incorporating new challenging datasets and tasks. Submissions can be ImageNet-scale datasets or evaluation tasks that help find potentially unforeseen behaviours of tested models. All accepted contributions will be part of the benchmark and authors can become co-authors of a summary paper. To make it easy to contribute, we now also accept submissions in the form of extended abstracts, where an interesting idea for a dataset or task is explained in a one-page paper format; see also our latest tweet! We welcome datasets and tasks published at previous conferences as well as novel and work-in-progress papers. So if you, e.g., evaluated your training method on a new task highlighting its advantages, we would be happy to receive that task as a submission :) If you don't have one, share this with friends and colleagues who might ;) Also check out our strong line-up of invited speakers, which you might not want to miss if you are at the ICML conference this year! I and some of the other organizers are around to answer any questions about submitting, data shift and the workshop in general. submitted by /u/JBitterwolf [link] [comments]  ( 1 min )
    [R] Nothing makes sense in deep learning, except in the light of evolution
    submitted by /u/DevFRus [link] [comments]  ( 1 min )
    [D] - Is it safe to force-stop, save weights and later resume a TensorFlow model during training?
    So I am working with a U-Net (CNN) architecture with an Adam optimizer. Occasionally I have force-stopped the model, saved the weights, then resumed training later on. When I resume, the loss plot looks less smooth than it did earlier, but the loss also seems to have decreased significantly. Is it generally safe to stop and resume model training in this way? Also, I'm not sure how to interpret the decreased but less smooth loss function. submitted by /u/suoarski [link] [comments]  ( 2 min )
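    One plausible explanation for the less smooth curve: saving only the weights discards Adam's per-parameter moment estimates, which then restart from zero on resume. A minimal sketch of checkpointing both model and optimizer state in TensorFlow (the tiny model and paths are placeholders):

        import tensorflow as tf

        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # stand-in for the U-Net
        optimizer = tf.keras.optimizers.Adam()

        # Checkpoint the weights and the optimizer slots (Adam's m and v).
        ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
        manager = tf.train.CheckpointManager(ckpt, "./ckpts", max_to_keep=3)

        # ... before force-stopping:
        manager.save()

        # ... when resuming training later:
        ckpt.restore(manager.latest_checkpoint)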
    ICML 2022 papers with affiliations [D]
    https://icml.cc/Conferences/2022/AcceptedPapersInitial On a quick Ctrl-F: Tsinghua (157) overtakes Stanford (139). In 2021, these values were Tsinghua (71) and Stanford (140). http://web.archive.org/web/20210618094315/https://icml.cc/Conferences/2021/AcceptedPapersInitial ^ 2021 for comparison submitted by /u/Simping4Kaiming [link] [comments]
    [D] ML library (ideally for R) with a good software design?
    Hi! As a Machine Learning Engineer, I was studying the design patterns behind scikit-learn's API (you can see here and here) and I was wondering if any of you know of something similar but for R that I can check. Note: I am asking about R because that's what I am using and it is difficult to find something functional-programming-oriented in other languages, but any other library you find interesting is welcome! submitted by /u/Silver_Book_938 [link] [comments]  ( 2 min )
    [D] How to obtain frequencies of English word phrases?
    I know one way to calculate frequencies of English word phrases such as "it's raining today" is to simply search for that phrase in a large text corpus, but would anyone happen to know of any simpler/newer ways to do that? I'm wondering if it's somehow possible to calculate the frequencies of word phrases using a large language model such as GPT-3. submitted by /u/Specialist_Art_8502 [link] [comments]  ( 1 min )
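    One caveat: a language model gives likelihoods, not corpus counts, but a phrase's average token log-probability is often a usable proxy for how common it is. A hedged sketch with Hugging Face's GPT-2 (small):

        import torch
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
        model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

        def avg_log_prob(phrase):
            # The model's loss is the mean negative log-likelihood per token,
            # so its negation is the average token log-probability.
            ids = tokenizer(phrase, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, labels=ids)
            return -out.loss.item()

        print(avg_log_prob("it's raining today"))
        print(avg_log_prob("it's raining lobsters"))  # should score lower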
  • Open

    Breakthrough Stashing Algorithm & Neural Network TPU Architecture
    submitted by /u/getrich_or_diemining [link] [comments]
    How I (Kinda) Created an A.I. Generated Cartoon
    submitted by /u/BasicallyJustASpider [link] [comments]  ( 1 min )
  • Open

    Voiced and unvoiced consonants and digits
    The latest episode of The History of English Podcast discusses the history of pronunciation changes in the Elizabethan period. The episode has a lot to say about the connections between voiced and unvoiced pairs of consonants, and the circumstances under which a consonant might change from voiced to unvoiced and vice versa. The major mnemonic […] Voiced and unvoiced consonants and digits first appeared on John D. Cook.  ( 2 min )
    Year share
    This post will be about psychology as much as math, looking at a number of algorithms for mentally calculating the same function. The most difficult part of mentally computing days of the week is computing ⌊5y/4⌋ % 7 where y is the last two digits of a year. This quantity is called the year share […] Year share first appeared on John D. Cook.  ( 3 min )
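    The excerpt is cut off here, but the quantity itself is easy to experiment with; a few lines of Python implementing just the formula quoted above:

        def year_share(y):
            # floor(5y/4) mod 7, where y is the last two digits of a year.
            return (5 * y // 4) % 7

        for y in (0, 22, 76, 99):
            print(y, year_share(y))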
  • Open

    What are some recent or timeless interesting papers on approaches to Dynamic Pricing in RL? I'm thinking of Uber surge pricing or limited-supply, queue-length-dependent pricing policies to maximize revenue.
    submitted by /u/n3ver_summer [link] [comments]  ( 1 min )
    Is it possible to use the MuJoCo Gym environments with the new Python binding?
    It appears Gym uses the old Python integration "mujoco-py" rather than the new official one (https://mujoco.readthedocs.io/en/latest/python.html). Is it possible to use the Gym environments with the new Python binding? If not, will Gym be updated to support the new official Python binding? submitted by /u/arckonte [link] [comments]  ( 1 min )
    7 Best Keras Online Courses for Deep Learning
    submitted by /u/MlTut [link] [comments]
    Does semi-gradient TD(lambda) + experience replay make sense?
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
  • Open

    How HRs Can Use Recruitment Data Analytics for Better Hiring
    Nearly every forward-thinking organization uses analytics in recruitment to bring efficiency to its hiring process. A significant…  ( 3 min )
  • Open

    (De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools
    It’s a well-known challenge that large language models (LLMs)—growing in popularity thanks to their adaptability across a variety of applications—carry risks. Because they’re trained on large amounts of data from across the internet, they’re capable of generating inappropriate and harmful language based on similar language encountered during training.   Content moderation tools can be deployed to […] The post (De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools appeared first on Microsoft Research.  ( 10 min )
    Partnering people with large language models to find and fix bugs in NLP systems
    Advances in platform models—large-scale models that can serve as foundations across applications—have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating “Eu não recomendo este prato” (I don’t recommend this dish) in Portuguese to “I highly […] The post Partnering people with large language models to find and fix bugs in NLP systems appeared first on Microsoft Research.  ( 10 min )
  • Open

    Energy Grids Plug into AI for a Brighter, Cleaner Future
    Electric utilities are taking a course in machine learning to create smarter grids for tough challenges ahead. The winter 2021 megastorm in Texas left millions without power. Grid failures the past two summers sparked devastating wildfires amid California’s record drought. “Extreme weather events of 2021 highlighted the risks climate change is introducing, and the importance Read article > The post Energy Grids Plug into AI for a Brighter, Cleaner Future appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Why Should the Retail Sector Move to Cloud Computing?
    What is cloud computing? It is one of the foremost technological innovations of the 21st century and has seen the fastest adoption into mainstream use of any technology. Simply put, it's the ability to store and run software programs on cloud platforms rather than storing and running them on local servers or computers.… The post Why Should the Retail Sector Move to Cloud Computing? appeared first on Data Science Central.  ( 4 min )
  • Open

    Deconfounding Actor-Critic Network with Policy Adaptation for Dynamic Treatment Regimes. (arXiv:2205.09852v1 [cs.LG])
    Despite intense efforts in basic and clinical research, an individualized ventilation strategy for critically ill patients remains a major challenge. Recently, dynamic treatment regime (DTR) with reinforcement learning (RL) on electronic health records (EHR) has attracted interest from both the healthcare industry and machine learning research community. However, most learned DTR policies might be biased due to the existence of confounders. Although some treatment actions non-survivors received may be helpful, if confounders cause the mortality, the training of RL models guided by long-term outcomes (e.g., 90-day mortality) would punish those treatment actions, causing the learned DTR policies to be suboptimal. In this study, we develop a new deconfounding actor-critic network (DAC) to learn optimal DTR policies for patients. To alleviate confounding issues, we incorporate a patient resampling module and a confounding balance module into our actor-critic framework. To avoid punishing the effective treatment actions non-survivors received, we design a short-term reward to capture patients' immediate health state changes. Combining short-term with long-term rewards could further improve the model performance. Moreover, we introduce a policy adaptation method to successfully transfer the learned model to new-source small-scale datasets. The experimental results on one semi-synthetic and two different real-world datasets show the proposed model outperforms the state-of-the-art models. The proposed model provides individualized treatment decisions for mechanical ventilation that could improve patient outcomes.  ( 2 min )
    Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers. (arXiv:2205.08078v2 [cs.LG] UPDATED)
    Vision transformers using self-attention or its proposed alternatives have demonstrated promising results in many image-related tasks. However, the underpinning inductive bias of attention is not well understood. To address this issue, this paper analyzes attention through the lens of convex duality. For the non-linear dot-product self-attention, and alternative mechanisms such as MLP-mixer and Fourier Neural Operator (FNO), we derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality. The convex programs lead to {\it block nuclear-norm regularization} that promotes low rank in the latent feature and token dimensions. In particular, we show how self-attention networks implicitly cluster the tokens based on their latent similarity. We conduct experiments for transferring a pre-trained transformer backbone for CIFAR-100 classification by fine-tuning a variety of convex attention heads. The results indicate the merits of the bias induced by attention compared with the existing MLP or linear heads.  ( 2 min )
    Let the Model Decide its Curriculum for Multitask Learning. (arXiv:2205.09898v1 [cs.LG])
    Curriculum learning strategies in prior multi-task learning approaches arrange datasets in a difficulty hierarchy either based on human perception or by exhaustively searching the optimal arrangement. However, human perception of difficulty may not always correlate well with machine interpretation, leading to poor performance, and exhaustive search is computationally expensive. Addressing these concerns, we propose two classes of techniques to arrange training instances into a learning curriculum based on difficulty scores computed via model-based approaches. The two classes, i.e., Dataset-level and Instance-level, differ in the granularity of arrangement. Through comprehensive experiments with 12 datasets, we show that instance-level and dataset-level techniques result in strong representations as they lead to an average performance improvement of 4.17% and 3.15% over their respective baselines. Furthermore, we find that most of this improvement comes from correctly answering the difficult instances, implying a greater efficacy of our techniques on difficult tasks.  ( 2 min )
    Incremental Learning with Differentiable Architecture and Forgetting Search. (arXiv:2205.09875v1 [cs.LG])
    As progress is made on training machine learning models on incrementally expanding classification tasks (i.e., incremental learning), a next step is to translate this progress to industry expectations. One technique missing from incremental learning is automatic architecture design via Neural Architecture Search (NAS). In this paper, we show that leveraging NAS for incremental learning results in strong performance gains for classification tasks. Specifically, we contribute the following: first, we create a strong baseline approach for incremental learning based on Differentiable Architecture Search (DARTS) and state-of-the-art incremental learning strategies, outperforming many existing strategies trained with similar-sized popular architectures; second, we extend the idea of architecture search to regularize architecture forgetting, boosting performance past our proposed baseline. We evaluate our method on both RF signal and image classification tasks, and demonstrate we can achieve up to a 10% performance increase over state-of-the-art methods. Most importantly, our contribution enables learning from continuous distributions on real-world application data for which the complexity of the data distribution is unknown, or the modality less explored (such as RF signal classification).  ( 2 min )
    Predicting electrode array impedance after one month from cochlear implantation surgery. (arXiv:2205.10021v1 [cs.LG])
    Sensorineural hearing loss can be treated using cochlear implantation. After this surgery, electrode array impedance measurements let us check the stability of the impedance value and the dynamic range. Deterioration of speech recognition scores can happen because of increased impedance values. These measurements are taken many times during the year after the surgery. Predicting the electrode impedance could help in taking decisions to help the patient get better hearing. In this research we used a dataset of 80 pediatric patients who underwent cochlear implantation with the MED-EL FLEX28 electrode array of 12 channels. We predicted the electrode impedance on each channel one month after the surgery date. We used different machine learning algorithms like neural networks and decision trees. Our results indicate that the electrode impedance can be predicted, and that the best algorithm differs by electrode channel. Also, the accuracy level varies between 66% and 100% across electrode channels when accepting an error range between 0 and 3 kΩ. Further research is required to predict the electrode impedance after three months, six months and one year.  ( 2 min )
    Neural-Symbolic Models for Logical Queries on Knowledge Graphs. (arXiv:2205.10128v1 [cs.AI])
    Answering complex first-order logic (FOL) queries on knowledge graphs is a fundamental task for multi-hop reasoning. Traditional symbolic methods traverse a complete knowledge graph to extract the answers, which provides good interpretation for each step. Recent neural methods learn geometric embeddings for complex queries. These methods can generalize to incomplete knowledge graphs, but their reasoning process is hard to interpret. In this paper, we propose Graph Neural Network Query Executor (GNN-QE), a neural-symbolic model that enjoys the advantages of both worlds. GNN-QE decomposes a complex FOL query into relation projections and logical operations over fuzzy sets, which provides interpretability for intermediate variables. To reason about the missing links, GNN-QE adapts a graph neural network from knowledge graph completion to execute the relation projections, and models the logical operations with product fuzzy logic. Extensive experiments on 3 datasets show that GNN-QE significantly improves over previous state-of-the-art models in answering FOL queries. Meanwhile, GNN-QE can predict the number of answers without explicit supervision, and provide visualizations for intermediate variables.  ( 2 min )
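    For reference, the product fuzzy logic mentioned here has standard elementwise definitions over membership scores in [0, 1]; a small sketch follows (vectorizing over candidate answers is an assumption about how one would apply it):

        import numpy as np

        # Standard product fuzzy logic, applied elementwise to fuzzy answer sets
        # (one membership score per candidate entity).
        def fuzzy_and(x, y):
            return x * y

        def fuzzy_or(x, y):
            return x + y - x * y

        def fuzzy_not(x):
            return 1.0 - x

        a = np.array([0.9, 0.2, 0.5])
        b = np.array([0.8, 0.7, 0.1])
        print(fuzzy_and(a, b))   # [0.72 0.14 0.05]
        print(fuzzy_or(a, b))    # [0.98 0.76 0.55]
        print(fuzzy_not(a))      # [0.1 0.8 0.5]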
    Robust Expected Information Gain for Optimal Bayesian Experimental Design Using Ambiguity Sets. (arXiv:2205.09914v1 [stat.ML])
    The ranking of experiments by expected information gain (EIG) in Bayesian experimental design is sensitive to changes in the model's prior distribution, and the approximation of EIG yielded by sampling will have errors similar to the use of a perturbed prior. We define and analyze \emph{robust expected information gain} (REIG), a modification of the objective in EIG maximization by minimizing an affine relaxation of EIG over an ambiguity set of distributions that are close to the original prior in KL-divergence. We show that, when combined with a sampling-based approach to estimating EIG, REIG corresponds to a `log-sum-exp' stabilization of the samples used to estimate EIG, meaning that it can be efficiently implemented in practice. Numerical tests combining REIG with variational nested Monte Carlo (VNMC), adaptive contrastive estimation (ACE) and mutual information neural estimation (MINE) suggest that in practice REIG also compensates for the variability of under-sampled estimators.  ( 2 min )
    When, where, and how to add new neurons to ANNs. (arXiv:2202.08539v2 [cs.LG] UPDATED)
    Neurogenesis in ANNs is an understudied and difficult problem, even compared to other forms of structural learning like pruning. By decomposing it into triggers and initializations, we introduce a framework for studying the various facets of neurogenesis: when, where, and how to add neurons during the learning process. We present the Neural Orthogonality (NORTH*) suite of neurogenesis strategies, combining layer-wise triggers and initializations based on the orthogonality of activations or weights to dynamically grow performant networks that converge to an efficient size. We evaluate our contributions against other recent neurogenesis works across a variety of supervised learning tasks.  ( 2 min )
    Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization. (arXiv:2205.10217v1 [stat.ML])
    The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least a layer with $\Omega(N)$ neurons, $N$ being the number of training samples. Furthermore, there is increasing evidence suggesting that deep networks with sub-linear layer widths are powerful memorizers and optimizers, as long as the number of parameters exceeds the number of samples. Thus, a natural open question is whether the NTK is well conditioned in such a challenging sub-linear setup. In this paper, we answer this question in the affirmative. Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks with the minimum possible over-parameterization: the number of parameters is roughly $\Omega(N)$ and, hence, the number of neurons is as little as $\Omega(\sqrt{N})$. To showcase the applicability of our NTK bounds, we provide two results concerning memorization capacity and optimization guarantees for gradient descent training.  ( 2 min )
    KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation. (arXiv:2205.09921v1 [cs.CL])
    Relative positional embeddings (RPE) have received considerable attention since RPEs effectively model the relative distance among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences. We achieve this goal using conditionally positive definite (CPD) kernels, a class of functions known for generalizing distance metrics. To maintain the inner product interpretation of self-attention, we show that a CPD kernel can be transformed into a PD kernel by adding a constant offset. This offset is implicitly absorbed in the Softmax normalization during self-attention. The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way. Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets.  ( 2 min )
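    The "implicitly absorbed" step is just the shift invariance of softmax: adding the same constant to every logit in a row leaves the attention weights unchanged. As a one-line check (the notation here is generic, not the paper's):

        \mathrm{softmax}(z_j + c) = \frac{e^{z_j + c}}{\sum_k e^{z_k + c}} = \frac{e^{c}\, e^{z_j}}{e^{c} \sum_k e^{z_k}} = \mathrm{softmax}(z_j)

    so replacing a CPD kernel $K$ with the PD kernel $K + c$ shifts every attention logit in a query's row by the same $c$ and leaves the attention distribution unchanged.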
    Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates. (arXiv:2205.09860v1 [cs.LG])
    We consider optimizing two-layer neural networks in the mean-field regime where the learning dynamics of network weights can be approximated by the evolution in the space of probability measures over the weight parameters associated with the neurons. The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime which is only restricted locally in the so-called neural tangent kernel space around specialized initializations. Several prior works (\cite{mei2018mean, chizat2018global}) establish the asymptotic global optimality of the mean-field regime, but it is still challenging to obtain a quantitative convergence rate due to the complicated nonlinearity of the training dynamics. This work establishes a new linear convergence result for two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime. Our result relies on a novel logarithmic Sobolev inequality for two-layer neural networks, and uniform upper bounds on the logarithmic Sobolev constants for a family of measures determined by the evolving distribution of hidden neurons.  ( 2 min )
    An Artificial Neural Network Functionalized by Evolution. (arXiv:2205.10118v1 [cs.NE])
    The topology of artificial neural networks has a significant effect on their performance. Characterizing efficient topology is a field of promising research in Artificial Intelligence. However, it is not a trivial task and it is mainly experimented on through convolutional neural networks. We propose a hybrid model which combines the tensor calculus of feed-forward neural networks with Pseudo-Darwinian mechanisms. This allows for finding topologies that are well adapted for elaboration of strategies, control problems or pattern recognition tasks. In particular, the model can provide adapted topologies at early evolutionary stages, and 'structural convergence', which can find applications in robotics, big data and artificial life.  ( 2 min )
    Topology-aware Graph Neural Networks for Learning Feasible and Adaptive ac-OPF Solutions. (arXiv:2205.10129v1 [eess.SY])
    Solving the optimal power flow (OPF) problem is a fundamental task to ensure the system efficiency and reliability in real-time electricity grid operations. We develop a new topology-informed graph neural network (GNN) approach for predicting the optimal solutions of real-time ac-OPF problem. To incorporate grid topology into the NN model, the proposed GNN-for-OPF framework innovatively exploits the locality property of locational marginal prices and voltage magnitude. Furthermore, we develop a physics-aware (ac-)flow feasibility regularization approach for general OPF learning. The advantages of our proposed designs include reduced model complexity, improved generalizability and feasibility guarantees. By providing the analytical understanding on the graph subspace stability under grid topology contingency, we show the proposed GNN can quickly adapt to varying grid topology by an efficient re-training strategy. Numerical tests on various test systems of different sizes have validated the prediction accuracy, improved flow feasibility, and topology adaptivity capability of our proposed GNN-based learning framework.  ( 2 min )
    Bounding the Effects of Continuous Treatments for Hidden Confounders. (arXiv:2204.11206v2 [stat.ME] UPDATED)
    Observational studies often seek to infer the causal effect of a treatment even though both the assigned treatment and the outcome depend on other confounding variables. An effective strategy for dealing with confounders is to estimate a propensity model that corrects for the relationship between covariates and assigned treatment. Unfortunately, the confounding variables themselves are not always observed, in which case we can only bound the propensity, and therefore bound the magnitude of causal effects. In many important cases, like administering a dose of some medicine, the possible treatments belong to a continuum. Sensitivity models, which are required to tie the true propensity to something that can be estimated, have been explored for binary treatments. We propose one for continuous treatments. We develop a framework to compute ignorance intervals on the partially identified dose-response curves, enabling us to quantify the susceptibility of an inference to hidden confounders. We show with simulations and three real-world observational studies that our approach can give non-trivial bounds on causal effects from continuous treatments in the presence of hidden confounders.
    Sparse Infinite Random Feature Latent Variable Modeling. (arXiv:2205.09909v1 [stat.ML])
    We propose a non-linear, Bayesian non-parametric latent variable model where the latent space is assumed to be sparse and infinite dimensional a priori using an Indian buffet process prior. A posteriori, the number of instantiated dimensions in the latent space is guaranteed to be finite. The purpose of placing the Indian buffet process on the latent variables is to: 1.) Automatically and probabilistically select the number of latent dimensions. 2.) Impose sparsity in the latent space, where the Indian buffet process will select which elements are exactly zero. Our proposed model allows for sparse, non-linear latent variable modeling where the number of latent dimensions is selected automatically. Inference is made tractable using the random Fourier approximation and we can easily implement posterior inference through Markov chain Monte Carlo sampling. This approach is amenable to many observation models beyond the Gaussian setting. We demonstrate the utility of our method on a variety of synthetic, biological and text datasets and show that we can obtain superior test set performance compared to previous latent variable models.
    RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk. (arXiv:2205.10004v1 [cs.LG])
    Failures and anomalies in large-scale software systems are unavoidable incidents. When an issue is detected, operators need to quickly and correctly identify its location to facilitate a swift repair. In this work, we consider the problem of identifying the root cause set that best explains an anomaly in multi-dimensional time series with categorical attributes. The huge search space is the main challenge: even for a small number of attributes and small value sets, the number of theoretical combinations is too large to brute-force. Previous approaches have thus focused on reducing the search space, but they all suffer from various issues, requiring extensive manual parameter tuning, being too slow and thus impractical, or being incapable of finding more complex root causes. We propose RiskLoc to solve the problem of multidimensional root cause localization. RiskLoc applies a 2-way partitioning scheme and assigns element weights that linearly increase with the distance from the partitioning point. A risk score is assigned to each element that integrates two factors, 1) its weighted proportion within the abnormal partition, and 2) the relative change in the deviation score adjusted for the ripple effect property. Extensive experiments on multiple datasets verify the effectiveness and efficiency of RiskLoc, and for a comprehensive evaluation, we introduce three synthetically generated datasets that complement existing datasets. We demonstrate that RiskLoc consistently outperforms state-of-the-art baselines, especially in more challenging root cause scenarios, with gains in F1-score up to 57% over the second-best approach with comparable running times.
    Sigmoidally Preconditioned Off-policy Learning:a new exploration method for reinforcement learning. (arXiv:2205.10047v1 [cs.LG])
    One of the major difficulties of reinforcement learning is learning from {\em off-policy} samples, which are collected by a different policy (behavior policy) from what the algorithm evaluates (the target policy). Off-policy learning needs to correct the distribution of the samples from the behavior policy towards that of the target policy. Unfortunately, importance sampling has an inherent high-variance issue which leads to poor gradient estimation in policy gradient methods. We focus on an off-policy Actor-Critic architecture, and propose a novel method, called Preconditioned Proximal Policy Optimization (P3O), which can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective. {\em This preconditioning uses the sigmoid function in a special way that when there is no policy change, the gradient is maximal and hence policy gradient will drive a big parameter update for an efficient exploration of the parameter space}. This is a novel exploration method that has not been studied before given that existing exploration methods are based on the novelty of states and actions. We compare with several best-performing algorithms on both discrete and continuous tasks and the results confirmed that {\em P3O is more off-policy than PPO} according to the "off-policyness" measured by the DEON metric, and P3O explores in a larger policy space than PPO. Results also show that our P3O maximizes the CPI objective better than PPO during the training process.
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v3 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.
    Optimal Parameter-free Online Learning with Switching Cost. (arXiv:2205.06846v2 [cs.LG] UPDATED)
    Parameter-freeness in online learning refers to the adaptivity of an algorithm with respect to the optimal decision in hindsight. In this paper, we design such algorithms in the presence of switching cost - the latter penalizes the optimistic updates required by parameter-freeness, leading to a delicate design trade-off. Based on a novel dual space scaling strategy, we propose a simple yet powerful algorithm for Online Linear Optimization (OLO) with switching cost, which improves the existing suboptimal regret bound [ZCP22a] to the optimal rate. The obtained benefit is extended to the expert setting, and the practicality of our algorithm is demonstrated through a sequential investment task.
    A toolbox for idea generation and evaluation: Machine learning, data-driven, and contest-driven approaches to support idea generation. (arXiv:2205.09840v1 [cs.LG])
    The significance and abundance of data are increasing due to the growing digital data generated from social media, sensors, scholarly literature, patents, different forms of documents published online, databases, product manuals, etc. Various data sources can be used to generate ideas, yet, in addition to bias, the size of the available digital data is a major challenge when it comes to manual analysis. Hence, human-machine interaction is essential for generating valuable ideas where machine learning and data-driven techniques generate patterns from data and serve human sense-making. However, the use of machine learning and data-driven approaches to generate ideas is a relatively new area. Moreover, it is also possible to stimulate innovation using contest-driven idea generation and evaluation. The results and contributions of this thesis can be viewed as a toolbox of idea-generation techniques, including a list of data-driven and machine learning techniques with corresponding data sources and models to support idea generation. In addition, the results include two models, one method and one framework, to better support data-driven and contest-driven idea generation. The beneficiaries of these artefacts are practitioners in data and knowledge engineering, data mining project managers, and innovation agents. Innovation agents include incubators, contest organizers, consultants, innovation accelerators, and industries. Since the proposed artefacts consist of process models augmented with AI techniques, human-centred AI is a promising area of research that can contribute to the artefacts' further development and promote creativity.
    STaR: Bootstrapping Reasoning With Reasoning. (arXiv:2203.14465v2 [cs.LG] UPDATED)
    Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30$\times$ larger state-of-the-art language model on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.
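    The loop in the abstract is concrete enough to sketch in pseudocode; the function names below are placeholders for model calls, not a real API.

        # Hedged sketch of one STaR outer iteration, following the abstract.
        def star_iteration(model, dataset, few_shot_prompt):
            finetune_set = []
            for question, answer in dataset:
                rationale, prediction = generate_rationale(model, few_shot_prompt, question)
                if prediction != answer:
                    # Retry with the correct answer given as a hint ("rationalization").
                    rationale, prediction = generate_rationale(
                        model, few_shot_prompt, question, hint=answer)
                if prediction == answer:
                    finetune_set.append((question, rationale, answer))
            # Fine-tune on rationales that yielded correct answers, then repeat
            # the whole procedure with the improved model.
            return finetune(model, finetune_set)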
    Lossless Speedup of Autoregressive Translation with Generalized Aggressive Decoding. (arXiv:2203.16487v4 [cs.CL] UPDATED)
    Different from previous work accelerating translation at the cost of quality loss, we propose Generalized Aggressive Decoding (GAD) -- a novel decoding paradigm for lossless speedup of autoregressive translation, through the collaboration of autoregressive and non-autoregressive translation (NAT) of the Transformer. At each decoding iteration, GAD aggressively decodes a number of tokens with NAT as a draft and then verifies them in the autoregressive manner, where only the tokens that pass the verification are accepted as decoded tokens. GAD can achieve the same results as autoregressive translation but much more efficiently because both NAT drafting and autoregressive verification compute in parallel. We conduct experiments in four standard WMT benchmarks and confirm that the vanilla GAD yields exactly the same results as greedy decoding with an around $3\times$ speedup, and that its variant (GAD++) with an advanced verification strategy not only outperforms the greedy translation and even achieves the comparable translation quality with the beam search result, but also further improves the decoding speed, resulting in an around $5\times$ speedup over autoregressive translation. Moreover, GAD can be easily generalized for lossless speedup of other seq2seq tasks like Abstractive Summarization, and benefit more from stronger computing devices, demonstrating its potential to become a de facto decoding paradigm in the future. Our models and codes are available at https://github.com/hemingkx/GAD.
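    The accept rule at one decoding iteration can be sketched as follows; nat_draft and ar_verify are placeholders (the real implementation is in the linked repo), and this mirrors the greedy vanilla GAD described in the abstract.

        # One GAD iteration: NAT drafts a block, the autoregressive model scores
        # the same positions in parallel, and only the agreeing prefix is kept.
        def gad_step(prefix, nat_model, ar_model, block_size):
            draft = nat_draft(nat_model, prefix, block_size)   # k drafted tokens
            verified = ar_verify(ar_model, prefix, draft)      # AR choices at the
                                                               # same positions
            accepted = []
            for d, v in zip(draft, verified):
                if d != v:
                    accepted.append(v)   # keep the AR token at the first mismatch
                    break                # and discard the rest of the draft
                accepted.append(d)
            return prefix + accepted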
    Mosaic Zonotope Shadow Matching for Risk-Aware Autonomous Localization in Harsh Urban Environments. (arXiv:2205.10223v1 [cs.AI])
    Risk-aware urban localization with the Global Navigation Satellite System (GNSS) remains an unsolved problem with frequent misdetection of the user's street or side of the street. Significant advances in 3D map-aided GNSS use grid-based GNSS shadow matching alongside AI-driven line-of-sight (LOS) classifiers and server-based processing to improve localization accuracy, especially in the cross-street direction. Our prior work introduces a new paradigm for shadow matching that proposes set-valued localization with computationally efficient zonotope set representations. While existing literature improved accuracy and efficiency, the current state of shadow matching theory does not address the needs of risk-aware autonomous systems. We extend our prior work to propose Mosaic Zonotope Shadow Matching (MZSM) that employs a classifier-agnostic polytope mosaic architecture to provide risk-awareness and certifiable guarantees on urban positioning. We formulate a recursively expanding binary tree that refines an initial location estimate with set operations into smaller polytopes. Together, the smaller polytopes form a mosaic. We weight the tree branches with the probability that the user is in line of sight of the satellite and expand the tree with each new satellite observation. Our method yields an exact shadow matching distribution from which we guarantee uncertainty bounds on the user localization. We perform high-fidelity simulations using a 3D building map of San Francisco to validate our algorithm's risk-aware improvements. We demonstrate that MZSM provides certifiable guarantees across varied data-driven LOS classifier accuracies and yields a more precise understanding of the uncertainty over existing methods. We validate that our tree-based construction is efficient and tractable, computing a mosaic from 14 satellites in 0.63 seconds and growing quadratically in the satellite number.
    An alternative proof of the vulnerability of retrieval in high intrinsic dimensionality neighborhood. (arXiv:2010.00990v2 [cs.LG] UPDATED)
    This paper investigates the vulnerability of the nearest neighbors search, which is a pivotal tool in data analysis and machine learning. The vulnerability is gauged as the relative amount of perturbation that an attacker needs to add onto a dataset point in order to modify its neighbor rank w.r.t. a query. The statistical distribution of this quantity is derived from simple assumptions. Experiments on six large-scale datasets validate this model up to some outliers, which are explained in terms of violations of the assumptions.
    Breaking the $\sqrt{T}$ Barrier: Instance-Independent Logarithmic Regret in Stochastic Contextual Linear Bandits. (arXiv:2205.09899v1 [stat.ML])
    We prove an instance independent (poly) logarithmic regret for stochastic contextual bandits with linear payoff. Previously, in \cite{chu2011contextual}, a lower bound of $\mathcal{O}(\sqrt{T})$ is shown for the contextual linear bandit problem with arbitrary (adversarially chosen) contexts. In this paper, we show that stochastic contexts indeed help to reduce the regret from $\sqrt{T}$ to $\polylog(T)$. We propose Low Regret Stochastic Contextual Bandits (\texttt{LR-SCB}), which takes advantage of the stochastic contexts and performs parameter estimation (in $\ell_2$ norm) and regret minimization simultaneously. \texttt{LR-SCB} works in epochs, where the parameter estimation of the previous epoch is used to reduce the regret of the current epoch. The (poly) logarithmic regret of \texttt{LR-SCB} stems from two crucial facts: (a) the application of a norm adaptive algorithm to exploit the parameter estimation and (b) an analysis of the shifted linear contextual bandit algorithm, showing that shifting results in increasing regret. We have also shown experimentally that stochastic contexts indeed incur a regret that scales with $\polylog(T)$.
    Is explainable AI a race against model complexity?. (arXiv:2205.10119v1 [cs.AI])
    Explaining the behaviour of intelligent systems will get increasingly and perhaps intractably challenging as models grow in size and complexity. We may not be able to expect an explanation for every prediction made by a brain-scale model, nor can we expect explanations to remain objective or apolitical. Our functionalist understanding of these models is of less advantage than we might assume. Models precede explanations, and can be useful even when both model and explanation are incorrect. Explainability may never win the race against complexity, but this is less problematic than it seems.
    Nothing makes sense in deep learning, except in the light of evolution. (arXiv:2205.10320v1 [cs.LG])
    Deep Learning (DL) is a surprisingly successful branch of machine learning. The success of DL is usually explained by focusing analysis on a particular recent algorithm and its traits. Instead, we propose that an explanation of the success of DL must look at the population of all algorithms in the field and how they have evolved over time. We argue that cultural evolution is a useful framework to explain the success of DL. In analogy to biology, we use `development' to mean the process converting the pseudocode or text description of an algorithm into a fully trained model. This includes writing the programming code, compiling and running the program, and training the model. If all parts of the process don't align well then the resultant model will be useless (if the code runs at all!). This is a constraint. A core component of evolutionary developmental biology is the concept of deconstraints -- these are modifications to the developmental process that avoid complete failure by automatically accommodating changes in other components. We suggest that many important innovations in DL, from neural networks themselves to hyperparameter optimization and AutoGrad, can be seen as developmental deconstraints. These deconstraints can be very helpful to both the particular algorithm in how it handles challenges in implementation and the overall field of DL in how easy it is for new ideas to be generated. We highlight how our perspective can both advance DL and lead to new insights for evolutionary biology.
    On the Representation Collapse of Sparse Mixture of Experts. (arXiv:2204.09179v2 [cs.CL] UPDATED)
    Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
    Towards a Holistic View on Argument Quality Prediction. (arXiv:2205.09803v1 [cs.CL])
    Argumentation is one of society's foundational pillars, and, sparked by advances in NLP and the vast availability of text data, automated mining of arguments receives increasing attention. A decisive property of arguments is their strength or quality. While there are works on the automated estimation of argument strength, their scope is narrow: they focus on isolated datasets and neglect the interactions with related argument mining tasks, such as argument identification, evidence detection, or emotional appeal. In this work, we close this gap by approaching argument quality estimation from multiple different angles: Grounded on rich results from thorough empirical evaluations, we assess the generalization capabilities of argument quality estimation across diverse domains, the interplay with related argument mining tasks, and the impact of emotions on perceived argument strength. We find that generalization depends on a sufficient representation of different domains in the training part. In zero-shot transfer and multi-task experiments, we reveal that argument quality is among the more challenging tasks but can improve others. Finally, we show that emotions play a lesser role in argument quality than is often assumed.
    A Survey of Trustworthy Graph Learning: Reliability, Explainability, and Privacy Protection. (arXiv:2205.10014v1 [cs.LG])
    Deep graph learning has achieved remarkable progress in both business and scientific areas ranging from finance and e-commerce, to drug and advanced material discovery. Despite this progress, how to ensure various deep graph learning algorithms behave in a socially responsible manner and meet regulatory compliance requirements becomes an emerging problem, especially in risk-sensitive domains. Trustworthy graph learning (TwGL) aims to solve the above problems from a technical viewpoint. In contrast to conventional graph learning research which mainly cares about model performance, TwGL considers various reliability and safety aspects of the graph learning framework including but not limited to robustness, explainability, and privacy. In this survey, we provide a comprehensive review of recent leading approaches in the TwGL field from three dimensions, namely, reliability, explainability, and privacy protection. We give a general categorization for existing work and review typical work for each category. To give further insights for TwGL research, we provide a unified view to inspect previous works and build the connection between them. We also point out some important open problems remaining to be solved in the future developments of TwGL.  ( 2 min )
    Real Time Multi-Object Detection for Helmet Safety. (arXiv:2205.09878v1 [cs.CV])
    The National Football League and Amazon Web Services teamed up to develop the best sports injury surveillance and mitigation program via a Kaggle competition, through which the NFL wants to assign specific players to each helmet, which would help accurately identify each player's "exposures" throughout a football play. We are trying to implement computer-vision-based ML algorithms capable of assigning detected helmet impacts to the correct players via tracking information. Our paper will explain the approach to automatically track player helmets and their collisions. This will also allow the NFL to review previous plays and explore the trends in exposure over time.  ( 2 min )
    On Tackling Explanation Redundancy in Decision Trees. (arXiv:2205.09971v1 [cs.AI])
    Decision trees (DTs) epitomize the ideal of interpretability of machine learning (ML) models. The interpretability of decision trees motivates explainability approaches by so-called intrinsic interpretability, and it is at the core of recent proposals for applying interpretable ML models in high-risk applications. The belief in DT interpretability is justified by the fact that explanations for DT predictions are generally expected to be succinct. Indeed, in the case of DTs, explanations correspond to DT paths. Since decision trees are ideally shallow, and so paths contain far fewer features than the total number of features, explanations in DTs are expected to be succinct, and hence interpretable. This paper offers both theoretical and experimental arguments demonstrating that, as long as interpretability of decision trees equates with succinctness of explanations, then decision trees ought not be deemed interpretable. The paper introduces logically rigorous path explanations and path explanation redundancy, and proves that there exist functions for which decision trees must exhibit paths with arbitrarily large explanation redundancy. The paper also proves that only a very restricted class of functions can be represented with DTs that exhibit no explanation redundancy. In addition, the paper includes experimental results substantiating that path explanation redundancy is observed ubiquitously in decision trees, including those obtained using different tree learning algorithms, but also in a wide range of publicly available decision trees. The paper also proposes polynomial-time algorithms for eliminating path explanation redundancy, which in practice require negligible time to compute. Thus, these algorithms serve to indirectly attain irreducible, and so succinct, explanations for decision trees.
    Sequentially learning the topological ordering of causal directed acyclic graphs with likelihood ratio scores. (arXiv:2202.01748v2 [stat.ME] UPDATED)
    Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions and a proper learning algorithm, it is sometimes possible to identify and accurately estimate a causal directed acyclic graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is on highlighting the identifiability and estimation of DAGs with general error distributions through a general sequential sorting procedure that orders variables one at a time, starting at root nodes, followed by children of the root nodes, and so on until completion. We demonstrate a novel application of this general approach to estimate the topological ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. The computational complexity of our algorithm on a p-node problem is O(pd), where d is the maximum neighborhood size. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying DAG. We provide extensive numerical evidence to demonstrate that this sequential procedure scales to possibly thousands of nodes and works well for high-dimensional data. We accompany these numerical experiments with an application to a single-cell gene expression dataset.
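    For intuition, a minimal sketch of such a sequential ordering procedure, using a simplified exogeneity score (Gaussian log-likelihood of regression residuals) as a stand-in for the paper's likelihood ratio scores; all names are illustrative:

        import numpy as np

        def residuals(X, y):
            # OLS residuals of y regressed on X (with intercept).
            A = (np.column_stack([np.ones(len(y)), X])
                 if X.size else np.ones((len(y), 1)))
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            return y - A @ beta

        def sequential_order(data):
            n, p = data.shape
            order, remaining = [], list(range(p))
            while remaining:
                scores = []
                for j in remaining:
                    r = residuals(data[:, order], data[:, j])
                    # Simplified score: Gaussian log-likelihood of the residuals;
                    # the paper's likelihood ratio scores are more general.
                    scores.append(-0.5 * n * np.log(np.var(r) + 1e-12))
                order.append(remaining.pop(int(np.argmax(scores))))
            return order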
    Mitigating Statistical Bias within Differentially Private Synthetic Data. (arXiv:2108.10934v3 [stat.ML] UPDATED)
    Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
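    A minimal sketch of the importance-weighting idea, using a density-ratio classifier as a non-private stand-in for the paper's privatised likelihood ratios (all names here are illustrative):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def importance_weights(real, synthetic):
            X = np.vstack([real, synthetic])
            y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
            clf = LogisticRegression(max_iter=1000).fit(X, y)
            p = clf.predict_proba(synthetic)[:, 1]
            return p / (1.0 - p)    # estimated density ratio p_real / p_synthetic

        # Downstream estimators then reweight synthetic samples, e.g.
        # np.average(statistic(synthetic), weights=importance_weights(real, synthetic))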
    Permutation Predictions for Non-Clairvoyant Scheduling. (arXiv:2202.10199v2 [cs.DS] UPDATED)
    In non-clairvoyant scheduling, the task is to find an online strategy for scheduling jobs with a priori unknown processing requirements with the objective to minimize the total (weighted) completion time. We revisit this well-studied problem in a recently popular learning-augmented setting that integrates (untrusted) predictions in online algorithm design. While previous works used predictions on processing requirements, we propose a new prediction model, which provides a relative order of jobs and could be seen as predicting algorithmic actions rather than parts of the unknown input. We show that these predictions have desired properties, admit a natural error measure as well as algorithms with strong performance guarantees, and that they are learnable in both theory and practice. We generalize the algorithmic framework proposed in the seminal paper by Kumar et al. (NeurIPS'18) and present the first learning-augmented scheduling results for weighted jobs and unrelated machines. We demonstrate in empirical experiments the practicability and superior performance compared to the previously suggested single-machine algorithms.
    Function Regression using Spiking DeepONet. (arXiv:2205.10130v1 [cs.NE])
    One of the main broad applications of deep learning is function regression. However, despite their demonstrated accuracy and robustness, modern neural network architectures require heavy computational resources to train. One method to mitigate or even resolve this inefficiency has been to draw further inspiration from the brain and reformulate the learning process in a more biologically plausible way, developing what are known as Spiking Neural Networks (SNNs), which have been gaining traction in recent years. In this paper we present an SNN-based method to perform regression, which has been a challenge due to the inherent difficulty in representing a function's input domain and continuous output values as spikes. We use a DeepONet (a neural network designed to learn operators) to learn the behavior of spikes. We then use this approach to perform function regression. We propose several methods to use a DeepONet in the spiking framework, and present accuracy and training time for different benchmarks.
    Millimeter Wave Localization with Imperfect Training Data using Shallow Neural Networks. (arXiv:2112.05008v2 [cs.NI] UPDATED)
    Millimeter wave (mmWave) localization algorithms exploit the quasi-optical propagation of mmWave signals, which yields sparse angular spectra at the receiver. Geometric approaches to angle-based localization typically require knowledge of the map of the environment and the location of the access points. Thus, several works have resorted to automated learning in order to infer a device's location from the properties of the received mmWave signals. However, collecting training data for such models is a significant burden. In this work, we propose a shallow neural network model to localize mmWave devices indoors. This model requires significantly fewer weights than those proposed in the literature. Therefore, it is amenable to implementation in resource-constrained hardware, and needs fewer training samples to converge. We also propose to relieve training data collection efforts by retrieving (inherently imperfect) location estimates from geometry-based mmWave localization algorithms. Even in this case, our results show that the proposed neural networks perform as well as or better than state-of-the-art algorithms.
    Beyond Labels: Visual Representations for Bone Marrow Cell Morphology Recognition. (arXiv:2205.09880v1 [cs.CV])
    Analyzing and inspecting bone marrow cell cytomorphology is a critical but highly complex and time-consuming component of hematopathology diagnosis. Recent advancements in artificial intelligence have paved the way for the application of deep learning algorithms to complex medical tasks. Nevertheless, there are many challenges in applying effective learning algorithms to medical image analysis, such as the lack of sufficient and reliably annotated training datasets and the highly class-imbalanced nature of most medical data. Here, we improve on the state-of-the-art methodologies of bone marrow cell recognition by deviating from sole reliance on labeled data and leveraging self-supervision in training our learning models. We investigate our approach's effectiveness in identifying bone marrow cell types. Our experiments demonstrate significant performance improvements in conducting different bone marrow cell recognition tasks compared to the current state-of-the-art methodologies.
    Towards efficient feature sharing in MIMO architectures. (arXiv:2205.10139v1 [cs.LG])
    Multi-input multi-output architectures propose to train multiple subnetworks within one base network and then average the subnetwork predictions to benefit from ensembling for free. Despite some relative success, these architectures are wasteful in their use of parameters. Indeed, we highlight in this paper that the learned subnetworks fail to share even generic features, which limits their applicability on smaller mobile and AR/VR devices. We posit this behavior stems from an ill-posed part of the multi-input multi-output framework. To solve this issue, we propose a novel unmixing step in MIMO architectures that allows subnetworks to properly share features. Preliminary experiments on CIFAR-100 show our adjustments allow feature sharing and improve model performance for small architectures.
    WALNUT: A Benchmark on Weakly Supervised Learning for Natural Language Understanding. (arXiv:2108.12603v2 [cs.CL] UPDATED)
    Building machine learning models for natural language understanding (NLU) tasks relies heavily on labeled data. Weak supervision has been proven valuable when a large amount of labeled data is unavailable or expensive to obtain. Existing works studying weak supervision for NLU mostly either focus on a specific task or simulate weak supervision signals from ground-truth labels. It is thus hard to compare different approaches and evaluate the benefit of weak supervision without access to a unified and systematic benchmark with diverse tasks and real-world weak labeling rules. In this paper, we propose such a benchmark, named WALNUT (semi-WeAkly supervised Learning for Natural language Understanding Testbed), to advocate and facilitate research on weak supervision for NLU. WALNUT consists of NLU tasks of different types, including document-level and token-level prediction tasks. WALNUT is the first semi-weakly supervised learning benchmark for NLU, where each task contains weak labels generated by multiple real-world weak sources, together with a small set of clean labels. We conduct baseline evaluations on WALNUT to systematically evaluate the effectiveness of various weak supervision methods and model architectures. Our results demonstrate the benefit of weak supervision for low-resource NLU tasks and highlight interesting patterns across tasks. We expect WALNUT to stimulate further research on methodologies to leverage weak supervision more effectively. The benchmark and code for baselines are available at \url{aka.ms/walnut_benchmark}.
    HyBNN and FedHyBNN: (Federated) Hybrid Binary Neural Networks. (arXiv:2205.09839v1 [cs.LG])
    Binary Neural Networks (BNNs), neural networks with weights and activations constrained to -1(0) and +1, are an alternative to deep neural networks which offer faster training, lower memory consumption and lightweight models, ideal for use in resource-constrained devices while being able to utilize the architecture of their deep neural network counterpart. However, the input binarization step used in BNNs causes a severe accuracy loss. In this paper, we introduce a novel hybrid neural network architecture, Hybrid Binary Neural Network (HyBNN), consisting of a task-independent, general, full-precision variational autoencoder with a binary latent space and a task-specific binary neural network, which greatly limits the accuracy loss due to input binarization by using the full-precision variational autoencoder as a feature extractor. We use it to combine the state-of-the-art accuracy of deep neural networks with the much faster training time, quicker test-time inference and power efficiency of binary neural networks. We show that our proposed system significantly outperforms a vanilla binary neural network with input binarization. We also introduce FedHyBNN, a highly communication-efficient federated counterpart to HyBNN and demonstrate that it is able to reach the same accuracy as its non-federated equivalent. We make our source code, experimental parameters and models available at: https://anonymous.4open.science/r/HyBNN.
    MaskGAE: Masked Graph Modeling Meets Graph Autoencoders. (arXiv:2205.10053v1 [cs.LG])
    We present masked graph autoencoder (MaskGAE), a self-supervised learning framework for graph-structured data. Different from previous graph autoencoders (GAEs), MaskGAE adopts masked graph modeling (MGM) as a principled pretext task: masking a portion of edges and attempting to reconstruct the missing part with the partially visible, unmasked graph structure. To understand whether MGM can help GAEs learn better representations, we provide both theoretical and empirical evidence to justify the benefits of this pretext task. Theoretically, we establish connections between GAEs and contrastive learning, showing that MGM significantly improves the self-supervised learning scheme of GAEs. Empirically, we conduct extensive experiments on a number of benchmark datasets, demonstrating the superiority of MaskGAE over several state-of-the-art methods on both link prediction and node classification tasks. Our code is publicly available at \url{https://github.com/EdisonLeeeee/MaskGAE}.
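    A hedged sketch of the MGM pretext task described above, with generic GAE-style `encoder` and `decoder` callables standing in for MaskGAE's actual architecture:

        import torch
        import torch.nn.functional as F

        def mask_edges(edge_index, mask_ratio=0.3):
            # Split a (2, E) edge index into visible and masked edge sets.
            E = edge_index.size(1)
            perm = torch.randperm(E)
            n_mask = int(mask_ratio * E)
            return edge_index[:, perm[n_mask:]], edge_index[:, perm[:n_mask]]

        def mgm_loss(encoder, decoder, x, edge_index):
            visible, masked = mask_edges(edge_index)
            z = encoder(x, visible)                     # embed the unmasked structure
            pos = decoder(z[masked[0]], z[masked[1]])   # logits for held-out edges
            neg = torch.randint(0, x.size(0), masked.shape)
            neg_logits = decoder(z[neg[0]], z[neg[1]])  # random negative pairs
            return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
                    + F.binary_cross_entropy_with_logits(neg_logits,
                                                         torch.zeros_like(neg_logits)))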
    Stochastic resonance neurons in artificial neural networks. (arXiv:2205.10122v1 [cs.NE])
    Many modern applications of artificial neural networks entail a large number of layers, making traditional digital implementations increasingly complex. Optical neural networks offer parallel processing at high bandwidth, but face the challenge of noise accumulation. We propose here a new type of neural network that uses stochastic resonance as an inherent part of the architecture, and demonstrate the possibility of a significant reduction in the required number of neurons for a given performance accuracy. We also show that such a neural network is more robust against the impact of noise.
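    A toy illustration of the stochastic resonance effect the architecture builds on: for a threshold nonlinearity, moderate injected noise makes a weak subthreshold signal visible in the trial-averaged output (parameters are illustrative, not from the paper):

        import numpy as np

        def sr_neuron(signal, noise_std, threshold=1.0, trials=200):
            rng = np.random.default_rng(0)
            noisy = signal[None, :] + rng.normal(0.0, noise_std,
                                                 (trials, signal.size))
            return (noisy > threshold).mean(axis=0)  # trial-averaged firing rate

        # weak = 0.8 * np.sin(np.linspace(0, 4 * np.pi, 500))  # subthreshold alone
        # The averaged response tracks `weak` best at intermediate noise_std,
        # not at zero noise and not at very large noise.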
    A Rule Search Framework for the Early Identification of Chronic Emergency Homeless Shelter Clients. (arXiv:2205.09883v1 [cs.CY])
    This paper uses rule search techniques for the early identification of emergency homeless shelter clients who are at risk of becoming long term or chronic shelter users. Using a data set from a major North American shelter containing 12 years of service interactions with over 40,000 individuals, the optimized pruning for unordered search (OPUS) algorithm is used to develop rules that are both intuitive and effective. The rules are evaluated within a framework compatible with the real-time delivery of a housing program meant to transition high risk clients to supportive housing. Results demonstrate that the median time to identification of clients at risk of chronic shelter use drops from 297 days to 162 days when the methods in this paper are applied.
    On the Prediction Instability of Graph Neural Networks. (arXiv:2205.10070v1 [cs.LG])
    Instability of trained models, i.e., the dependence of individual node predictions on random factors, can affect reproducibility, reliability, and trust in machine learning systems. In this paper, we systematically assess the prediction instability of node classification with state-of-the-art Graph Neural Networks (GNNs). With our experiments, we establish that multiple instantiations of popular GNN models trained on the same data with the same model hyperparameters result in almost identical aggregated performance but display substantial disagreement in the predictions for individual nodes. We find that up to one third of the incorrectly classified nodes differ across algorithm runs. We identify correlations of hyperparameters, node properties, and the size of the training set with the stability of predictions. In general, maximizing model performance implicitly also reduces model instability.
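    A sketch of the disagreement measurement described above; given predictions from two identically configured training runs, it reports overall disagreement and disagreement among misclassified nodes:

        import numpy as np

        def prediction_disagreement(preds_a, preds_b, labels):
            preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
            overall = float(np.mean(preds_a != preds_b))
            wrong = (preds_a != labels) | (preds_b != labels)
            among_wrong = (float(np.mean(preds_a[wrong] != preds_b[wrong]))
                           if wrong.any() else 0.0)
            return overall, among_wrong   # disagreement concentrates on errors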
    Self-Paced Multi-Agent Reinforcement Learning. (arXiv:2205.10016v1 [cs.AI])
    Curriculum reinforcement learning (CRL) aims to speed up learning of a task by gradually changing the difficulty of the task from easy to hard through control of factors such as initial state or environment dynamics. While automating CRL is well studied in the single-agent setting, an open question in multi-agent reinforcement learning (MARL) is whether controlling the number of agents, together with other factors, in a principled manner is beneficial; prior approaches typically rely on hand-crafted heuristics. In addition, how tasks evolve as the number of agents changes remains understudied, which is critical for scaling to more challenging tasks. We introduce self-paced MARL (SPMARL), which enables optimizing the number of agents together with other environment factors in a principled way, and show that common assumptions, such as that fewer agents always make the task easier, are not generally valid. The curriculum induced by SPMARL reveals the evolution of tasks w.r.t. the number of agents, and experiments show that SPMARL improves performance when the number of agents sufficiently influences task difficulty.
    HCMD-zero: Learning Value Aligned Mechanisms from Data. (arXiv:2202.10122v2 [cs.MA] UPDATED)
    Artificial learning agents are mediating a larger and larger number of interactions among humans, firms, and organizations, and the intersection between mechanism design and machine learning has been heavily investigated in recent years. However, mechanism design methods often make strong assumptions on how participants behave (e.g. rationality), on the kind of knowledge designers have access to a priori (e.g. access to strong baseline mechanisms), or on what the goal of the mechanism should be (e.g. total welfare). Here we introduce HCMD-zero, a general purpose method to construct mechanisms making none of these three assumptions. HCMD-zero learns to mediate interactions among participants and adjusts the mechanism parameters to make itself more likely to be preferred by participants. It does so by remaining engaged in an electoral contest with copies of itself, thereby accessing direct feedback from participants. We test our method on a stylized resource allocation game that highlights the tension between productivity, equality and the temptation to free ride. HCMD-zero produces a mechanism that is preferred by human participants over a strong baseline; it does so automatically, without requiring prior knowledge, and using human behavioral trajectories sparingly and effectively. Our analysis shows HCMD-zero consistently makes the mechanism policy more and more likely to be preferred by human participants over the course of training, and that it results in a mechanism with an interpretable and intuitive policy.
    How to Minimize the Weighted Sum AoI in Multi-Source Status Update Systems: OMA or NOMA?. (arXiv:2205.03143v2 [cs.IT] UPDATED)
    In this paper, the minimization of the weighted sum average age of information (AoI) in a multi-source status update communication system is studied. Multiple independent sources send update packets to a common destination node in a time-slotted manner under a limit on the maximum number of retransmission rounds. Different multiple access schemes, i.e., orthogonal multiple access (OMA) and non-orthogonal multiple access (NOMA), are exploited here over a block-fading multiple access channel (MAC). Constrained Markov decision process (CMDP) problems are formulated to describe the AoI minimization problems considering both transmission schemes. The Lagrangian method is utilised to convert the CMDP problems to unconstrained Markov decision process (MDP) problems, and corresponding algorithms to derive the power allocation policies are obtained. On the other hand, for the case of unknown environments, two online reinforcement learning approaches considering both multiple access schemes are proposed to achieve near-optimal age performance. Numerical simulations validate the improvement of the proposed policy in terms of weighted sum AoI compared to the fixed power transmission policy, and illustrate that NOMA is more favorable in the case of larger packet sizes.
    Deep electric field predictions by drift-reduced Braginskii theory with plasma-neutral interactions based upon experimental images of boundary turbulence. (arXiv:2204.11689v1 [physics.plasm-ph] CROSS LISTED)
    We present 2-dimensional turbulent electric field calculations via physics-informed deep learning consistent with (i) drift-reduced Braginskii theory under the framework of an axisymmetric fusion plasma with purely toroidal field and (ii) experimental estimates of the fluctuating electron density and temperature obtained from analysis of gas puff imaging of a discharge on the Alcator C-Mod tokamak. The inclusion of effects from the locally puffed atomic helium on particle and energy sources within the reduced plasma turbulence model are found to strengthen correlations between the electric field and electron pressure. The neutrals are also directly associated with an observed broadening in the distribution of turbulent field amplitudes and increased ${\bf E \times B}$ shearing rates.
    DDDM: a Brain-Inspired Framework for Robust Classification. (arXiv:2205.10117v1 [cs.NE])
    Despite their outstanding performance in a broad spectrum of real-world tasks, deep artificial neural networks are sensitive to input noises, particularly adversarial perturbations. On the contrary, human and animal brains are much less vulnerable. In contrast to the one-shot inference performed by most deep neural networks, the brain often solves decision-making with an evidence accumulation mechanism that may trade time for accuracy when facing noisy inputs. The mechanism is well described by the Drift-Diffusion Model (DDM). In the DDM, decision-making is modeled as a process in which noisy evidence is accumulated toward a threshold. Drawing inspiration from the DDM, we propose the Dropout-based Drift-Diffusion Model (DDDM) that combines test-phase dropout and the DDM for improving the robustness of arbitrary neural networks. The dropouts create temporally uncorrelated noises in the network that counter perturbations, while the evidence accumulation mechanism guarantees a reasonable decision accuracy. Neural networks enhanced with the DDDM tested in image, speech, and text classification tasks all significantly outperform their native counterparts, demonstrating the DDDM as a task-agnostic defense against adversarial attacks.
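    A hedged sketch of the DDDM decision rule: dropout stays active at test time, per-class evidence accumulates over stochastic forward passes, and the network answers once the evidence margin clears a threshold (the threshold and step cap are illustrative, not the paper's settings):

        import torch

        @torch.no_grad()
        def dddm_predict(model, x, threshold=5.0, max_steps=100):
            model.eval()
            for m in model.modules():          # keep only dropout stochastic
                if isinstance(m, torch.nn.Dropout):
                    m.train()
            evidence = None
            for _ in range(max_steps):
                logp = torch.log_softmax(model(x), dim=-1)
                evidence = logp if evidence is None else evidence + logp
                top2 = evidence.topk(2, dim=-1).values
                if bool((top2[..., 0] - top2[..., 1] >= threshold).all()):
                    break                      # evidence margin cleared the threshold
            return evidence.argmax(dim=-1)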
    Classifying Human Activities using Machine Learning and Deep Learning Techniques. (arXiv:2205.10325v1 [cs.LG])
    Human Activity Recognition (HAR) describes a machine's ability to recognize human actions. Nowadays, many people are health conscious, so they are interested in tracking their daily activities using smartphones or smart watches, which can help them manage their daily routines in a healthy way. With this objective, Kaggle conducted a competition to classify 6 different human activities based on the inertial signals obtained from the smartphones of 30 volunteers. The main challenge in HAR is to overcome the difficulty of separating human activities based on the given data such that no two activities overlap. In this experimentation, first, data visualization is done on expert-generated features with the help of t-distributed Stochastic Neighbor Embedding (t-SNE), followed by applying various Machine Learning techniques like Logistic Regression, Linear SVC, Kernel SVM, and Decision Trees to better classify the 6 distinct human activities. Moreover, Deep Learning techniques like Long Short-Term Memory (LSTM), Bi-Directional LSTM, Recurrent Neural Network (RNN), and Gated Recurrent Unit (GRU) are trained using raw time series data. Finally, metrics like accuracy, confusion matrix, precision and recall are used to evaluate the performance of the Machine Learning and Deep Learning models. Experimental results show that the Linear Support Vector Classifier among machine learning methods and the Gated Recurrent Unit among deep learning methods provided better accuracy for human activity recognition compared to other classifiers.
    Hybrid Transfer in Deep Reinforcement Learning for Ads Allocation. (arXiv:2204.11589v2 [cs.IR] UPDATED)
    Ads allocation, which involves allocating ads and organic items to limited slots in feed with the purpose of maximizing platform revenue, has become a research hotspot. Note that e-commerce platforms usually have multiple entrances for different categories, and some entrances have few visits. Data from these entrances has low coverage, which makes it difficult for the agent to learn. To address this challenge, we propose Similarity-based Hybrid Transfer for Ads Allocation (SHTAA), which effectively transfers samples as well as knowledge from a data-rich entrance to a data-poor entrance. Specifically, we define an uncertainty-aware similarity for MDPs to estimate the similarity of the MDPs for different entrances. Based on this similarity, we design a hybrid transfer method, including instance transfer and strategy transfer, to efficiently transfer samples and knowledge from one entrance to another. Both offline and online experiments on the Meituan food delivery platform demonstrate that the proposed method achieves better performance for the data-poor entrance and increases the revenue for the platform.
    Can Foundation Models Wrangle Your Data?. (arXiv:2205.09911v1 [cs.LG])
    Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific finetuning. As these models continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks like cleaning and integration. As a proof-of-concept, we cast three data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs on these tasks. We find that large FMs generalize and achieve SoTA performance on data cleaning and integration tasks, even though they are not trained for these data tasks. We identify specific research challenges and opportunities that these models present, including challenges with private and temporal data, and opportunities to make data driven systems more accessible to non-experts. We make our code and experiments publicly available at: https://github.com/HazyResearch/fm_data_tasks.
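    An illustrative prompt in the spirit of casting entity matching as a prompting task; the paper's exact templates live in its repository, and `ask_llm` stands in for any completion API:

        def entity_match_prompt(row_a, row_b):
            return ("Do these two records refer to the same real-world entity?\n"
                    f"Record A: {row_a}\n"
                    f"Record B: {row_b}\n"
                    "Answer Yes or No:")

        # match = ask_llm(entity_match_prompt(a, b)).strip().lower().startswith("yes")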
    Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting. (arXiv:2205.10234v1 [cs.CL])
    Inpatient violence is a common and severe problem within psychiatry. Knowing who might become violent can influence staffing levels and mitigate severity. Predictive machine learning models can assess each patient's likelihood of becoming violent based on clinical notes. Yet, while machine learning models benefit from having more data, data availability is limited as hospitals typically do not share their data for privacy preservation. Federated Learning (FL) can overcome the problem of data limitation by training models in a decentralised manner, without disclosing data between collaborators. However, although several FL approaches exist, none of these train Natural Language Processing models on clinical notes. In this work, we investigate the application of Federated Learning to clinical Natural Language Processing, applied to the task of Violence Risk Assessment by simulating a cross-institutional psychiatric setting. We train and compare four models: two local models, a federated model and a data-centralised model. Our results indicate that the federated model outperforms the local models and performs similarly to the data-centralised model. These findings suggest that Federated Learning can be used successfully in a cross-institutional setting and is a step towards new applications of Federated Learning based on clinical notes.
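    A minimal FedAvg sketch of the simulated setup, where each institution trains locally and only parameter averages are shared; the model and data handling are generic placeholders rather than the paper's clinical NLP pipeline:

        import copy
        import torch

        def fedavg_round(global_model, client_loaders, local_steps=1, lr=1e-3):
            states = []
            for loader in client_loaders:          # one loader per institution
                local = copy.deepcopy(global_model)
                opt = torch.optim.SGD(local.parameters(), lr=lr)
                for _ in range(local_steps):
                    for x, y in loader:
                        opt.zero_grad()
                        torch.nn.functional.cross_entropy(local(x), y).backward()
                        opt.step()
                states.append(local.state_dict())
            # Share only averaged parameters, never the clinical notes themselves.
            avg = {k: torch.stack([s[k].float() for s in states]).mean(0)
                   for k in states[0]}
            global_model.load_state_dict(avg)
            return global_model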
    Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations. (arXiv:2203.09590v3 [cs.CL] UPDATED)
    Within the emerging research efforts to combine structured and unstructured knowledge, many approaches incorporate factual knowledge, e.g., available in form of structured knowledge graphs (KGs), into pre-trained language models (PLMs) and then apply the knowledge-enhanced PLMs to downstream NLP tasks. However, (1) they typically only consider \textit{static} factual knowledge, whereas, e.g., knowledge graphs (KGs) also contain \textit{temporal facts} or \textit{events} indicating evolutionary relationships among entities at different timestamps. (2) PLMs cannot be directly applied to many KG tasks, such as temporal KG completion. In this paper, we focus on \textbf{e}nhancing temporal knowledge embeddings with \textbf{co}ntextualized \textbf{la}nguage representations (ECOLA). We align structured knowledge, contained in temporal knowledge graphs, with their textual descriptions extracted from news articles, and propose a novel knowledge-text prediction task to inject the abundant information from descriptions into temporal knowledge embeddings. ECOLA jointly optimizes the knowledge-text prediction objective and the temporal knowledge embeddings, which can simultaneously take full advantage of textual and knowledge information. The proposed fusion method is model-agnostic and can be combined with potentially any temporal KG model. For training ECOLA, we introduce three temporal KG datasets with aligned textual descriptions. Experimental results on the temporal knowledge graph completion task show that ECOLA outperforms state-of-the-art temporal KG models by a large margin. The proposed datasets can serve as new temporal KG benchmarks and facilitate future research on structured and unstructured knowledge integration.
    Learning a Large Neighborhood Search Algorithm for Mixed Integer Programs. (arXiv:2107.10201v3 [math.OC] UPDATED)
    Large Neighborhood Search (LNS) is a combinatorial optimization heuristic that starts with an assignment of values for the variables to be optimized, and iteratively improves it by searching a large neighborhood around the current assignment. In this paper we consider a learning-based LNS approach for mixed integer programs (MIPs). We train a Neural Diving model to represent a probability distribution over assignments, which, together with an off-the-shelf MIP solver, generates an initial assignment. Formulating the subsequent search steps as a Markov Decision Process, we train a Neural Neighborhood Selection policy to select a search neighborhood at each step, which is searched using a MIP solver to find the next assignment. The policy network is trained using imitation learning. We propose a target policy for imitation that, given enough compute resources, is guaranteed to select the neighborhood containing the optimal next assignment amongst all possible choices for the neighborhood of a specified size. Our approach matches or outperforms all the baselines on five real-world MIP datasets with large-scale instances from diverse applications, including two production applications at Google. It achieves $2\times$ to $37.8\times$ better average primal gap than the best baseline on three of the datasets at large running times.
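    A schematic of the LNS loop described above; `select_neighborhood` is the component the paper learns (Neural Neighborhood Selection), and `solve_mip`, `mip.objective`, and the dict-based assignments are placeholders for any off-the-shelf solver interface:

        def large_neighborhood_search(mip, initial, select_neighborhood,
                                      solve_mip, steps=50, k=20):
            best = initial                                # dict: variable -> value
            for _ in range(steps):
                free = select_neighborhood(mip, best, k)  # k variables to unfix
                fixed = {v: val for v, val in best.items() if v not in free}
                candidate = solve_mip(mip, fixed)         # solver explores the rest
                if candidate is not None and mip.objective(candidate) < mip.objective(best):
                    best = candidate
            return best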
    Learning Task-relevant Representations for Generalization via Characteristic Functions of Reward Sequence Distributions. (arXiv:2205.10218v1 [cs.LG])
    Generalization across different environments with the same tasks is critical for successful applications of visual reinforcement learning (RL) in real scenarios. However, visual distractions -- which are common in real scenes -- from high-dimensional observations can be hurtful to the learned representations in visual RL, thus degrading the performance of generalization. To tackle this problem, we propose a novel approach, namely Characteristic Reward Sequence Prediction (CRESP), to extract the task-relevant information by learning reward sequence distributions (RSDs), as the reward signals are task-relevant in RL and invariant to visual distractions. Specifically, to effectively capture the task-relevant information via RSDs, CRESP introduces an auxiliary task -- that is, predicting the characteristic functions of RSDs -- to learn task-relevant representations, because we can well approximate the high-dimensional distributions by leveraging the corresponding characteristic functions. Experiments demonstrate that CRESP significantly improves the performance of generalization on unseen environments, outperforming several state-of-the-art methods on DeepMind Control tasks with different visual distractions.
    Explanatory machine learning for sequential human teaching. (arXiv:2205.10250v1 [cs.AI])
    The topic of comprehensibility of machine-learned theories has recently drawn increasing attention. Inductive Logic Programming (ILP) uses logic programming to derive logic theories from small data based on abduction and induction techniques. Learned theories are represented in the form of rules as declarative descriptions of obtained knowledge. In earlier work, the authors provided the first evidence of a measurable increase in human comprehension based on machine-learned logic rules for simple classification tasks. In a later study, it was found that the presentation of machine-learned explanations to humans can produce both beneficial and harmful effects in the context of game learning. We continue our investigation of comprehensibility by examining the effects of the ordering of concept presentations on human comprehension. In this work, we examine the explanatory effects of curriculum order and the presence of machine-learned explanations for sequential problem-solving. We show that 1) there exist tasks A and B such that learning A before B leads to better human comprehension than learning B before A, and 2) there exist tasks A and B such that the presence of explanations when learning A contributes to improved human comprehension when subsequently learning B. We propose a framework for the effects of sequential teaching on comprehension based on an existing definition of comprehensibility and provide supporting evidence from data collected in human trials. Empirical results show that sequential teaching of concepts with increasing complexity a) has a beneficial effect on human comprehension, b) leads to human re-discovery of divide-and-conquer problem-solving strategies, and c) studying machine-learned explanations allows adaptations of human problem-solving strategy with better performance.
    A Review of Safe Reinforcement Learning: Methods, Theory and Applications. (arXiv:2205.10330v1 [cs.AI])
    Reinforcement learning has achieved tremendous success in many complex decision making tasks. When it comes to deploying RL in the real world, safety concerns are usually raised, leading to a growing demand for safe reinforcement learning algorithms, such as in autonomous driving and robotics scenarios. While safety control has a long history, the study of safe RL algorithms is still in the early stages. To establish a good foundation for future research in this thread, in this paper, we provide a review of safe RL from the perspectives of methods, theory and applications. Firstly, we review the progress of safe RL from five dimensions and come up with five problems that are crucial for safe RL to be deployed in real-world applications, coined as "2H3W". Secondly, we analyze the theory and algorithm progress from the perspective of answering the "2H3W" problems. Then, the sample complexity of safe RL methods is reviewed and discussed, followed by an introduction of the applications and benchmarks of safe RL algorithms. Finally, we open the discussion of the challenging problems in safe RL, hoping to inspire more future research on this thread. To advance the study of safe RL algorithms, we release a benchmark suite, an open-sourced repository containing the implementations of major safe RL algorithms, along with tutorials at the link: https://github.com/chauncygu/Safe-Reinforcement-Learning-Baselines.git.
    Automated Scoring for Reading Comprehension via In-context BERT Tuning. (arXiv:2205.09864v1 [cs.LG])
    Automated scoring of open-ended student responses has the potential to significantly reduce human grader effort. Recent advances in automated scoring often leverage textual representations based on pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, which is suitable for scenarios such as essay scoring where items can be quite different from one another. However, these approaches have two limitations: 1) they fail to leverage item linkage for scenarios such as reading comprehension where multiple items may share a reading passage; 2) they are not scalable since storing one model per item becomes difficult when models have a large number of parameters. In this paper, we report our (grand prize-winning) solution to the National Assessment of Education Progress (NAEP) automated scoring challenge for reading comprehension. Our approach, in-context BERT fine-tuning, produces a single shared scoring model for all items with a carefully-designed input structure to provide contextual information on each item. We demonstrate the effectiveness of our approach via local evaluations using the training dataset provided by the challenge. We also discuss the biases, common error types, and limitations of our approach.
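    A sketch of the shared-model input structure, pairing item context (passage and question) with the student response so one scoring model can serve all items; the exact field layout is an assumption for illustration, not the paper's design:

        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

        def build_input(passage, question, response, max_length=512):
            # Item context is prepended so a single shared model can score
            # every item, and linked items share the same passage text.
            context = f"{passage} [SEP] {question}"
            return tokenizer(context, response, truncation=True,
                             max_length=max_length, return_tensors="pt")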
    Posterior Refinement Improves Sample Efficiency in Bayesian Neural Networks. (arXiv:2205.10041v1 [cs.LG])
    Monte Carlo (MC) integration is the de facto method for approximating the predictive distribution of Bayesian neural networks (BNNs). But, even with many MC samples, Gaussian-based BNNs could still yield bad predictive performance due to the posterior approximation's error. Meanwhile, alternatives to MC integration tend to be more expensive or biased. In this work, we experimentally show that the key to good MC-approximated predictive distributions is the quality of the approximate posterior itself. However, previous methods for obtaining accurate posterior approximations are expensive and non-trivial to implement. We, therefore, propose to refine Gaussian approximate posteriors with normalizing flows. When applied to last-layer BNNs, it yields a simple \emph{post hoc} method for improving pre-existing parametric approximations. We show that the resulting posterior approximation is competitive with even the gold-standard full-batch Hamiltonian Monte Carlo.
    Algorithms for Weak Optimal Transport with an Application to Economics. (arXiv:2205.09825v1 [stat.ML])
    The theory of weak optimal transport (WOT), introduced by [Gozlan et al., 2017], generalizes the classic Monge-Kantorovich framework by allowing the transport cost between one point and the points it is matched with to be nonlinear. In the so-called barycentric version of WOT, the cost for transporting a point $x$ only depends on $x$ and on the barycenter of the points it is matched with. This aggregation property of WOT is appealing in machine learning, economics and finance. Yet algorithms to compute WOT have only been developed for the special case of quadratic barycentric WOT, or depend on neural networks with no guarantee on the computed value and matching. The main difficulty lies in the transportation constraints which are costly to project onto. In this paper, we propose to use mirror descent algorithms to solve the primal and dual versions of the WOT problem. We also apply our algorithms to the variant of WOT introduced by [Chon\'e et al., 2022] where mass is distributed from one space to another through unnormalized kernels (WOTUK). We empirically compare the solutions of WOT and WOTUK with classical OT. We illustrate our numerical methods on the economic framework of [Chon\'e and Kramarz, 2021], namely the matching between workers and firms on labor markets.
    Speeding up PCA with priming. (arXiv:2109.03709v3 [cs.LG] UPDATED)
    We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of the priming is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon the priming algorithm under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we prime pPCA by several approximate algorithms and report an average speedup by a factor of 7.2 over Oja's rule, and a factor of 10.5 over EigenGame.
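    A compact sketch of the two-step scheme, here priming with a randomized range finder purely for illustration (the paper primes with approximate-PCA methods such as Oja's rule or EigenGame):

        import numpy as np

        def primed_pca(X, k, prime_dim=None):
            n, d = X.shape
            prime_dim = prime_dim or 2 * k
            Xc = X - X.mean(0)
            # Priming: a rough basis for a small subspace (randomized range
            # finder here; any approximate-PCA output would serve).
            Q, _ = np.linalg.qr(Xc.T @ np.random.randn(n, prime_dim))
            # Exact PCA restricted to the primed subspace: cheap, since
            # prime_dim is tiny compared to d.
            S = Q.T @ (Xc.T @ Xc) @ Q
            w, V = np.linalg.eigh(S)
            top = np.argsort(w)[::-1][:k]
            return Q @ V[:, top]               # top-k principal directions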
    Track Boosting and Synthetic Data Aided Drone Detection. (arXiv:2111.12389v5 [cs.CV] UPDATED)
    This paper presents the first-place winning solution of the Drone vs. Bird Challenge, organized at AVSS 2021. As the usage of drones increases with lowered costs and improved drone technology, drone detection emerges as a vital object detection task. However, detecting distant drones under unfavorable conditions, namely weak contrast, long range, and low visibility, requires effective algorithms. Our method approaches the drone detection problem by fine-tuning a YOLOv5 model on real and synthetically generated data, and by using a Kalman-based object tracker to boost detection confidence. Our results indicate that augmenting the real data with an optimal subset of synthetic data can increase the performance. Moreover, temporal information gathered by object tracking methods can increase performance further.
    Robust Multi-Task Learning and Online Refinement for Spacecraft Pose Estimation across Domain Gap. (arXiv:2203.04275v3 [cs.CV] UPDATED)
    This work presents Spacecraft Pose Network v2 (SPNv2), a Convolutional Neural Network (CNN) for pose estimation of noncooperative spacecraft across domain gap. SPNv2 is a multi-scale, multi-task CNN which consists of a shared multi-scale feature encoder and multiple prediction heads that perform different tasks on a shared feature output. These tasks are all related to detection and pose estimation of a target spacecraft from an image, such as prediction of pre-defined satellite keypoints, direct pose regression, and binary segmentation of the satellite foreground. It is shown that by jointly training on different yet related tasks with extensive data augmentations on synthetic images only, the shared encoder learns features that are common across image domains that have fundamentally different visual characteristics compared to synthetic images. This work also introduces Online Domain Refinement (ODR) which refines the parameters of the normalization layers of SPNv2 on the target domain images online at deployment. Specifically, ODR performs self-supervised entropy minimization of the predicted satellite foreground, thereby improving the CNN's performance on the target domain images without their pose labels and with minimal computational efforts. The GitHub repository for SPNv2 is available at \url{https://github.com/tpark94/spnv2}.
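    A hedged sketch of the ODR update: self-supervised entropy minimization of the predicted foreground mask, with the optimizer restricted to normalization-layer parameters (loss details are simplified relative to SPNv2):

        import torch

        def odr_step(model, image, optimizer):
            p = torch.sigmoid(model(image))            # predicted foreground mask
            entropy = -(p * torch.log(p + 1e-8)
                        + (1 - p) * torch.log(1 - p + 1e-8)).mean()
            optimizer.zero_grad()
            entropy.backward()
            optimizer.step()                           # updates norm layers only
            return entropy.item()

        # Restrict the optimizer to normalization parameters, e.g.:
        # norm_params = [p for m in model.modules()
        #                if isinstance(m, (torch.nn.BatchNorm2d, torch.nn.GroupNorm))
        #                for p in m.parameters()]
        # optimizer = torch.optim.Adam(norm_params, lr=1e-4)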
    Content-Context Factorized Representations for Automated Speech Recognition. (arXiv:2205.09872v1 [eess.AS])
    Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may consist not only of information about the spoken language content, but also may contain information about unnecessary contexts such as background noise and sounds or speaker identity, accent, or protected attributes. Such information can directly harm generalization performance, by introducing spurious correlations between the spoken words and the context in which such words were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.
    Delator: Automatic Detection of Money Laundering Evidence on Transaction Graphs via Neural Networks. (arXiv:2205.10293v1 [cs.LG])
    Money laundering is one of the most relevant criminal activities today, due to its potential to cause massive financial losses to governments, banks, etc. We propose DELATOR, a new CAAT (computer-assisted audit technology) to detect money laundering activities based on neural network models that encode bank transfers as a large-scale temporal graph. In collaboration with a Brazilian bank, we design and apply an evaluation strategy to quantify DELATOR's performance on historic data comprising millions of clients. DELATOR outperforms an off-the-shelf solution from Amazon AWS by 18.9% with respect to AUC. We conducted real experiments that led to the discovery of 8 new suspicious cases among 100 analyzed, which would have been reported to the authorities under the current criteria.
    What's the Harm? Sharp Bounds on the Fraction Negatively Affected by Treatment. (arXiv:2205.10327v1 [stat.ME])
    The fundamental problem of causal inference -- that we never observe counterfactuals -- prevents us from identifying how many might be negatively affected by a proposed intervention. If, in an A/B test, half of users click (or buy, or watch, or renew, etc.), whether exposed to the standard experience A or a new one B, hypothetically it could be because the change affects no one, because the change positively affects half the user population to go from no-click to click while negatively affecting the other half, or something in between. While unknowable, this impact is clearly of material importance to the decision to implement a change or not, whether due to fairness, long-term, systemic, or operational considerations. We therefore derive the tightest-possible (i.e., sharp) bounds on the fraction negatively affected (and other related estimands) given data with only factual observations, whether experimental or observational. Naturally, the more we can stratify individuals by observable covariates, the tighter the sharp bounds. Since these bounds involve unknown functions that must be learned from data, we develop a robust inference algorithm that is efficient almost regardless of how and how fast these functions are learned, remains consistent when some are mislearned, and still gives valid conservative bounds when most are mislearned. Our methodology altogether therefore strongly supports credible conclusions: it avoids spuriously point-identifying this unknowable impact, focusing on the best bounds instead, and it permits exceedingly robust inference on these. We demonstrate our method in simulation studies and in a case study of career counseling for the unemployed.
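    For intuition, in the covariate-free binary-outcome case the classic Fréchet-Hoeffding argument already bounds the fraction negatively affected from the two observable marginals alone; the paper's sharp bounds tighten this further by stratifying on covariates: $$\max\{0,\, p_0 - p_1\} \;\le\; \Pr[Y(0)=1,\, Y(1)=0] \;\le\; \min\{p_0,\; 1 - p_1\},$$ where $p_a = \Pr[Y(a)=1]$ denotes the click rate under arm $a$.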
    Multi-Graph Convolutional-Recurrent Neural Network (MGC-RNN) for Short-Term Forecasting of Transit Passenger Flow. (arXiv:2107.13226v2 [cs.LG] UPDATED)
    Short-term forecasting of passenger flow is critical for transit management and crowd regulation. Spatial dependencies, temporal dependencies, inter-station correlations driven by other latent factors, and exogenous factors bring challenges to the short-term forecasting of passenger flow in urban rail transit networks. An innovative deep learning approach, the Multi-Graph Convolutional-Recurrent Neural Network (MGC-RNN), is proposed to forecast passenger flow in urban rail transit systems while incorporating these complex factors. We propose to use multiple graphs to encode the spatial and other heterogeneous inter-station correlations. The temporal dynamics of the inter-station correlations are also modeled via the proposed multi-graph convolutional-recurrent neural network structure. Inflow and outflow of all stations can be collectively predicted multiple time steps ahead via a sequence-to-sequence (seq2seq) architecture. The proposed method is applied to the short-term forecasting of passenger flow in the Shenzhen Metro, China. The experimental results show that MGC-RNN outperforms the benchmark algorithms in terms of forecasting accuracy. Besides, it is found that the inter-station correlations driven by network distance, network structure, and recent flow patterns are significant factors for passenger flow forecasting. Moreover, the LSTM-encoder-decoder architecture can capture the temporal dependencies well. In general, the proposed framework could provide multiple views of passenger flow dynamics for fine-grained prediction and exhibits potential for multi-source heterogeneous data fusion in spatiotemporal forecasting tasks.
    Adaptor: Objective-Centric Adaptation Framework for Language Models. (arXiv:2203.03989v2 [cs.CL] UPDATED)
    Progress in natural language processing research is catalyzed by the possibilities given by widespread software frameworks. This paper introduces the Adaptor library, which transposes the traditional model-centric approach composed of pre-training + fine-tuning steps into an objective-centric approach, composing the training process from applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation in multitask training, custom objectives development, dynamic training curricula, or domain adaptation. Adaptor aims to ease the reproducibility of these research directions in practice. Finally, we demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.
    Capturing cross-session neural population variability through self-supervised identification of consistent neuron ensembles. (arXiv:2205.09829v1 [q-bio.NC])
    Decoding stimuli or behaviour from recorded neural activity is a common approach to interrogate brain function in research, and an essential part of brain-computer and brain-machine interfaces. Reliable decoding even from small neural populations is possible because high dimensional neural population activity typically occupies low dimensional manifolds that are discoverable with suitable latent variable models. Over time, however, drifts in the activity of individual neurons and instabilities in neural recording devices can be substantial, making stable decoding over days and weeks impractical. While this drift cannot be predicted on an individual neuron level, population-level variations over consecutive recording sessions, such as differing sets of neurons and varying permutations of consistent neurons in the recorded data, may be learnable when the underlying manifold is stable over time. Classifying neurons as consistent or unfamiliar across sessions, and accounting for deviations in the order of consistent neurons across recording sessions, may then maintain decoding performance. In this work we show that self-supervised training of a deep neural network can be used to compensate for this inter-session variability. As a result, a sequential autoencoding model can maintain state-of-the-art behaviour decoding performance for completely unseen recording sessions several days into the future. Our approach only requires a single recording session for training the model, and is a step towards reliable, recalibration-free brain-computer interfaces.
    Evolving SimGANs to Improve Abnormal Electrocardiogram Classification. (arXiv:2205.10116v1 [cs.NE])
    Machine learning models are used in a wide variety of domains. However, machine learning methods often require a large amount of data in order to be successful. This is especially troublesome in domains where collecting real-world data is difficult and/or expensive. Data simulators do exist for many of these domains, but they do not sufficiently reflect real-world data due to factors such as a lack of real-world noise. Recently, generative adversarial networks (GANs) have been modified to refine simulated image data into data that better fits the real-world distribution, using the SimGAN method. While evolutionary computing has been used for GAN evolution, there are currently no frameworks that can evolve a SimGAN. In this paper we (1) extend the SimGAN method to refine one-dimensional data, (2) modify Easy Cartesian Genetic Programming (ezCGP), an evolutionary computing framework, to create SimGANs that more accurately refine simulated data, and (3) create new feature-based quantitative metrics to evaluate refined data. We also use our framework to augment an electrocardiogram (ECG) dataset, a domain that suffers from the issues previously mentioned. In particular, while healthy ECGs can be simulated, there are no current simulators of abnormal ECGs. We show that by using an evolved SimGAN to refine simulated healthy ECG data to mimic real-world abnormal ECGs, we can improve the accuracy of abnormal ECG classifiers.
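    A sketch of the SimGAN refiner objective carried over to one-dimensional signals: an adversarial term pushes refined ECGs toward the real distribution while a self-regularization term keeps them close to the simulated input (network definitions and the evolutionary architecture search are omitted):

        import torch

        def refiner_loss(refiner, discriminator, sim_batch, self_reg_weight=0.5):
            refined = refiner(sim_batch)                  # (batch, length) signals
            d_out = discriminator(refined)                # logit: real vs refined
            adv = torch.nn.functional.binary_cross_entropy_with_logits(
                d_out, torch.ones_like(d_out))            # fool the discriminator
            self_reg = (refined - sim_batch).abs().mean() # stay near the simulation
            return adv + self_reg_weight * self_reg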
    LeNSE: Learning To Navigate Subgraph Embeddings for Large-Scale Combinatorial Optimisation. (arXiv:2205.10106v1 [cs.LG])
    Combinatorial optimisation problems arise in several application domains and are often formulated in terms of graphs. Many of these problems are NP-hard, but exact solutions are not always needed. Several heuristics have been developed to provide near-optimal solutions; however, they do not typically scale well with the size of the graph. We propose a low-complexity approach for identifying a (possibly much smaller) subgraph of the original graph where the heuristics can be run in reasonable time and with a high likelihood of finding a global near-optimal solution. The core component of our approach is LeNSE, a reinforcement learning algorithm that learns how to navigate the space of possible subgraphs using a Euclidean subgraph embedding as its map. To solve CO problems, LeNSE is provided with a discriminative embedding trained using any existing heuristics on only a small portion of the original graph. When tested on three problems (vertex cover, max-cut and influence maximisation) using real graphs with up to $10$ million edges, LeNSE identifies small subgraphs yielding solutions comparable to those found by running the heuristics on the entire graph, but at a fraction of the total run time.
    A Unified Approach to Synchronization Problems over Subgroups of the Orthogonal Group. (arXiv:2009.07514v2 [math.OC] UPDATED)
    The problem of synchronization over a group $\mathcal{G}$ aims to estimate a collection of group elements $G^*_1, \dots, G^*_n \in \mathcal{G}$ based on noisy observations of a subset of all pairwise ratios of the form $G^*_i {G^*_j}^{-1}$. Such a problem has gained much attention recently and finds many applications across a wide range of scientific and engineering areas. In this paper, we consider the class of synchronization problems in which the group is a closed subgroup of the orthogonal group. This class covers many group synchronization problems that arise in practice. Our contribution is fivefold. First, we propose a unified approach for solving this class of group synchronization problems, which consists of a suitable initialization step and an iterative refinement step based on the generalized power method, and show that it enjoys a strong theoretical guarantee on the estimation error under certain assumptions on the group, measurement graph, noise, and initialization. Second, we formulate two geometric conditions that are required by our approach and show that they hold for various practically relevant subgroups of the orthogonal group. The conditions are closely related to the error-bound geometry of the subgroup -- an important notion in optimization. Third, we verify the assumptions on the measurement graph and noise for standard random graph and random matrix models. Fourth, based on the classic notion of metric entropy, we develop and analyze a novel spectral-type estimator. Finally, we show via extensive numerical experiments that our proposed non-convex approach outperforms existing approaches in terms of computational speed, scalability, and/or estimation error.
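    A sketch of one generalized-power-method iteration for the orthogonal group: multiply the stacked estimates by the measurement matrix, then project each $d \times d$ block back onto the group via its polar factor (initialization and the paper's theoretical guarantees are not reproduced here):

        import numpy as np

        def polar_factor(M):
            U, _, Vt = np.linalg.svd(M)
            return U @ Vt                      # nearest orthogonal matrix

        def generalized_power_method(C, n, d, iters=100):
            X = np.vstack([np.eye(d)] * n)     # stacked initial estimates
            for _ in range(iters):
                Y = C @ X                      # C holds noisy pairwise ratios (nd x nd)
                X = np.vstack([polar_factor(Y[i * d:(i + 1) * d])
                               for i in range(n)])
            return X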
    Global and Individualized Community Detection in Inhomogeneous Multilayer Networks. (arXiv:2012.00933v3 [math.ST] UPDATED)
    In network applications, it has become increasingly common to obtain datasets in the form of multiple networks observed on the same set of subjects, where each network is obtained in a related but different experiment condition or application scenario. Such datasets can be modeled by multilayer networks where each layer is a separate network itself while different layers are associated and share some common information. The present paper studies community detection in a stylized yet informative inhomogeneous multilayer network model. In our model, layers are generated by different stochastic block models, the community structures of which are (random) perturbations of a common global structure while the connecting probabilities in different layers are not related. Focusing on the symmetric two block case, we establish minimax rates for both global estimation of the common structure and individualized estimation of layer-wise community structures. Both minimax rates have sharp exponents. In addition, we provide an efficient algorithm that is simultaneously asymptotic minimax optimal for both estimation tasks under mild conditions. The optimal rates depend on the parity of the number of most informative layers, a phenomenon that is caused by inhomogeneity across layers. The method is extended to handle multiple and potentially asymmetric community cases. We demonstrate its effectiveness on both simulated examples and a real multi-modal single-cell dataset.
    Learning List-wise Representation in Reinforcement Learning for Ads Allocation with Multiple Auxiliary Tasks. (arXiv:2204.00888v2 [cs.LG] UPDATED)
    With the recent prevalence of reinforcement learning (RL), there has been tremendous interest in utilizing RL for ads allocation in recommendation platforms (e.g., e-commerce and news feed sites). To achieve better allocation, the input of recent RL-based ads allocation methods is upgraded from point-wise single items to list-wise item arrangements. However, this also results in a high-dimensional space of state-action pairs, making it difficult to learn list-wise representations with good generalization ability. This further hinders the exploration of RL agents and causes poor sample efficiency. To address this problem, we propose a novel RL-based approach for ads allocation which learns better list-wise representations by leveraging task-specific signals on the Meituan food delivery platform. Specifically, we propose three different auxiliary tasks based on reconstruction, prediction, and contrastive learning respectively, according to prior domain knowledge on ads allocation. We conduct extensive experiments on the Meituan food delivery platform to evaluate the effectiveness of the proposed auxiliary tasks. Both offline and online experimental results show that the proposed method can learn better list-wise representations and achieve higher revenue for the platform compared to the state-of-the-art baselines.
    Kernel Normalized Convolutional Networks. (arXiv:2205.10089v1 [cs.LG])
    Existing deep convolutional neural network (CNN) architectures frequently rely upon batch normalization (BatchNorm) to effectively train the model. BatchNorm significantly improves model performance, but performs poorly with smaller batch sizes. To address this limitation, we propose kernel normalization and kernel normalized convolutional layers, and incorporate them into kernel normalized convolutional networks (KNConvNets) as the main building blocks. We implement KNConvNets corresponding to the state-of-the-art CNNs such as ResNet and DenseNet while forgoing BatchNorm layers. Through extensive experiments, we illustrate that KNConvNets consistently outperform their batch, group, and layer normalized counterparts in terms of both accuracy and convergence rate while maintaining competitive computational efficiency.
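    The paper defines kernel normalization precisely; the sketch below is only one plausible reading, in which each kernel-sized input window is standardized by its own mean and variance so that no batch statistics are needed. The class name, the use of average pooling for window statistics, and the odd kernel size are all assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class KernelNormConv2d(nn.Module):
            # Hypothetical sketch: standardize each kernel-sized input window by its
            # own statistics before convolving, making the layer batch-size independent.
            def __init__(self, in_ch, out_ch, k, eps=1e-5):
                super().__init__()
                assert k % 2 == 1, "odd kernel size assumed for same-size output"
                self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
                self.k, self.eps = k, eps

            def forward(self, x):
                # Per-pixel window statistics computed with average pooling.
                mu = F.avg_pool2d(x, self.k, stride=1, padding=self.k // 2,
                                  count_include_pad=False)
                var = F.avg_pool2d(x * x, self.k, stride=1, padding=self.k // 2,
                                   count_include_pad=False) - mu * mu
                x_hat = (x - mu) / torch.sqrt(var.clamp_min(0) + self.eps)
                return self.conv(x_hat)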
    Edge Rewiring Goes Neural: Boosting Network Resilience without Rich Features. (arXiv:2110.09035v2 [cs.LG] UPDATED)
    Improving the resilience of a network is a fundamental problem in network science, which protects the underlying system from natural disasters and malicious attacks. This is traditionally achieved via successive degree-preserving edge rewiring operations, with the major limitation of being transductive. Inductively solving graph-related tasks with sequential actions is accomplished by adopting graph neural networks (GNNs) coupled with reinforcement learning under the scenario with rich graph features. However, such frameworks cannot be directly applied to resilience tasks where only pure topological structure is available. In this case, GNNs can barely learn useful information, resulting in prohibitive difficulty in taking actions for successively rewiring edges under a reinforcement learning context. In this paper, we study in depth the reasons why typical GNNs cause such failure. Based on this investigation, we propose \textbf{ResiNet}, the first end-to-end trainable inductive framework to discover \textbf{Resi}lient \textbf{Net}work topologies while balancing network utility. To this end, we reformulate resilience optimization as an MDP equipped with an edge rewiring action space, and propose a pure topology-oriented variant of GNN called \textbf{Fi}lt\textbf{r}ation \textbf{e}nhanced \textbf{G}raph \textbf{N}eural \textbf{N}etwork (\textbf{FireGNN}), which can learn from graphs without rich features. Extensive experiments demonstrate that ResiNet achieves a near-optimal resilience gain on various graphs while balancing the utility, and outperforms existing approaches by a large margin.
    Proposition-Level Clustering for Multi-Document Summarization. (arXiv:2112.08770v2 [cs.CL] UPDATED)
    Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means for coping with considerable information repetition. Particularly, clusters were leveraged to indicate information saliency as well as to avoid redundancy. Such prior methods focused on clustering sentences, even though closely related sentences usually contain also non-aligned parts. In this work, we revisit the clustering approach, grouping together sub-sentential propositions, aiming at more precise information alignment. Specifically, our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster via text fusion. Our summarization method improves over the previous state-of-the-art MDS method in the DUC 2004 and TAC 2011 datasets, both in automatic ROUGE scores and human preference.
    Time Series Anomaly Detection via Reinforcement Learning-Based Model Selection. (arXiv:2205.09884v1 [cs.LG])
    Time series anomaly detection is of critical importance for the reliable and efficient operation of real-world systems. Many anomaly detection models have been developed throughout the years based on various assumptions regarding anomaly characteristics. However, due to the complex nature of real-world data, different anomalies within a time series usually have diverse profiles supporting different anomaly assumptions, making it difficult to find a single anomaly detector that can consistently beat all other models. In this work, to harness the benefits of different base models, we assume that a pool of anomaly detection models is accessible and propose to utilize reinforcement learning to dynamically select a candidate model from these base models. Experiments on real-world data demonstrate that the proposed strategy outperforms all baseline models in terms of overall performance.
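    As a rough illustration of selecting a detector from a pool with reinforcement learning, here is a heavily simplified sketch; the coarse window-volatility state, the epsilon-greedy tabular update, and the label-based reward are placeholder assumptions rather than the proposed method.

        import numpy as np

        def select_detectors(windows, labels, detectors, eps=0.1, alpha=0.5, seed=0):
            # detectors: list of callables mapping a window to a 0/1 anomaly decision.
            # State: coarsely discretized volatility of the window (a stand-in feature).
            n_states, n_arms = 10, len(detectors)
            Q = np.zeros((n_states, n_arms))
            rng = np.random.default_rng(seed)
            decisions = []
            for w, y in zip(windows, labels):
                s = min(int(np.std(w) * n_states), n_states - 1)
                a = rng.integers(n_arms) if rng.random() < eps else int(Q[s].argmax())
                pred = detectors[a](w)
                r = 1.0 if pred == y else -1.0     # reward from detection correctness
                Q[s, a] += alpha * (r - Q[s, a])   # contextual-bandit style update
                decisions.append(pred)
            return decisions, Q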
    Seeking entropy: complex behavior from intrinsic motivation to occupy action-state path space. (arXiv:2205.10316v1 [cs.AI])
    Intrinsic motivation generates behaviors that do not necessarily lead to immediate reward, but help exploration and learning. Here we show that agents having the sole goal of maximizing occupancy of future actions and states, that is, moving and exploring over the long term, are capable of complex behavior without any reference to external rewards. We find that action-state path entropy is the only measure consistent with additivity and other intuitive properties of expected future action-state path occupancy. We provide analytical expressions that relate the optimal policy to the optimal state-value function, from which we prove uniqueness of the solution of the associated Bellman equation and convergence of our algorithm to the optimal state-value function. Using discrete and continuous state tasks, we show that `dancing', hide-and-seek and a basic form of altruistic behavior naturally result from entropy seeking without external rewards. Intrinsically motivated agents can objectively determine what states constitute rewards, exploiting them to ultimately maximize action-state path entropy.
    Concurrent Policy Blending and System Identification for Generalized Assistive Control. (arXiv:2205.09836v1 [cs.RO])
    In this work, we address the problem of solving complex collaborative robotic tasks subject to multiple varying parameters. Our approach combines simultaneous policy blending with system identification to create generalized policies that are robust to changes in system parameters. We employ a blending network whose state space relies solely on parameter estimates from a system identification technique. As a result, this blending network learns how to handle parameter changes instead of trying to learn how to solve the task for a generalized parameter set simultaneously. We demonstrate our scheme's ability on a collaborative robot and human itching task in which the human has motor impairments. We then showcase our approach's efficiency with a variety of system identification techniques when compared to standard domain randomization.
    Converting Artificial Neural Networks to Spiking Neural Networks via Parameter Calibration. (arXiv:2205.10121v1 [cs.NE])
    Spiking Neural Networks (SNNs), originating from neural behavior in biology, have been recognized as one of the next-generation neural networks. Conventionally, SNNs can be obtained by converting pre-trained Artificial Neural Networks (ANNs), replacing the non-linear activation with spiking neurons without changing the parameters. In this work, we argue that simply copying and pasting the weights of an ANN to an SNN inevitably results in activation mismatch, especially for ANNs that are trained with batch normalization (BN) layers. To tackle the activation mismatch issue, we first provide a theoretical analysis by decomposing the local conversion error into clipping error and flooring error, and then quantitatively measure how this error propagates throughout the layers using second-order analysis. Motivated by the theoretical results, we propose a set of layer-wise parameter calibration algorithms, which adjust the parameters to minimize the activation mismatch. Extensive experiments for the proposed algorithms are performed on modern architectures and large-scale tasks including ImageNet classification and MS COCO detection. We demonstrate that our method can handle the SNN conversion with batch normalization layers and effectively preserve high accuracy even in 32 time steps. For example, our calibration algorithms can increase accuracy by up to 65% when converting VGG-16 with BN layers.
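    A standard first step in such conversions is folding the BN layer into the preceding convolution so the SNN sees a single affine layer. The sketch below shows only that folding step (assuming a plain Conv2d with groups=1 and dilation=1), not the calibration algorithms themselves.

        import torch

        def fold_batchnorm(conv, bn):
            # Fold BatchNorm2d into the preceding Conv2d:
            # w' = w * g / sqrt(v + eps),  b' = (b - m) * g / sqrt(v + eps) + beta.
            g = bn.weight / torch.sqrt(bn.running_var + bn.eps)
            w = conv.weight * g.reshape(-1, 1, 1, 1)
            b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
            b = (b - bn.running_mean) * g + bn.bias
            fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels,
                                    conv.kernel_size, conv.stride, conv.padding,
                                    bias=True)
            fused.weight.data.copy_(w)
            fused.bias.data.copy_(b)
            return fused  # threshold/weight calibration would start from this layer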
    Personalized Federated Learning with Adaptive Batchnorm for Healthcare. (arXiv:2112.00734v3 [cs.LG] UPDATED)
    There is a growing interest in applying machine learning techniques to healthcare. Recently, federated learning (FL) is gaining popularity since it allows researchers to train powerful models without compromising data privacy and security. However, the performance of existing FL approaches often deteriorates when encountering non-iid situations where there exist distribution gaps among clients, and few previous efforts focus on personalization in healthcare. In this article, we propose FedAP to tackle domain shifts and then obtain personalized models for local clients. FedAP learns the similarity between clients based on the statistics of the batch normalization layers while preserving the specificity of each client with different local batch normalization. Comprehensive experiments on five healthcare benchmarks demonstrate that FedAP achieves better accuracy compared to state-of-the-art methods (e.g., 10% accuracy improvement for PAMAP2) with faster convergence speed.
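    The following sketch illustrates the general idea of measuring client similarity from BN statistics. The particular distance (a 2-Wasserstein-style distance between the diagonal Gaussians implied by each client's BN means and variances) and the row normalization are assumptions, not necessarily FedAP's exact formulas.

        import numpy as np

        def bn_client_similarity(clients_stats):
            # clients_stats: list of (mean, var) arrays taken from one BN layer on
            # each client's local data. Returns row-normalized similarity weights.
            n = len(clients_stats)
            S = np.zeros((n, n))
            for i in range(n):
                for j in range(n):
                    mi, vi = clients_stats[i]
                    mj, vj = clients_stats[j]
                    # 2-Wasserstein-style distance between the diagonal Gaussians
                    # implied by the BN statistics (an assumption, see above).
                    d = np.sum((mi - mj) ** 2 + (np.sqrt(vi) - np.sqrt(vj)) ** 2)
                    S[i, j] = 1.0 / (1.0 + d)
            return S / S.sum(axis=1, keepdims=True)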
    Lossless Acceleration for Seq2seq Generation with Aggressive Decoding. (arXiv:2205.10350v1 [cs.CL])
    We study lossless acceleration for seq2seq generation with a novel decoding algorithm -- Aggressive Decoding. Unlike previous efforts (e.g., non-autoregressive decoding) that speed up seq2seq generation at the cost of quality loss, our approach aims to yield generation identical to (or better than) autoregressive decoding, but with a significant speedup, achieved by an innovative cooperation of aggressive decoding and verification that are both efficient due to parallel computing. We propose two Aggressive Decoding paradigms for two kinds of seq2seq tasks: 1) For seq2seq tasks whose inputs and outputs are highly similar (e.g., Grammatical Error Correction), we propose Input-guided Aggressive Decoding (IAD) that aggressively copies from the input sentence as drafted decoded tokens to verify in parallel; 2) For other general seq2seq tasks (e.g., Machine Translation), we propose Generalized Aggressive Decoding (GAD) that first employs an additional non-autoregressive decoding model for aggressive decoding and then verifies in parallel in an autoregressive manner. We test Aggressive Decoding on the most popular 6-layer Transformer model on GPU in multiple seq2seq tasks: 1) For IAD, we show that it can introduce a 7x-9x speedup for the Transformer in Grammatical Error Correction and Text Simplification tasks with results identical to greedy decoding; 2) For GAD, we observe a 3x-5x speedup with identical or even better quality in two important seq2seq tasks: Machine Translation and Abstractive Summarization. Moreover, Aggressive Decoding can benefit even more from stronger computing devices that are better at parallel computing. Given the lossless quality as well as the significant and promising speedup, we believe Aggressive Decoding may potentially evolve into a de facto standard for efficient and lossless seq2seq generation in the near future.
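    The IAD loop can be sketched as a draft-and-verify procedure: copy the remainder of the input as the draft continuation, run one parallel forward pass, and keep the longest prefix on which greedy decoding agrees with the draft. The model interface below (model(src, tgt) -> per-position logits) is hypothetical.

        import torch

        def input_guided_aggressive_decoding(model, src_ids, bos_id, eos_id, max_len=128):
            # Hypothetical interface: model(src_ids, tgt_ids) -> (tgt_len, vocab) logits,
            # where logits[i] scores the token at position i + 1 given tgt_ids[: i + 1].
            out = [bos_id]
            while len(out) < max_len:
                draft = out + src_ids[len(out) - 1:]          # copy rest of input as draft
                logits = model(src_ids, torch.tensor(draft))  # one parallel verification pass
                greedy = logits.argmax(-1).tolist()           # greedy choice at each position
                k = len(out) - 1
                while k < len(draft) - 1 and greedy[k] == draft[k + 1]:
                    k += 1                                    # accept agreeing draft tokens
                out = draft[:k + 1] + [greedy[k]]             # at disagreement, take greedy
                if out[-1] == eos_id:
                    break
            return out

    When input and output overlap heavily, most draft tokens are accepted per pass, so each parallel forward commits many tokens at once while still producing exactly the greedy-decoding result; this is where the speedup comes from.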
    Anomaly Detection for Multivariate Time Series on Large-scale Fluid Handling Plant Using Two-stage Autoencoder. (arXiv:2205.09924v1 [cs.LG])
    This paper focuses on anomaly detection for multivariate time series data in large-scale fluid handling plants with dynamic components, such as power generation, water treatment, and chemical plants, where signals from various physical phenomena are observed simultaneously. In these plants, the need for anomaly detection techniques is increasing in order to reduce the cost of operation and maintenance, in view of a decline in the number of skilled engineers and a shortage of manpower. However, considering the complex behavior of high-dimensional signals and the demand for interpretability, such techniques remain a major challenge. We introduce the Two-Stage AutoEncoder (TSAE), an anomaly detection method suitable for such plants. This is a simple autoencoder architecture that makes anomaly detection more interpretable and more accurate. Based on the premise that plant signals can be separated into two behaviors with almost no correlation to each other, the signals are separated into long-term and short-term components in a stepwise manner, and the two components are trained independently to improve the inference capability for normal signals. Through experiments on two publicly available datasets of water treatment systems, we have confirmed the high detection performance, the validity of the premise, and that the model behaves as intended, i.e., the technical effectiveness of TSAE.
    SADAM: Stochastic Adam, A Stochastic Operator for First-Order Gradient-based Optimizer. (arXiv:2205.10247v1 [cs.LG])
    In this work, to help efficiently escape stationary and saddle points, we propose, analyze, and generalize a stochastic strategy performed as an operator for a first-order gradient descent algorithm, in order to increase the target accuracy and reduce time consumption. Unlike existing algorithms, the proposed stochastic strategy does not require any batches or sampling techniques, enabling efficient implementation and maintaining the initial first-order optimizer's convergence rate, while providing an incomparable improvement in target accuracy when optimizing the target functions. In short, the proposed strategy is generalized, applied to Adam, and validated via the decomposition of biomedical signals using Deep Matrix Fitting and four other peer optimizers. The validation results show that the proposed stochastic strategy can be easily generalized to first-order optimizers and efficiently improves the target accuracy.
    Cross DQN: Cross Deep Q Network for Ads Allocation in Feed. (arXiv:2109.04353v4 [cs.LG] UPDATED)
    E-commerce platforms usually display a mixed list of ads and organic items in feed. One key problem is to allocate the limited slots in the feed to maximize the overall revenue as well as improve user experience, which requires a good model of user preference. Instead of modeling the influence of individual items on user behaviors, the arrangement signal models the influence of the arrangement of items and may lead to a better allocation strategy. However, most previous strategies fail to model such a signal and therefore result in suboptimal performance. In addition, the percentage of ads exposed (PAE) is an important indicator in ads allocation. Excessive PAE hurts user experience while too low a PAE reduces platform revenue. Therefore, how to constrain the PAE within a certain range while keeping personalized recommendation under the PAE constraint is a challenge. In this paper, we propose Cross Deep Q Network (Cross DQN) to extract the crucial arrangement signal by crossing the embeddings of different items and modeling the crossed sequence by multi-channel attention. Besides, we propose an auxiliary loss for a batch-level constraint on PAE to tackle the above-mentioned challenge. Our model results in higher revenue and better user experience than state-of-the-art baselines in offline experiments. Moreover, our model demonstrates a significant improvement in the online A/B test and has been fully deployed on Meituan feed to serve more than 300 million customers.
    Learning Interface Conditions in Domain Decomposition Solvers. (arXiv:2205.09833v1 [cs.LG])
    Domain decomposition methods are widely used and effective in the approximation of solutions to partial differential equations. Yet the optimal construction of these methods requires tedious analysis and is often available only in simplified, structured-grid settings, limiting their use for more complex problems. In this work, we generalize optimized Schwarz domain decomposition methods to unstructured-grid problems, using Graph Convolutional Neural Networks (GCNNs) and unsupervised learning to learn optimal modifications at subdomain interfaces. A key ingredient in our approach is an improved loss function, enabling effective training on relatively small problems, but robust performance on arbitrarily large problems, with computational cost linear in problem size. The performance of the learned linear solvers is compared with both classical and optimized domain decomposition algorithms, for both structured- and unstructured-grid problems.
    On Jointly Optimizing Partial Offloading and SFC Mapping: A Cooperative Dual-agent Deep Reinforcement Learning Approach. (arXiv:2205.09925v1 [cs.AI])
    Multi-access edge computing (MEC) and network function virtualization (NFV) are promising technologies to support emerging IoT applications, especially computation-intensive ones. In an NFV-enabled MEC environment, a service function chain (SFC), i.e., a set of ordered virtual network functions (VNFs), can be mapped onto MEC servers. Mobile devices (MDs) can offload computation-intensive applications, which can be represented by SFCs, fully or partially to MEC servers for remote execution. This paper studies the partial offloading and SFC mapping joint optimization (POSMJO) problem in an NFV-enabled MEC system, where an incoming task can be partitioned into two parts, one for local execution and the other for remote execution. The objective is to minimize the long-term average cost, a combination of execution delay, the MD's energy consumption, and the usage charge for edge computing. This problem consists of two closely related decision-making steps, namely task partition and VNF placement, and is highly complex and quite challenging. To address this, we propose a cooperative dual-agent deep reinforcement learning (CDADRL) algorithm, where we design a framework enabling interaction between two agents. Simulation results show that the proposed algorithm outperforms three combinations of deep reinforcement learning algorithms in terms of cumulative and average episodic rewards, and it outperforms a number of baseline algorithms with respect to execution delay, energy consumption, and usage charge.
    Almost exact recovery in noisy semi-supervised learning. (arXiv:2007.14717v3 [cs.LG] UPDATED)
    Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the Maximum A Posteriori (MAP) estimator for clustering a Degree Corrected Stochastic Block Model (DC-SBM) when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.
    Long-Range Transformers for Dynamic Spatiotemporal Forecasting. (arXiv:2109.12218v2 [cs.LG] UPDATED)
    Multivariate Time Series Forecasting focuses on the prediction of future values based on historical context. State-of-the-art sequence-to-sequence models rely on neural attention between timesteps, which allows for temporal learning but fails to consider distinct spatial relationships between variables. In contrast, methods based on graph neural networks explicitly model variable relationships. However, these methods often rely on predefined graphs and perform separate spatial and temporal updates without establishing direct connections between each variable at every timestep. This paper addresses these problems by translating multivariate forecasting into a spatiotemporal sequence formulation where each Transformer input token represents the value of a single variable at a given time. Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence. Our method, which we call Spacetimeformer, achieves competitive results on benchmarks from traffic forecasting to electricity demand and weather prediction while learning fully-connected spatiotemporal relationships purely from data.
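    The core reformulation is simple enough to sketch: flatten a (time, variable) grid into one long token sequence whose entries carry time, variable, and value information. The tuple layout below is an illustrative assumption; the paper learns embeddings for each component.

        import numpy as np

        def to_spatiotemporal_tokens(series, times):
            # series: (T, N) multivariate series. Emit one token per (time, variable,
            # value) triple so attention can relate any variable at any timestep to
            # any other, at the cost of a sequence of length T * N.
            T, N = series.shape
            tokens = [(times[t], v, series[t, v]) for t in range(T) for v in range(N)]
            return np.array(tokens)

    The resulting sequence length of T * N is exactly why a long-range (efficient-attention) Transformer is needed for this formulation.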
    EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. (arXiv:2202.05146v3 [q-bio.BM] UPDATED)
    Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.
    Graph Representation Learning for Multi-Task Settings: a Meta-Learning Approach. (arXiv:2201.03326v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have become the state-of-the-art method for many applications on graph structured data. GNNs are a model for graph representation learning, which aims at learning to generate low dimensional node embeddings that encapsulate structural and feature-related information. GNNs are usually trained in an end-to-end fashion, leading to highly specialized node embeddings. While this approach achieves great results in the single-task setting, the generation of node embeddings that can be used to perform multiple tasks (with performance comparable to single-task models) is still an open problem. We propose the use of meta-learning to allow the training of a GNN model capable of producing multi-task node embeddings. In particular, we exploit the properties of optimization-based meta-learning to learn GNNs that can produce general node representations by learning parameters that can quickly (i.e. with a few steps of gradient descent) adapt to multiple tasks. Our experiments show that the embeddings produced by a model trained with our purposely designed meta-learning procedure can be used to perform multiple tasks with comparable or, surprisingly, even higher performance than both single-task and multi-task end-to-end models.
    MCMARL: Parameterizing Value Function via Mixture of Categorical Distributions for Multi-Agent Reinforcement Learning. (arXiv:2202.10134v2 [cs.LG] UPDATED)
    In cooperative multi-agent tasks, a team of agents jointly interact with an environment by taking actions, receiving a team reward and observing the next state. During the interactions, the uncertainty of environment and reward will inevitably induce stochasticity in the long-term returns and the randomness can be exacerbated with the increasing number of agents. However, such randomness is ignored by most of the existing value-based multi-agent reinforcement learning (MARL) methods, which only model the expectation of Q-value for both individual agents and the team. Compared to using the expectations of the long-term returns, it is preferable to directly model the stochasticity by estimating the returns through distributions. With this motivation, this work proposes a novel value-based MARL framework from a distributional perspective, \emph{i.e.}, parameterizing value function via \underline{M}ixture of \underline{C}ategorical distributions for MARL. Specifically, we model both individual Q-values and global Q-value with categorical distribution. To integrate categorical distributions, we define five basic operations on the distribution, which allow the generalization of expected value function factorization methods (\emph{e.g.}, VDN and QMIX) to their MCMARL variants. We further prove that our MCMARL framework satisfies \emph{Distributional-Individual-Global-Max} (DIGM) principle with respect to the expectation of distribution, which guarantees the consistency between joint and individual greedy action selections in the global Q-value and individual Q-values. Empirically, we evaluate MCMARL on both a stochastic matrix game and a challenging set of StarCraft II micromanagement tasks, showing the efficacy of our framework.
    Revisiting GANs by Best-Response Constraint: Perspective, Methodology, and Application. (arXiv:2205.10146v1 [cs.LG])
    In past years, the minimax-type single-level optimization formulation and its variations have been widely utilized to address Generative Adversarial Networks (GANs). Unfortunately, it has been proved that these alternating learning strategies cannot exactly reveal the intrinsic relationship between the generator and discriminator, thus easily resulting in a series of issues, including mode collapse, vanishing gradients and oscillations in the training phase, etc. In this work, by investigating the fundamental mechanism of GANs from the perspective of hierarchical optimization, we propose Best-Response Constraint (BRC), a general learning framework that can explicitly formulate the potential dependency of the generator on the discriminator. Rather than adopting the existing time-consuming bilevel iterations, we design an implicit gradient scheme with outer-product Hessian approximation as our fast solution strategy. \emph{Noteworthy, we demonstrate that even with different motivations and formulations, a variety of existing GANs can ALL be uniformly improved by our flexible BRC methodology.} Extensive quantitative and qualitative experimental results verify the effectiveness, flexibility and stability of our proposed framework.
    Towards the Generation of Synthetic Images of Palm Vein Patterns: A Review. (arXiv:2205.10179v1 [cs.CV])
    With the recent success of computer vision and deep learning, remarkable progress has been achieved on automatic personal recognition using vein biometrics. However, collecting large-scale real-world training data for palm vein recognition has turned out to be challenging, mainly due to the noise and irregular variations included at the time of acquisition. Meanwhile, existing palm vein recognition datasets are usually collected under near-infrared light, lacking detailed annotations on attributes (e.g., pose), so the influences of different attributes on vein recognition have been poorly investigated. Therefore, this paper examines the suitability of generated synthetic vein images to compensate for the urgent lack of publicly available large-scale datasets. Firstly, we present an overview of recent research progress on palm vein recognition, from the basic background knowledge to vein anatomical structure, data acquisition, public databases, and quality assessment procedures. Then, we focus on the state-of-the-art methods that have allowed the generation of vascular structures for biometric purposes and the modeling of biological networks with their respective application domains. In addition, we review the existing research on style transfer and biological nature-based synthetic palm vein image generation algorithms. Afterward, we formalize a general flowchart for the creation of a synthetic database, comparing real palm vein images and generated synthetic samples to gain some insight into the development of realistic vein imaging systems. Ultimately, we conclude by discussing the challenges, insights, and future perspectives in generating synthetic palm vein images for future work.
    A Computational Framework of Cortical Microcircuits Approximates Sign-concordant Random Backpropagation. (arXiv:2205.07292v2 [cs.NE] UPDATED)
    Several recent studies attempt to address the biological implausibility of the well-known backpropagation (BP) method. While promising methods such as feedback alignment, direct feedback alignment, and their variants like sign-concordant feedback alignment tackle BP's weight transport problem, their validity remains controversial owing to a set of other unsolved issues. In this work, we answer the question of whether it is possible to realize random backpropagation solely based on mechanisms observed in neuroscience. We propose a hypothetical framework consisting of a new microcircuit architecture and its supporting Hebbian learning rules. Comprising three types of cells and two types of synaptic connectivity, the proposed microcircuit architecture computes and propagates error signals through local feedback connections and supports the training of multi-layered spiking neural networks with a globally defined spiking error function. We employ the Hebbian rule operating in local compartments to update synaptic weights and achieve supervised learning in a biologically plausible manner. Finally, we interpret the proposed framework from an optimization point of view and show its equivalence to sign-concordant feedback alignment. The proposed framework is benchmarked on several datasets including MNIST and CIFAR10, demonstrating promising BP-comparable accuracy.
    Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining. (arXiv:2110.08412v2 [cs.CL] UPDATED)
    To explain NLP models, importance measures such as attention, which inform which input tokens are important for a prediction, are popular. However, an open question is how well these explanations accurately reflect a model's logic, a property called faithfulness. To answer this question, we propose a new faithfulness benchmark called Recursive ROAR. It works by recursively masking allegedly important tokens and then retraining the model. The principle is that this should result in worse model performance compared to masking random tokens. The result is a performance curve as a function of the masking ratio. Furthermore, we propose a summarizing metric using the area between curves, which allows for easy comparison across papers, models, and tasks. To provide a thorough evaluation, we evaluate 4 different importance measures on 8 different datasets, using both LSTM-attention models and RoBERTa models. We find that the faithfulness of importance measures is both model-dependent and task-dependent. This conclusion contradicts previous evaluations in both the computer vision and faithfulness-of-attention literature.
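    A minimal sketch of the benchmark's loop is below; train_fn, importance_fn and eval_fn are placeholder callables standing in for model training, importance scoring and evaluation, and documents are assumed to be token lists.

        def mask_top(docs, scores, ratio, mask="[MASK]"):
            # Replace the top `ratio` fraction of tokens (by importance) in each doc.
            out = []
            for toks, sc in zip(docs, scores):
                k = max(1, int(len(toks) * ratio))
                top = set(sorted(range(len(toks)), key=lambda i: -sc[i])[:k])
                out.append([mask if i in top else t for i, t in enumerate(toks)])
            return out

        def recursive_roar(train_docs, test_docs, train_fn, importance_fn, eval_fn,
                           steps=5, step_ratio=0.1):
            # Recursively mask allegedly important tokens, retraining in between,
            # and record the performance curve at each masking level.
            curve = []
            for _ in range(steps):
                model = train_fn(train_docs)
                curve.append(eval_fn(model, test_docs))
                train_docs = mask_top(train_docs, importance_fn(model, train_docs),
                                      step_ratio)
                test_docs = mask_top(test_docs, importance_fn(model, test_docs),
                                     step_ratio)
            return curve  # compare against the same curve obtained with random masking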
    AutoFedNLP: An efficient FedNLP framework. (arXiv:2205.10162v1 [cs.LG])
    Transformer-based pre-trained models have revolutionized NLP for superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resultant high network/computation cost. Towards practical FedNLP, we identify adapters, small bottleneck modules inserted at a variety of model layers, as the key building blocks. A key challenge is to properly configure the depth and width of adapters, to which the training speed and efficiency are highly sensitive. No silver-bullet configuration exists: the optimal choice varies across downstream NLP tasks, desired model accuracy, and client resources, and a non-optimal configuration could significantly slow down the training. To automate adapter configuration, we propose AutoFedNLP, a framework that enhances the existing FedNLP with two novel designs. First, AutoFedNLP progressively upgrades the adapter configuration throughout a training session. Second, AutoFedNLP continuously profiles future adapter configurations by allocating participant devices to trial groups. To minimize client-side computations, AutoFedNLP exploits the fact that a FedNLP client trains on the same samples repeatedly between consecutive changes of adapter configurations, and caches computed activations on clients. Extensive experiments show that AutoFedNLP can reduce FedNLP's model convergence delay to no more than several hours, which is up to 155.5$\times$ faster compared to vanilla FedNLP and 48$\times$ faster compared to strong baselines.
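    For reference, an adapter in this sense is a small bottleneck module with a residual connection: the "width" being configured is the bottleneck dimension, and the "depth" is how many layers receive such modules. A minimal sketch (the GELU activation is an assumption):

        import torch.nn as nn

        class Adapter(nn.Module):
            # Bottleneck adapter: down-project, nonlinearity, up-project, residual.
            # In adapter-based FedNLP only these small modules are trained and
            # communicated; the pre-trained backbone stays frozen.
            def __init__(self, hidden, bottleneck):
                super().__init__()
                self.down = nn.Linear(hidden, bottleneck)  # "width" = bottleneck size
                self.act = nn.GELU()
                self.up = nn.Linear(bottleneck, hidden)

            def forward(self, x):
                return x + self.up(self.act(self.down(x)))  # residual preserves backbone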
    Learning Convolutional Neural Networks in the Frequency Domain. (arXiv:2204.06718v8 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) have achieved impressive success in computer vision during the past few decades. The image convolution operation helps CNNs achieve good performance on image-related tasks. However, image convolution has high computational complexity and is hard to implement efficiently. This paper proposes CEMNet, which can be trained in the frequency domain. The most important motivation of this research is that we can use the straightforward element-wise multiplication operation to replace the image convolution in the frequency domain based on the Cross-Correlation Theorem, which obviously reduces the computational complexity. We further introduce a Weight Fixation mechanism to alleviate over-fitting, and analyze the working behavior of Batch Normalization, Leaky ReLU, and Dropout in the frequency domain to design their counterparts for CEMNet. Also, to deal with the complex inputs brought by the Discrete Fourier Transform, we design a two-branch network structure for CEMNet. Experimental results show that CEMNet achieves good performance on the MNIST and CIFAR-10 databases.
    The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks. (arXiv:2205.10144v1 [cs.CV])
    In laboratory object recognition tasks based on undistorted photographs, both adult humans and Deep Neural Networks (DNNs) perform close to ceiling. Unlike adults', whose object recognition performance is robust against a wide range of image distortions, DNNs trained on standard ImageNet (1.3M images) perform poorly on distorted images. However, the last two years have seen impressive gains in DNN distortion robustness, predominantly achieved through ever-increasing large-scale datasets, orders of magnitude larger than ImageNet. While this simple brute-force approach is very effective in achieving human-level robustness in DNNs, it raises the question of whether human robustness, too, is simply due to extensive experience with (distorted) visual input during childhood and beyond. Here we investigate this question by comparing the core object recognition performance of 146 children (aged 4-15) against adults and against DNNs. We find, first, that already 4-6 year-olds showed remarkable robustness to image distortions and outperform DNNs trained on ImageNet. Second, we estimated the number of "images" children have been exposed to during their lifetime. Compared to various DNNs, children's high robustness requires relatively little data. Third, when recognizing objects children, like adults but unlike DNNs, rely heavily on shape but not on texture cues. Together our results suggest that the remarkable robustness to distortions emerges early in the developmental trajectory of human object recognition and is unlikely the result of a mere accumulation of experience with distorted visual input. Even though current DNNs match human performance regarding robustness, they seem to rely on different and more data-hungry strategies to do so.
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v3 [cs.LG] UPDATED)
    A significant bottleneck in federated learning (FL) is the network communication cost of sending model updates from client devices to the central server. We present a comprehensive empirical study of the statistics of model updates in FL, as well as the role and benefits of various compression techniques. Motivated by these observations, we propose a novel method to reduce the average communication cost, which is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on Stack Overflow next-word prediction, a realistic and challenging FL benchmark. This is achieved by examining the problem using rate-distortion theory, and proposing distortion as a reliable proxy for model accuracy. Distortion can be more effectively used for optimizing the trade-off between model performance and communication cost across clients. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds.
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v2 [cs.LG] UPDATED)
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation and supervised learning with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.
    Task Relabelling for Multi-task Transfer using Successor Features. (arXiv:2205.10175v1 [cs.AI])
    Deep Reinforcement Learning has been very successful recently with various works on complex domains. Most works are concerned with learning a single policy that solves the target task, but this policy is fixed in the sense that if the environment changes the agent is unable to adapt to it. Successor Features (SFs) provide a mechanism that allows learning policies that are not tied to any particular reward function. In this work we investigate how SFs may be pre-trained without observing any reward in a custom environment that features resource collection, traps and crafting. After pre-training we expose the SF agents to various target tasks and see how well they can transfer to new tasks. Transferring is done without any further training on the SF agents, instead just by providing a task vector. For training the SFs we propose a task relabelling method which greatly improves the agent's performance.
    The Fairness of Credit Scoring Models. (arXiv:2205.10200v1 [stat.ML])
    In credit markets, screening algorithms aim to discriminate between good-type and bad-type borrowers. However, when doing so, they also often discriminate between individuals sharing a protected attribute (e.g. gender, age, racial origin) and the rest of the population. In this paper, we show how (1) to test whether there exists a statistically significant difference between protected and unprotected groups, which we call lack of fairness and (2) to identify the variables that cause the lack of fairness. We then use these variables to optimize the fairness-performance trade-off. Our framework provides guidance on how algorithmic fairness can be monitored by lenders, controlled by their regulators, and improved for the benefit of protected groups.
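    Point (1) can be illustrated with a generic two-sample permutation test on model scores; this is a stand-in for the paper's statistical test, not its exact procedure.

        import numpy as np

        def fairness_gap_test(scores, protected, n_perm=10000, seed=0):
            # scores: array of model outputs; protected: boolean membership array.
            # Two-sample permutation test for a gap in mean scores between groups.
            rng = np.random.default_rng(seed)
            gap = scores[protected].mean() - scores[~protected].mean()
            null = np.empty(n_perm)
            for i in range(n_perm):
                perm = rng.permutation(protected)
                null[i] = scores[perm].mean() - scores[~perm].mean()
            p = np.mean(np.abs(null) >= abs(gap))
            return gap, p  # a small p-value signals a statistically significant gap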
    Do Transformer Models Show Similar Attention Patterns to Task-Specific Human Gaze?. (arXiv:2205.10226v1 [cs.CL])
    Learned self-attention functions in state-of-the-art NLP models often correlate with human attention. We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention. We compare attention functions across two task-specific reading datasets for sentiment analysis and relation extraction. We find the predictiveness of large-scale pre-trained self-attention for human attention depends on `what is in the tail', e.g., the syntactic nature of rare contexts. Further, we observe that task-specific fine-tuning does not increase the correlation with human task-specific reading. Through an input reduction experiment we give complementary insights on the sparsity and fidelity trade-off, showing that lower-entropy attention vectors are more faithful.
    A Novel Weighted Ensemble Learning Based Agent for the Werewolf Game. (arXiv:2205.09813v1 [cs.LG])
    Werewolf is a popular party game throughout the world, and research on it has progressed in recent years. The Werewolf game is based on conversation, and in order to win, participants must use all of their cognitive abilities. This communication game requires the playing agents to be very sophisticated to win. In this research, we generated a sophisticated agent to play the Werewolf game using a complex weighted ensemble learning approach. This work aimed to estimate what other agents/players think of our agent in the game. The agent was developed by aggregating the strategies of different participants in the AI Wolf competition and thereby learning from them using machine learning. Moreover, the created agent was able to perform much better than other competitors while using very basic strategies, showing the effectiveness of the approach in the Werewolf game. The machine learning technique used here is not restricted to the Werewolf game but may be extended to any game that requires communication and action depending on other participants.
    Linearizing Transformer with Key-Value Memory. (arXiv:2203.12644v3 [cs.CL] UPDATED)
    Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based transformers. Despite their unique merits, they usually suffer from a performance drop compared with the vanilla transformer on many sequence generation tasks, and often fail to obtain computation gains when the generation is short. We propose MemSizer, an approach towards closing the performance gap while improving efficiency even with short generation. It projects the source sequences into lower-dimensional representations like Linformer, while enjoying efficient recurrent-style incremental computation similar to kernel-based transformers. This yields linear computation time and constant memory complexity at inference time. MemSizer also employs a lightweight multi-head mechanism which renders the computation as light as a single-head model. We demonstrate that MemSizer provides an improved balance between efficiency and accuracy over the vanilla transformer and other efficient transformer variants in three typical sequence generation tasks, including machine translation, abstractive text summarization, and language modeling.
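    The recurrent-style incremental computation shared with kernel-based transformers can be sketched as follows; the feature map phi and the state layout are generic linear-attention assumptions rather than MemSizer's exact parameterization.

        import numpy as np

        def linear_attention_step(S, z, q, k, v,
                                  phi=lambda x: np.maximum(x, 0.0) + 1e-6):
            # One incremental decoding step of kernel-based linear attention.
            # State (S, z) has constant size, unlike a growing key-value cache.
            S = S + np.outer(phi(k), v)   # running sum of phi(k) v^T
            z = z + phi(k)                # running normalizer
            out = (phi(q) @ S) / (phi(q) @ z)
            return S, z, out

    Because the state (S, z) does not grow with the generated length, per-step cost and memory stay constant, which is where the linear-time, constant-memory inference claims come from.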
    GNN-Geo: A Graph Neural Network-based Fine-grained IP geolocation Framework. (arXiv:2112.10767v6 [cs.LG] UPDATED)
    Rule-based fine-grained IP geolocation methods are hard to generalize in computer networks which do not follow hypothetical rules. Recently, deep learning methods, like the multi-layer perceptron (MLP), have been tried to increase generalization capabilities. However, MLP is not well suited to graph-structured data like networks. MLP treats IP addresses as isolated instances and ignores the connection information, which limits geolocation accuracy. In this work, we research how to increase the generalization capability with an emerging graph deep learning method - the Graph Neural Network (GNN). First, IP geolocation is re-formulated as an attributed graph node regression problem. Then, we propose a GNN-based IP geolocation framework named GNN-Geo. GNN-Geo consists of a preprocessor, an encoder, message passing (MP) layers and a decoder. The preprocessor and encoder transform measurement data into the initial node embeddings. MP layers refine the initial node embeddings by modeling the connection information. The decoder maps the refined embeddings to nodes' locations and relieves the convergence problem of GNN by considering prior knowledge. Experiments on different real-world datasets show that the proposed GNN-Geo outperforms the state-of-the-art rule-based and learning-based baselines on all datasets w.r.t. median error distance by 16% to 28%. This work verifies the great potential of GNN for fine-grained IP geolocation.
    Interpretable Personalization via Policy Learning with Linear Decision Boundaries. (arXiv:2003.07545v3 [cs.LG] UPDATED)
    With the rise of the digital economy and an explosion of available information on consumers, effective personalization of offers, goods, and services has become a core business focus for companies to improve revenues and maintain competitive edge. This paper studies the personalization problem through the lens of policy learning, where the goal is to learn a decision-making rule (a policy) that maps from consumer and product characteristics (features) to recommendations (actions) in order to optimize outcomes (rewards). We focus on using available historical data for offline learning with unknown data collection procedure. Importantly, in many business and medical settings, interpretability of a policy is essential. To address these challenges, we study the class of policies with linear decision boundaries and propose learning algorithms using tools from causal inference. We propose several optimization schemes to solve the associated non-convex, non-smooth optimization problem, and find that an adapted Bayesian optimization algorithm is fast and effective. We test our algorithm with extensive simulation studies and apply it to an online marketplace customer purchase dataset, where the learned policy outputs a personalized discount recommendation based on customer and product features in order to maximize gross merchandise value (GMV) for sellers. Our learned policy improves upon the platform's baseline by 88.2\% in net sales revenue, while also providing informative insights on which features are important for the decision-making process, e.g. when "Attribute 2" is large, marginal increase in GMV is low for discounts higher than 10\%. Our findings suggest that the proposed policy learning algorithm provides a promising practical approach for interpretable personalization across a wide range of applications.
    Counterfactual Temporal Point Processes. (arXiv:2111.07603v2 [cs.LG] UPDATED)
    Machine learning models based on temporal point processes are the state of the art in a wide variety of applications involving discrete events in continuous time. However, these models lack the ability to answer counterfactual questions, which are increasingly relevant as these models are being used to inform targeted interventions. In this work, our goal is to fill this gap. To this end, we first develop a causal model of thinning for temporal point processes that builds upon the Gumbel-Max structural causal model. This model satisfies a desirable counterfactual monotonicity condition, which is sufficient to identify counterfactual dynamics in the process of thinning. Then, given an observed realization of a temporal point process with a given intensity function, we develop a sampling algorithm that uses the above causal model of thinning and the superposition theorem to simulate counterfactual realizations of the temporal point process under a given alternative intensity function. Simulation experiments using synthetic and real epidemiological data show that the counterfactual realizations provided by our algorithm may give valuable insights to enhance targeted interventions.
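    The flavor of the sampling algorithm can be conveyed with a common-noise thinning sketch: sample one homogeneous Poisson process, then thin it under the observed and the alternative intensity while reusing the same uniforms, so the two realizations are coupled. The paper's coupling goes through a Gumbel-Max structural causal model; shared uniforms are a simplification here, and both intensities are assumed bounded by lmax.

        import numpy as np

        def counterfactual_thinning(lmax, lam_obs, lam_cf, T, seed=0):
            # Sample one homogeneous Poisson process at rate lmax on [0, T], then
            # thin it under the observed and counterfactual intensities REUSING the
            # same uniforms, so the two realizations are coupled.
            rng = np.random.default_rng(seed)
            t, cand = 0.0, []
            while True:
                t += rng.exponential(1.0 / lmax)
                if t > T:
                    break
                cand.append((t, rng.random()))
            observed = [s for s, u in cand if u < lam_obs(s) / lmax]
            counterfactual = [s for s, u in cand if u < lam_cf(s) / lmax]
            return observed, counterfactual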
    On Algorithmic Stability in Unsupervised Representation Learning. (arXiv:2106.05238v3 [cs.LG] UPDATED)
    In this paper, we investigate the algorithmic stability of unsupervised representation learning with deep generative models, as a function of repeated re-training on the same input data. Algorithms for learning low dimensional linear representations -- for example principal components analysis (PCA), or linear independent components analysis (ICA) -- come with guarantees that they will always reveal the same latent representations (perhaps up to an arbitrary rotation or permutation). Unfortunately, for non-linear representation learning, such as in a variational auto-encoder (VAE) model trained by stochastic gradient descent, we have no such guarantees. Recent work on identifiability in non-linear ICA has introduced a family of deep generative models that have identifiable latent representations, achieved by conditioning on side information (e.g. informative labels). We empirically evaluate the stability of these models under repeated re-estimation of parameters, and compare them to both standard VAEs and deep generative models which learn to cluster in their latent space. Surprisingly, we discover side information is not necessary for algorithmic stability: using standard quantitative measures of identifiability, we find deep generative models with latent clusterings are empirically identifiable to the same degree as models which rely on auxiliary labels. We relate these results to the possibility of identifiable non-linear ICA.
    CertiFair: A Framework for Certified Global Fairness of Neural Networks. (arXiv:2205.09927v1 [cs.LG])
    We consider the problem of whether a Neural Network (NN) model satisfies global individual fairness. Individual fairness suggests that similar individuals with respect to a certain task are to be treated similarly by the decision model. In this work, we have two main objectives. The first is to construct a verifier which checks whether the fairness property holds for a given NN in a classification task, or provides a counterexample if it is violated, i.e., the model is fair if all similar individuals are classified the same, and unfair if a pair of similar individuals are classified differently. To that end, we construct a sound and complete verifier that verifies global individual fairness properties of ReLU NN classifiers using distance-based similarity metrics. The second objective of this paper is to provide a method for training provably fair NN classifiers from unfair (biased) data. We propose a fairness loss that can be used during training to enforce fair outcomes for similar individuals. We then provide provable bounds on the fairness of the resulting NN. We run experiments on commonly used fairness datasets that are publicly available and show that global individual fairness can be improved by 96% without a significant drop in test accuracy.
    Unintended memorisation of unique features in neural networks. (arXiv:2205.10079v1 [cs.LG])
    Neural networks pose a privacy risk due to their propensity to memorise and leak training data. We show that unique features occurring only once in training data are memorised by discriminative multi-layer perceptrons and convolutional neural networks trained on benchmark imaging datasets. We design our method for settings where sensitive training data is not available, for example medical imaging. Our setting knows the unique feature, but not the training data, model weights or the unique feature's label. We develop a score estimating a model's sensitivity to a unique feature by comparing the KL divergences of the model's output distributions given modified out-of-distribution images. We find that typical strategies to prevent overfitting do not prevent unique feature memorisation, and that images containing a unique feature are highly influential, regardless of the influence of the image's other features. We also find significant variation in memorisation with the training seed. These results imply that neural networks pose a privacy risk to rarely occurring private information. This risk is more pronounced in healthcare applications, since sensitive patient information can be memorised when it remains in training data due to an imperfect data sanitisation process.
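    The sensitivity score can be sketched as an average KL divergence between the model's outputs on out-of-distribution images with and without the unique feature; insert_feature below is a hypothetical function that stamps the feature onto an image tensor, and the simple averaging is an assumption.

        import torch
        import torch.nn.functional as F

        def unique_feature_sensitivity(model, ood_images, insert_feature):
            # Average KL divergence between output distributions on OOD images with
            # and without the unique feature. `insert_feature` is a hypothetical
            # function that stamps the feature onto an image tensor.
            model.eval()
            kls = []
            with torch.no_grad():
                for x in ood_images:
                    p = F.log_softmax(model(x.unsqueeze(0)), dim=-1)
                    q = F.log_softmax(model(insert_feature(x).unsqueeze(0)), dim=-1)
                    kls.append(F.kl_div(q, p, reduction="batchmean",
                                        log_target=True).item())
            return sum(kls) / len(kls)  # larger scores suggest memorisation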
    Unsupervised Out-of-Domain Detection via Pre-trained Transformers. (arXiv:2106.00948v2 [cs.CL] UPDATED)
    Deployed real-world machine learning applications are often subject to uncontrolled and even potentially malicious inputs. Those out-of-domain inputs can lead to unpredictable outputs and sometimes catastrophic safety issues. Prior studies on out-of-domain detection require in-domain task labels and are limited to supervised classification scenarios. Our work tackles the problem of detecting out-of-domain samples with only unsupervised in-domain data. We utilize the latent representations of pre-trained transformers and propose a simple yet effective method to transform features across all layers to construct out-of-domain detectors efficiently. Two domain-specific fine-tuning approaches are further proposed to boost detection accuracy. Our empirical evaluations of related methods on two datasets validate that our method greatly improves out-of-domain detection ability in a more general scenario.
    You Don't Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers' Private Personas. (arXiv:2205.10228v1 [cs.CL])
    Social chatbots, also known as chit-chat chatbots, evolve rapidly with large pretrained language models. Despite the huge progress, privacy concerns have arisen recently: training data of large language models can be extracted via model inversion attacks. On the other hand, the datasets used for training chatbots contain many private conversations between two individuals. In this work, we further investigate the privacy leakage of the hidden states of chatbots trained by language modeling which has not been well studied yet. We show that speakers' personas can be inferred through a simple neural network with high accuracy. To this end, we propose effective defense objectives to protect persona leakage from hidden states. We conduct extensive experiments to demonstrate that our proposed defense objectives can greatly reduce the attack accuracy from 37.6% to 0.5%. Meanwhile, the proposed objectives preserve language models' powerful generation ability.
    Policies for the Dynamic Traveling Maintainer Problem with Alerts. (arXiv:2105.15119v2 [math.OC] UPDATED)
    Downtime of industrial assets such as wind turbines and medical imaging devices comes at a sharp cost. To avoid such downtime costs, companies seek to initiate maintenance just before failure. Unfortunately, this is challenging for the following two reasons: On the one hand, because asset failures are notoriously difficult to predict, even in the presence of real-time monitoring devices which signal early degradation. On the other hand, because the available resources to serve a network of geographically dispersed assets are typically limited. In this paper, we propose a novel dynamic traveling maintainer problem with alerts model that incorporates these two challenges and we provide three solution approaches on how to dispatch the limited resources. Namely, we propose: (i) Greedy heuristic approaches that rank assets on urgency, proximity and economic risk; (ii) A novel traveling maintainer heuristic approach that optimizes short-term costs; and (iii) A deep reinforcement learning (DRL) approach that optimizes long-term costs. Each approach has different requirements concerning the available alert information. Experiments with small asset networks show that all methods can approximate the optimal policy when given access to complete condition information. For larger networks, the proposed methods yield competitive policies, with DRL consistently achieving the lowest costs.
    A Unified Experiment Design Approach for Cyclic and Acyclic Causal Models. (arXiv:2205.10083v1 [cs.LG])
    We study experiment design for the unique identification of the causal graph of a system where the graph may contain cycles. The presence of cycles in the structure introduces major challenges for experiment design. Unlike the case of acyclic graphs, learning the skeleton of the causal graph from observational distribution may not be possible. Furthermore, intervening on a variable does not necessarily lead to orienting all the edges incident to it. In this paper, we propose an experiment design approach that can learn both cyclic and acyclic graphs and hence, unifies the task of experiment design for both types of graphs. We provide a lower bound on the number of experiments required to guarantee the unique identification of the causal graph in the worst case, showing that the proposed approach is order-optimal in terms of the number of experiments up to an additive logarithmic term. Moreover, we extend our result to the setting where the size of each experiment is bounded by a constant. For this case, we show that our approach is optimal in terms of the size of the largest experiment required for the unique identification of the causal graph in the worst case.
    Confident Clustering via PCA Compression Ratio and Its Application to Single-cell RNA-seq Analysis. (arXiv:2205.09849v1 [cs.LG])
    Unsupervised clustering algorithms for vectors have been widely used in machine learning. Many applications, including the biological data we study in this paper, contain boundary datapoints that exhibit combined properties of two underlying clusters and can lower the performance of traditional clustering algorithms. We develop a confident clustering method that aims to diminish the influence of these datapoints and improve the clustering results. Concretely, for a list of datapoints, we produce two clustering results. The first-round clustering attempts to classify only pure vectors with high confidence. Based on it, we classify more vectors with less confidence in the second round. We validate our algorithm on single-cell RNA-seq data, a powerful and widely used tool in biology. Our confident clustering achieves high accuracy on the tested datasets. In addition, unlike traditional clustering methods in single-cell analysis, confident clustering shows high stability under different choices of parameters.
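    A rough sketch of the two-round idea in scikit-learn terms, using a simple distance-margin confidence rule in place of the authors' PCA compression-ratio criterion:

        import numpy as np
        from sklearn.cluster import KMeans

        def confident_clustering(X, k=2, margin=1.5):
            km = KMeans(n_clusters=k, n_init=10).fit(X)
            dist = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None], axis=2)
            d_sorted = np.sort(dist, axis=1)
            # Round 1: a point is "pure" if its second-nearest centroid is
            # clearly farther than its nearest one.
            confident = d_sorted[:, 1] / (d_sorted[:, 0] + 1e-12) > margin
            labels = km.labels_.copy()
            # Round 2: refine centroids on confident points only, then place
            # the boundary points with the refined centroids.
            centers = np.stack([X[confident & (labels == c)].mean(axis=0)
                                for c in range(k)])
            boundary = ~confident
            labels[boundary] = np.linalg.norm(
                X[boundary, None, :] - centers[None], axis=2).argmin(axis=1)
            return labels, confident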
    How to Guide Adaptive Depth Sampling? (arXiv:2205.10202v1 [cs.CV])
    Recent advances in depth sensing technologies allow fast electronic maneuvering of the laser beam, as opposed to fixed mechanical rotations. This will enable future sensors, in principle, to vary in real-time the sampling pattern. We examine here the abstract problem of whether adapting the sampling pattern for a given frame can reduce the reconstruction error or allow a sparser pattern. We propose a constructive generic method to guide adaptive depth sampling algorithms. Given a sampling budget B, a depth predictor P and a desired quality measure M, we propose an Importance Map that highlights important sampling locations. This map is defined for a given frame as the per-pixel expected value of M produced by the predictor P, given a pattern of B random samples. This map can be well estimated in a training phase. We show that a neural network can learn to produce a highly faithful Importance Map, given an RGB image. We then suggest an algorithm to produce a sampling pattern for the scene, which is denser in regions that are harder to reconstruct. The sampling strategy of our modular framework can be adjusted according to hardware limitations, type of depth predictor, and any custom reconstruction error measure that should be minimized. We validate through simulations that our approach outperforms grid and random sampling patterns as well as recent state-of-the-art adaptive algorithms.
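    A minimal sketch of turning an Importance Map into a sampling pattern — pixels are drawn without replacement with probability proportional to their estimated importance (the paper's pattern construction is more elaborate):

        import numpy as np

        def sample_pattern(importance, budget):
            # importance: non-negative H x W array; budget: number of depth samples B
            p = importance.ravel().astype(float)
            p /= p.sum()
            idx = np.random.choice(p.size, size=budget, replace=False, p=p)
            return np.unravel_index(idx, importance.shape)  # (rows, cols) to sample

        rows, cols = sample_pattern(np.random.rand(240, 320), budget=1000)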
    Generalisation effects of predictive uncertainty estimation in deep learning for digital pathology. (arXiv:2112.09693v2 [cs.LG] UPDATED)
    Deep learning (DL) has shown great potential in digital pathology applications. The robustness of a diagnostic DL-based solution is essential for safe clinical deployment. In this work we evaluate whether adding uncertainty estimates to DL predictions in digital pathology could increase their value for clinical applications, by boosting general predictive performance or by detecting mispredictions. We compare the effectiveness of model-integrated methods (MC dropout and deep ensembles) with a model-agnostic approach (test-time augmentation, TTA). Moreover, four uncertainty metrics are compared. Our experiments focus on two domain-shift scenarios: a shift to a different medical center and a shift to an underrepresented subtype of cancer. Our results show that uncertainty estimates increase reliability by reducing a model's sensitivity to classification threshold selection and by detecting between 70% and 90% of the mispredictions made by the model. Overall, the deep ensembles method achieved the best performance, closely followed by TTA.
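    For concreteness, a minimal MC-dropout sketch (one of the model-integrated methods compared), using predictive entropy as the uncertainty metric; `model` is assumed to be a generic torch classifier, and in practice only the dropout layers should be switched to training mode:

        import torch

        @torch.no_grad()
        def mc_dropout_predict(model, x, T=20):
            model.train()  # keeps nn.Dropout sampling; note this also affects BatchNorm
            probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
            mean = probs.mean(dim=0)
            entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
            return mean, entropy  # high entropy can flag likely mispredictions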
    Remember and Forget Experience Replay for Multi-Agent Reinforcement Learning. (arXiv:2203.13319v2 [cs.LG] UPDATED)
    We present the extension of the Remember and Forget for Experience Replay (ReF-ER) algorithm to Multi-Agent Reinforcement Learning (MARL). ReF-ER was shown to outperform state-of-the-art algorithms for continuous control in problems ranging from the OpenAI Gym to complex fluid flows. In MARL, the dependencies between the agents are included in the state-value estimator, and the environment dynamics are modeled via the importance weights used by ReF-ER. In collaborative environments, we find the best performance when the value is estimated using individual rewards and we ignore the effects of other actions on the transition map. We benchmark the performance of ReF-ER MARL on the Stanford Intelligent Systems Laboratory (SISL) environments. We find that employing a single feed-forward neural network for the policy and the value function in ReF-ER MARL outperforms state-of-the-art algorithms that rely on complex neural network architectures.
    Improving Multi-Task Generalization via Regularizing Spurious Correlation. (arXiv:2205.09797v1 [cs.LG])
    Multi-Task Learning (MTL) is a powerful learning paradigm to improve generalization performance via knowledge sharing. However, existing studies find that MTL could sometimes hurt generalization, especially when two tasks are less correlated. One possible culprit is spurious correlation, i.e., knowledge that is not causally related to the task labels but that the model may mistakenly exploit, failing when that correlation changes. In the MTL setup, spurious correlation poses several unique challenges. First, the risk of acquiring non-causal knowledge is higher, as the shared MTL model needs to encode all knowledge from different tasks, and causal knowledge for one task could be potentially spurious to another. Second, confounders between task labels introduce a different type of spurious correlation in MTL. We theoretically prove that MTL is more prone than single-task learning to taking on non-causal knowledge from other tasks, and thus generalizes worse. To solve this problem, we propose the Multi-Task Causal Representation Learning framework, which represents multi-task knowledge via disentangled neural modules and learns which module is causally related to each task via MTL-specific invariant regularization. Experiments show that it enhances MTL models' performance by 5.5% on average over Multi-MNIST, MovieLens, Taskonomy, CityScape, and NYUv2 by alleviating the spurious correlation problem.
    Continual learning on 3D point clouds with random compressed rehearsal. (arXiv:2205.08013v2 [cs.LG] UPDATED)
    Contemporary deep neural networks offer state-of-the-art results when applied to visual reasoning, e.g., in the context of 3D point cloud data. Point clouds are an important datatype for the precise modeling of three-dimensional environments, but effective processing of this type of data proves to be challenging. In the world of large, heavily-parameterized network architectures and continuously-streamed data, there is an increasing need for machine learning models that can be trained on additional data. Unfortunately, currently available models cannot fully leverage training on additional data without losing their past knowledge. Combating this phenomenon, called catastrophic forgetting, is one of the main objectives of continual learning. Continual learning for deep neural networks has been an active field of research, primarily in 2D computer vision, natural language processing, reinforcement learning, and robotics. However, in 3D computer vision there are hardly any continual learning solutions specifically designed to take advantage of point cloud structure. This work proposes a novel neural network architecture capable of continual learning on 3D point cloud data. We utilize point cloud structure properties to preserve a heavily compressed set of past data. By using rehearsal and reconstruction as regularization methods for the learning process, our approach achieves a significant decrease in catastrophic forgetting compared to existing solutions on several of the most popular point cloud datasets, considering two continual learning settings: one where the task is known beforehand, and the more challenging one where task information is unknown to the model.
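    A simplified sketch of the compression idea, with uniform random subsampling standing in for the paper's compression scheme: point clouds tolerate aggressive point dropping while keeping coarse shape, so past samples can be stored cheaply for rehearsal:

        import numpy as np

        def compress(cloud, keep=128):
            # cloud: (n_points, 3) array; a random subset keeps the coarse shape
            idx = np.random.choice(len(cloud), size=keep, replace=False)
            return cloud[idx]

        rehearsal_buffer = []

        def remember(cloud, label):
            rehearsal_buffer.append((compress(cloud), label))
            # mix samples from rehearsal_buffer into each new task's batches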
    Translating Hanja historical documents to understandable Korean and English. (arXiv:2205.10019v1 [cs.CL])
    The Annals of the Joseon Dynasty (AJD) contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals were originally written in an archaic Korean writing system, 'Hanja', and translated into Korean from 1968 to 1993. However, this translation was literal and contained many archaic Korean words; thus, a new expert translation effort began in 2012, completing the records of only one king in a decade. Expert translators are also working on an English translation, of which only one king's records are available because of the high cost and slow progress. We therefore propose H2KE, a neural machine translation model that translates Hanja historical documents into understandable Korean and English. Based on the multilingual neural machine translation approach, it translates historical documents written in Hanja using both the full dataset of outdated Korean translations and a small dataset of recently translated Korean and English. We compare our method with two baselines: a recent model that simultaneously learns to restore and translate Hanja historical documents, and a transformer trained only on the newly translated corpora. The results show that our method significantly outperforms the baselines in terms of BLEU score for both modern Korean and English translations. We also conduct a human evaluation showing that our translation is preferred over the original expert translation.
    A Case of Exponential Convergence Rates for SVM. (arXiv:2205.10055v1 [stat.ML])
    Classification is often the first problem described in introductory machine learning classes. Generalization guarantees for classification have historically been offered by Vapnik-Chervonenkis theory. Yet those guarantees are based on intractable algorithms, which has led to the theory of surrogate methods in classification. Guarantees offered by surrogate methods are based on calibration inequalities, which have been shown to be highly sub-optimal under some margin conditions, falling short of capturing exponential convergence phenomena. These "super" fast rates are becoming well understood for smooth surrogates, but the picture remains blurry for non-smooth losses such as the hinge loss associated with the renowned support vector machines. In this paper, we present a simple mechanism to obtain fast convergence rates and we investigate its usage for SVM. In particular, we show that SVM can exhibit exponential convergence rates even without assuming the hard Tsybakov margin condition.
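    For reference, the hard Tsybakov margin condition mentioned above is commonly stated as follows (a sketch in standard binary-classification notation with $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$; the paper's exact formulation may differ):

        % hard (Massart-type) margin: the regression function stays away from 1/2
        \exists\, h > 0 : \quad \mathbb{P}\bigl( \,|2\eta(X) - 1| \ge h\, \bigr) = 1 .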
    Semi-self-supervised Automated ICD Coding. (arXiv:2205.10088v1 [cs.CL])
    Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free-text format, as they examine and interview patients. In recent years, several studies have provided evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time-consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs with machine-learned imputation in a semi-self-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of un-annotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might find the answers to during a patient consultation. The features are then used to train a classifier for the diagnosis of certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of data availability to the physician. Our data augmentation method shows a significant positive effect, which is diminished when clinical features from the examination of the patient and diagnostics are made available. We recommend our method for augmenting scarce datasets for systems that make decisions based on clinical features that do not include examinations or tests.
    Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation. (arXiv:2201.08504v3 [stat.ML] UPDATED)
    Deep reinforcement learning (DRL) has attracted much attention as an approach to solving sequential decision-making problems without mathematical models of systems or environments. In general, constraints may be imposed on the decision making. In this study, we consider optimal decision-making problems with constraints to complete high-level temporal tasks in the continuous state-action domain. We describe the constraints using signal temporal logic (STL), which is useful for time-sensitive control tasks since it can specify continuous signals within a bounded time interval. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), called a $\tau$-CMDP. We formulate the STL-constrained optimal decision-making problem as a $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we also demonstrate the learning performance of the proposed algorithm.
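    Schematically, a Lagrangian-relaxation scheme of this kind alternates policy updates with a dual update of the following form (names and the learning rate below are illustrative, not the paper's exact algorithm):

        # The multiplier grows while the estimated constraint cost exceeds its
        # budget, shrinks otherwise, and never goes negative.
        def update_multiplier(lam, constraint_cost, budget, lr=1e-2):
            return max(0.0, lam + lr * (constraint_cost - budget))

        # Inside training, the policy minimizes: task_loss + lam * constraint_cost,
        # while lam is refreshed with the rule above after each evaluation batch.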
    Towards biologically plausible Dreaming and Planning. (arXiv:2205.10044v1 [cs.LG])
    Humans and animals can learn new skills after practicing for a few hours, while current reinforcement learning algorithms require a large amount of data to achieve good performance. Recent model-based approaches show promising results by reducing the number of necessary interactions with the environment to learn a desirable policy. However, these methods require biologically implausible ingredients, such as detailed storage of older experiences and long periods of offline learning. The optimal way to learn and exploit world-models is still an open question. Taking inspiration from biology, we suggest that dreaming might be an efficient expedient to use an inner model. We propose a two-module (agent and model) neural network in which "dreaming" (living new experiences in a model-based simulated environment) significantly boosts learning. We also explore "planning", an online alternative to dreaming, that shows comparable performance. Importantly, our model does not require detailed storage of experiences and learns the world-model online. This is a key ingredient for biological plausibility and implementability (e.g., in neuromorphic hardware). Our network is composed of spiking neurons, further increasing the energetic efficiency and the plausibility of the model. To our knowledge, there are no previous works proposing biologically plausible model-based reinforcement learning in recurrent spiking networks. Our work is a step toward building efficient neuromorphic systems for autonomous robots, capable of learning new skills in real-world environments. Even when the environment is no longer accessible, the robot optimizes learning by "reasoning" in its own "mind". These approaches are of great relevance when acquisition from the environment is slow, expensive (robotics) or unsafe (autonomous driving).
    Adversarial Sample Detection for Speaker Verification by Neural Vocoders. (arXiv:2107.00309v4 [cs.SD] UPDATED)
    Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications. However, ASV is seriously vulnerable to recently emerged adversarial attacks, and effective countermeasures against them are limited. In this paper, we adopt neural vocoders to spot adversarial samples for ASV. We use the neural vocoder to re-synthesize audio and find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples. This effort is, to the best of our knowledge, among the first to pursue such a technical direction for detecting time-domain adversarial samples for ASV, and hence there is a lack of established baselines for comparison. Consequently, we implement the Griffin-Lim algorithm as the detection baseline. The proposed approach achieves effective detection performance that outperforms the baselines in all settings. We also show that the neural vocoder adopted in the detection framework is dataset-independent. Our code will be made open-source so that future works can make fair comparisons.
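    The detection rule itself is compact; in the sketch below, `asv_score` and `vocoder` are hypothetical callables standing in for a trained ASV model and a neural vocoder (not a real API):

        def is_adversarial(audio, enrolled, asv_score, vocoder, threshold):
            resynth = vocoder(audio)  # analysis-synthesis round trip
            gap = abs(asv_score(audio, enrolled) - asv_score(resynth, enrolled))
            return gap > threshold    # large score shifts indicate adversarial input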
    User Localization using RF Sensing: A Performance comparison between LIS and mmWave Radars. (arXiv:2205.10321v1 [eess.SP])
    Since electromagnetic signals are omnipresent, Radio Frequency (RF) sensing has the potential to become a universal sensing mechanism with applications in localization, smart homes, retail, gesture recognition, intrusion detection, etc. Two emerging technologies in RF sensing, namely sensing through Large Intelligent Surfaces (LISs) and mmWave Frequency-Modulated Continuous-Wave (FMCW) radars, have been successfully applied to a wide range of applications. In this work, we compare LIS and mmWave radars for localization in real-world and simulated environments. In our experiments, the mmWave radar achieves 0.71 Intersection Over Union (IOU) and 3 cm error for bounding boxes, while the LIS achieves 0.56 IOU and 10 cm distance error. Although the radar outperforms the LIS in terms of accuracy, the LIS additionally supports communication applications alongside sensing scenarios.
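    For reference, the bounding-box IOU metric used in this comparison, for axis-aligned boxes given as (x1, y1, x2, y2):

        def iou(a, b):
            # intersection extents clamp to zero when the boxes do not overlap
            ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = ix * iy
            area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            union = area(a) + area(b) - inter
            return inter / union if union > 0 else 0.0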
    Visual Concepts Tokenization. (arXiv:2205.10093v1 [cs.CV])
    Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image as a set of disentangled visual concept tokens, with each concept token responding to one type of independent visual concept. In particular, to obtain these concept tokens, we use only cross-attention to extract visual information from the image tokens layer by layer, without self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss to encourage different concept tokens to represent independent visual concepts. The cross-attention and the disentangling loss play the roles of induction and mutual exclusion for the concept tokens, respectively. Extensive experiments on several popular datasets verify the effectiveness of VCT on the tasks of disentangled representation learning and scene decomposition. VCT achieves state-of-the-art results by a large margin.
    The Fellowship of the Dyson Ring: ACT&Friends' Results and Methods for GTOC 11. (arXiv:2205.10124v1 [cs.AI])
    Dyson spheres are hypothetical megastructures encircling stars in order to harvest most of their energy output. During the 11th edition of the GTOC challenge, participants were tasked with a complex trajectory planning related to the construction of a precursor Dyson structure, a heliocentric ring made of twelve stations. To this purpose, we developed several new approaches that synthesize techniques from machine learning, combinatorial optimization, planning and scheduling, and evolutionary optimization effectively integrated into a fully automated pipeline. These include a machine learned transfer time estimator, improving the established Edelbaum approximation and thus better informing a Lazy Race Tree Search to identify and collect asteroids with high arrival mass for the stations; a series of optimally-phased low-thrust transfers to all stations computed by indirect optimization techniques, exploiting the synodic periodicity of the system; and a modified Hungarian scheduling algorithm, which utilizes evolutionary techniques to arrange a mass-balanced arrival schedule out of all transfer possibilities. We describe the steps of our pipeline in detail with a special focus on how our approaches mutually benefit from each other. Lastly, we outline and analyze the final solution of our team, ACT&Friends, which ranked second at the GTOC 11 challenge.
    Lifelong Neural Predictive Coding: Learning Cumulatively Online without Forgetting. (arXiv:1905.10696v3 [cs.LG] UPDATED)
    In lifelong learning systems based on artificial neural networks, one of the biggest obstacles is the inability to retain old knowledge as new information is encountered. This phenomenon is known as catastrophic forgetting. In this paper, we propose a new kind of connectionist architecture, the Sequential Neural Coding Network, that is robust to forgetting when learning from streams of data points and, unlike today's networks, does not learn via the popular back-propagation of errors. Grounded in the neurocognitive theory of predictive processing, our model adapts synapses in a biologically-plausible fashion, while another neural system learns to direct and control this cortex-like structure by mimicking some of the task-executive control functionality of the basal ganglia. In our experiments, we demonstrate that our self-organizing system experiences significantly less forgetting than standard neural models, outperforming a swath of previously proposed methods, including rehearsal/data-buffer-based methods, on both standard (SplitMNIST, Split Fashion MNIST, etc.) and custom benchmarks, even though it is trained in a stream-like fashion. Our work offers evidence that emulating mechanisms in real neuronal systems, e.g., local learning and lateral competition, can yield new directions for tackling the grand challenge of lifelong machine learning.
    Nonlinear Initialization Methods for Low-Rank Neural Networks. (arXiv:2202.00834v3 [cs.LG] UPDATED)
    We propose a novel low-rank initialization framework for training low-rank deep neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. The most successful existing approach, spectral initialization, draws a sample from the initialization distribution for the full-rank setting and then optimally approximates the full-rank initialization parameters in the Frobenius norm with a pair of low-rank initialization matrices via singular value decomposition. Our method is instead inspired by the insight that approximating the function computed by each layer is more important than approximating its parameter values. We provably demonstrate that there is a significant gap between these two approaches for ReLU networks, particularly as the desired rank of the approximating weights decreases, or as the dimension of the inputs to the layer increases (the latter point holds when the network width is super-linear in dimension). Along the way, we provide the first provably efficient algorithm for solving the ReLU low-rank approximation problem for fixed parameter rank $r$ -- previously, it was not even known that the problem was computationally tractable for rank $1$. We also provide a practical algorithm to solve this problem which is no more expensive than the existing spectral initialization approach, and validate our theory by training ResNet and EfficientNet models (He et al., 2016; Tan & Le, 2019) on ImageNet (Russakovsky et al., 2015).
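    For context, a sketch of the spectral-initialization baseline that the paper improves upon — draw a full-rank weight matrix, then split it into rank-r factors with a truncated SVD (the He-style scaling is an assumption made for the example):

        import numpy as np

        def spectral_init(d_out, d_in, r, scale=None):
            scale = scale or np.sqrt(2.0 / d_in)       # assumed fan-in scaling
            W = np.random.randn(d_out, d_in) * scale   # full-rank draw
            U, S, Vt = np.linalg.svd(W, full_matrices=False)
            sqrt_S = np.sqrt(S[:r])
            A = U[:, :r] * sqrt_S                      # d_out x r
            B = (Vt[:r].T * sqrt_S).T                  # r x d_in
            return A, B                                # A @ B is the best rank-r fit to W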
    Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization. (arXiv:2205.10232v1 [cs.LG])
    There is a broad consensus on the importance of deep learning models in tasks involving complex data. Often, an adequate understanding of these models is required when focusing on the transparency of decisions in human-critical applications. Besides other explainability techniques, trustworthiness can be achieved by using counterfactuals, much like the way a human becomes familiar with an unknown process: by understanding the hypothetical circumstances under which the output changes. In this work we argue that automated counterfactual generation should regard several aspects of the produced adversarial instances, not only their adversarial capability. To this end, we present a novel framework for the generation of counterfactual examples, which formulates its goal as a multi-objective optimization problem balancing three different objectives: 1) plausibility, i.e., the likelihood that the counterfactual is possible under the distribution of the input data; 2) the intensity of the changes to the original input; and 3) adversarial power, namely, the variability of the model's output induced by the counterfactual. The framework starts from a target model to be audited and uses a Generative Adversarial Network to model the distribution of input data, together with a multi-objective solver for the discovery of counterfactuals balancing these objectives. The utility of the framework is showcased on six classification tasks comprising image and three-dimensional data. The experiments verify that the framework unveils counterfactuals that comply with intuition, increasing the user's trust and leading to further insights, such as the detection of bias and data misrepresentation.
    Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors. (arXiv:2205.10279v1 [cs.LG])
    Deep learning is increasingly moving towards a transfer learning paradigm whereby large foundation models are fine-tuned on downstream tasks, starting from an initialization learned on the source task. But an initialization contains relatively little information about the source task. Instead, we show that we can learn highly informative posteriors from the source task, through supervised or self-supervised approaches, which then serve as the basis for priors that modify the whole loss surface on the downstream task. This simple modular approach enables significant performance gains and more data-efficient learning on a variety of downstream classification and segmentation tasks, serving as a drop-in replacement for standard pre-training strategies. These highly informative priors also can be saved for future use, similar to pre-trained weights, and stand in contrast to the zero-mean isotropic uninformative priors that are typically used in Bayesian deep learning.
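    In schematic form, such a learned prior enters fine-tuning as an extra term on the downstream loss; the sketch below assumes a diagonal Gaussian prior for brevity (the parameter lists and scaling are illustrative):

        import torch

        def loss_with_prior(task_loss, params, prior_mean, prior_precision, weight=1.0):
            # params, prior_mean, prior_precision: matching lists of tensors;
            # the regularizer is the (scaled) negative log of a diagonal Gaussian.
            reg = 0.0
            for p, mu, prec in zip(params, prior_mean, prior_precision):
                reg = reg + (prec * (p - mu) ** 2).sum()
            return task_loss + 0.5 * weight * reg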
    Heterformer: A Transformer Architecture for Node Representation Learning on Heterogeneous Text-Rich Networks. (arXiv:2205.10282v1 [cs.CL])
    We study node representation learning on heterogeneous text-rich networks, where nodes and edges are multi-typed and some types of nodes are associated with text information. Although recent studies on graph neural networks (GNNs) and pretrained language models (PLMs) have demonstrated their power in encoding network and text signals, respectively, less focus has been given to delicately coupling these two types of models on heterogeneous text-rich networks. Specifically, existing GNNs rarely model text in each node in a contextualized way; existing PLMs can hardly be applied to characterize graph structures due to their sequence architecture. In this paper, we propose Heterformer, a Heterogeneous GNN-nested transformer that blends GNNs and PLMs into a unified model. Different from previous "cascaded architectures" that directly add GNN layers upon a PLM, our Heterformer alternately stacks two modules - a graph-attention-based neighbor aggregation module and a transformer-based text and neighbor joint encoding module - to facilitate thorough mutual enhancement between network and text signals. Meanwhile, Heterformer is capable of characterizing network heterogeneity and nodes without text information. Comprehensive experiments on three large-scale datasets from different domains demonstrate the superiority of Heterformer over state-of-the-art baselines in link prediction, transductive/inductive node classification, node clustering, and semantics-based retrieval.
    Set-based Meta-Interpolation for Few-Task Meta-Learning. (arXiv:2205.09990v1 [cs.LG])
    Meta-learning approaches enable machine learning systems to adapt to new tasks given few examples by leveraging knowledge from related tasks. However, a large number of meta-training tasks are still required for generalization to unseen tasks during meta-testing, which introduces a critical bottleneck for real-world problems that come with only few tasks, due to various reasons including the difficulty and cost of constructing tasks. Recently, several task augmentation methods have been proposed to tackle this issue using domain-specific knowledge to design augmentation techniques to densify the meta-training task distribution. However, such reliance on domain-specific knowledge renders these methods inapplicable to other domains. While Manifold Mixup based task augmentation methods are domain-agnostic, we empirically find them ineffective on non-image domains. To tackle these limitations, we propose a novel domain-agnostic task augmentation method, Meta-Interpolation, which utilizes expressive neural set functions to densify the meta-training task distribution using bilevel optimization. We empirically validate the efficacy of Meta-Interpolation on eight datasets spanning across various domains such as image classification, molecule property prediction, text classification and speech recognition. Experimentally, we show that Meta-Interpolation consistently outperforms all the relevant baselines. Theoretically, we prove that task interpolation with the set function regularizes the meta-learner to improve generalization.
    A Proximal Algorithm for Sampling from Non-convex Potentials. (arXiv:2205.10188v1 [cs.LG])
    We study sampling problems associated with non-convex potentials that additionally lack smoothness. In particular, we consider target distributions that satisfy either a logarithmic-Sobolev inequality or a Poincaré inequality. Rather than smooth, the potentials are assumed to be semi-smooth or a sum of multiple semi-smooth functions. We develop a sampling algorithm for this challenging task that resembles proximal algorithms in optimization. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling in the non-convex and semi-smooth setting. This work extends the recent algorithms of Liang and Chen (2021; 2022) for non-smooth/semi-smooth log-concave distributions to the setting with non-convex potentials. In almost all the sampling cases considered in this work, our proximal sampling algorithm achieves better complexity than all existing methods.
    Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation. (arXiv:2205.09853v1 [cs.CV])
    Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using at most 4 GPUs. https://mask-cond-video-diffusion.github.io
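    The masking recipe is easy to state in code; the sketch below shows how independently dropping past/future conditioning selects among the four tasks (the mask probability `p` is illustrative):

        import random

        def sample_conditioning(past_frames, future_frames, p=0.5):
            cond_past = None if random.random() < p else past_frames      # mask all past
            cond_future = None if random.random() < p else future_frames  # mask all future
            # (past, None):   future prediction   (only future masked)
            # (None, future): past prediction     (only past masked)
            # (None, None):   unconditional generation (both masked)
            # (past, future): interpolation       (neither masked)
            return cond_past, cond_future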
    A New Feature Selection Method for LogNNet and its Application for Diagnosis and Prognosis of COVID-19 Disease Using Routine Blood Values. (arXiv:2205.09974v1 [cs.LG])
    Since February 2020, the world has been engaged in an intense struggle with the COVID-19 disease, and health systems have come under tragic pressure as the disease turned into a pandemic. The aim of this study is to determine the most effective routine blood values (RBV) in the diagnosis and prognosis of COVID-19 using a new feature selection method for the LogNNet reservoir neural network. The first dataset in this study consists of a total of 5296 patients with an equal number of negative and positive COVID tests. The second dataset consists of a total of 3899 patients diagnosed with COVID-19 who were treated in hospital, either severely infected (203) or mildly infected (3696). The most important RBVs affecting the diagnosis of the disease in the first dataset were mean corpuscular hemoglobin concentration (MCHC), mean corpuscular hemoglobin (MCH) and activated partial prothrombin time (aPTT). The most effective features for the prognosis of the disease were erythrocyte sedimentation rate (ESR), neutrophil count (NEU), and C-reactive protein (CRP). The LogNNet model achieved an accuracy of A46 = 99.5% in the diagnosis of the disease with 46 features and A3 = 99.17% with only the MCHC, MCH, and aPTT features. The model reached an accuracy of A48 = 94.4% in determining the prognosis of the disease with 48 features and A3 = 82.7% with only the ESR, NEU, and CRP features. The LogNNet model demonstrated very high diagnostic/prognostic performance for COVID-19 without knowledge of the symptoms or history of the patients. The model is suitable for devices with low resources (3-14 kB of RAM used on the Arduino microcontroller) and is promising for creating mobile health monitoring systems in the Internet of Things. Our method will reduce the negative pressure on the health sector and help doctors understand the pathogenesis of COVID-19 through key features, contributing positively to treatment processes.
    The price of ignorance: how much does it cost to forget noise structure in low-rank matrix estimation? (arXiv:2205.10009v1 [cs.IT])
    We consider the problem of estimating a rank-1 signal corrupted by structured rotationally invariant noise, and address the following question: how well do inference algorithms perform when the noise statistics are unknown and Gaussian noise is therefore assumed? While the matched Bayes-optimal setting with unstructured noise is well understood, the analysis of this mismatched problem is still in its infancy. In this paper, we take a step towards understanding the effect of a strong source of mismatch: the noise statistics. Our main technical contribution is the rigorous analysis of a Bayes estimator and of an approximate message passing (AMP) algorithm, both of which incorrectly assume a Gaussian setup. The first result exploits the theory of spherical integrals and of low-rank matrix perturbations; the idea behind the second one is to design and analyze an artificial AMP which, by taking advantage of the flexibility in the denoisers, is able to "correct" the mismatch. Armed with these sharp asymptotic characterizations, we unveil a rich and often unexpected phenomenology. For example, although AMP is in principle designed to efficiently compute the Bayes estimator, the former is outperformed by the latter in terms of mean-square error. We show that this performance gap is due to an incorrect estimation of the signal norm. In fact, when the SNR is large enough, the overlaps of the AMP and the Bayes estimator coincide, and they even match those of optimal estimators taking into account the structure of the noise.
    Lifelong Personal Context Recognition. (arXiv:2205.10123v1 [cs.AI])
    We focus on the development of AIs that live in lifelong symbiosis with a human. The key prerequisite for this task is that the AI understands - at any moment in time - the personal situational context that the human is in. We outline the key challenges that this task brings forth, namely (i) handling the human-like and ego-centric nature of the user's context, necessary for understanding and providing useful suggestions, (ii) performing lifelong context recognition using machine learning in a way that is robust to change, and (iii) maintaining alignment between the AI's and the human's representations of the world through continual bidirectional interaction. In this short paper, we summarize our recent attempts at tackling these challenges, discuss the lessons learned, and highlight directions of future research. The main take-away message is that pursuing this project requires research that lies at the intersection of knowledge representation and machine learning. Neither technology can achieve this goal without the other.
    Machine Learning for Combinatorial Optimisation of Partially-Specified Problems: Regret Minimisation as a Unifying Lens. (arXiv:2205.10157v1 [cs.LG])
    It is increasingly common to solve combinatorial optimisation problems that are partially-specified. We survey the case where the objective function or the relations between variables are not known or are only partially specified. The challenge is to learn them from available data, while taking into account a set of hard constraints that a solution must satisfy, and that solving the optimisation problem (esp. during learning) is computationally very demanding. This paper overviews four seemingly unrelated approaches, that can each be viewed as learning the objective function of a hard combinatorial optimisation problem: 1) surrogate-based optimisation, 2) empirical model learning, 3) decision-focused learning (`predict + optimise'), and 4) structured-output prediction. We formalise each learning paradigm, at first in the ways commonly found in the literature, and then bring the formalisations together in a compatible way using regret. We discuss the differences and interactions between these frameworks, highlight the opportunities for cross-fertilization and survey open directions.
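    As a point of reference, a common regret notion from the decision-focused ("predict + optimise") setting, sketched for a minimisation problem with true parameters $c$, predictions $\hat{c}$, and $x^*(c)$ an optimal solution under $c$ (the paper's unified formalisation may differ in detail):

        % extra objective cost of optimising with learned parameters instead of true ones
        \mathrm{regret}(\hat{c}, c) \;=\; f\bigl(x^{*}(\hat{c});\, c\bigr) \;-\; f\bigl(x^{*}(c);\, c\bigr) \;\ge\; 0 .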
    Triangulation candidates for Bayesian optimization. (arXiv:2112.07457v2 [stat.CO] UPDATED)
    Bayesian optimization involves "inner optimization" over a new-data acquisition criterion which is non-convex/highly multi-modal, may be non-differentiable, or may otherwise thwart local numerical optimizers. In such cases it is common to replace continuous search with a discrete one over random candidates. Here we propose using candidates based on a Delaunay triangulation of the existing input design. We detail the construction of these "tricands" and demonstrate empirically how they outperform both numerically optimized acquisitions and random candidate-based alternatives, and are well-suited for hybrid schemes, on benchmark synthetic and real simulation experiments.
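    A small sketch of the construction with scipy, using simplex centroids only (the full "tricands" construction adds further candidates; treat this as a simplification):

        import numpy as np
        from scipy.spatial import Delaunay

        def tricand_centroids(X):
            # X: (n, d) previously evaluated inputs; one candidate per simplex
            tri = Delaunay(X)
            return X[tri.simplices].mean(axis=1)

        cands = tricand_centroids(np.random.rand(20, 2))
        # score `cands` with the acquisition criterion instead of random candidates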
    Greedy structure learning from data that contain systematic missing values. (arXiv:2107.04184v3 [cs.LG] UPDATED)
    Learning from data that contain missing values is a common phenomenon in many domains. Relatively few Bayesian Network structure learning algorithms account for missing data, and those that do tend to rely on standard approaches that assume missing data are missing at random, such as the Expectation-Maximisation algorithm. Because missing data are often systematic, there is a need for more pragmatic methods that can effectively deal with data sets containing missing values that are not missing at random. The absence of approaches that deal with systematic missing data impedes the application of BN structure learning methods to real-world problems where missingness is not random. This paper describes three variants of greedy search structure learning that utilise pairwise deletion and inverse probability weighting to maximally leverage the observed data and to limit the potential bias caused by missing values. The first two variants can be viewed as sub-versions of the third and best-performing variant, but are important in their own right in illustrating the successive improvements in learning accuracy. The empirical investigations show that the proposed approach outperforms the commonly used and state-of-the-art Structural EM algorithm, both in terms of learning accuracy and efficiency, and both when data are missing at random and not at random.
    Robot Learning of Mobile Manipulation with Reachability Behavior Priors. (arXiv:2203.04051v2 [cs.RO] UPDATED)
    Mobile Manipulation (MM) systems are ideal candidates for taking up the role of a personal assistant in unstructured real-world environments. Among other challenges, MM requires effective coordination of the robot's embodiments for executing tasks that require both mobility and manipulation. Reinforcement Learning (RL) holds the promise of endowing robots with adaptive behaviors, but most methods require prohibitively large amounts of data for learning a useful control policy. In this work, we study the integration of robotic reachability priors in actor-critic RL methods for accelerating the learning of MM for reaching and fetching tasks. Namely, we consider the problem of optimal base placement and the subsequent decision of whether to activate the arm for reaching a 6D target. For this, we devise a novel Hybrid RL method that handles discrete and continuous actions jointly, resorting to the Gumbel-Softmax reparameterization. Next, we train a reachability prior using data from the operational robot workspace, inspired by classical methods. Subsequently, we derive Boosted Hybrid RL (BHyRL), a novel algorithm for learning Q-functions by modeling them as a sum of residual approximators. Every time a new task needs to be learned, we can transfer our learned residuals and learn the component of the Q-function that is task-specific, hence maintaining the task structure from prior behaviors. Moreover, we find that regularizing the target policy with a prior policy yields more expressive behaviors. We evaluate our method in simulation on reaching and fetching tasks of increasing difficulty, and we show the superior performance of BHyRL against baseline methods. Finally, we transfer our learned 6D fetching policy zero-shot with BHyRL to our MM robot TIAGo++. For more details and code release, please refer to our project site: irosalab.com/rlmmbp
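    The hybrid-action trick can be sketched in a few lines of torch: the discrete head (e.g., whether to activate the arm) is sampled with a differentiable Gumbel-Softmax while the continuous head (base placement) uses a Gaussian reparameterization; the heads and shapes here are illustrative:

        import torch
        import torch.nn.functional as F

        def sample_hybrid(discrete_logits, cont_mean, cont_log_std, tau=1.0):
            one_hot = F.gumbel_softmax(discrete_logits, tau=tau, hard=True)      # discrete choice
            cont = cont_mean + cont_log_std.exp() * torch.randn_like(cont_mean)  # reparameterized Gaussian
            return one_hot, cont  # both parts keep gradients for the actor update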
    Test-time Batch Normalization. (arXiv:2205.10210v1 [cs.LG])
    Deep neural networks often suffer from data distribution shift between training and testing, and batch statistics are observed to reflect this shift. In this paper, aiming to alleviate distribution shift at test time, we revisit batch normalization (BN) in the training process and reveal two key insights that benefit test-time optimization: $(i)$ preserving the same gradient backpropagation form as in training, and $(ii)$ using dataset-level statistics for robust optimization and inference. Based on these two insights, we propose a novel test-time BN layer design, GpreBN, which is optimized during testing by minimizing an entropy loss. We verify the effectiveness of our method on two typical settings with distribution shift, i.e., domain generalization and robustness tasks. Our GpreBN significantly improves test-time performance and achieves state-of-the-art results.
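    A rough sketch of test-time adaptation in this general spirit — updating only the normalization affine parameters by entropy minimization (the specifics of GpreBN's gradient-preserving normalization and dataset-level statistics are omitted here):

        import torch

        def entropy_adapt(model, x, steps=1, lr=1e-3):
            bn_params = [p for m in model.modules()
                         if isinstance(m, torch.nn.BatchNorm2d)
                         for p in (m.weight, m.bias) if p is not None]
            opt = torch.optim.SGD(bn_params, lr=lr)
            for _ in range(steps):
                probs = torch.softmax(model(x), dim=1)
                entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
                opt.zero_grad()
                entropy.backward()
                opt.step()
            with torch.no_grad():
                return model(x)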
    DEMAND: Deep Matrix Approximately Nonlinear Decomposition to Identify Meta, Canonical, and Sub-Spatial Pattern of functional Magnetic Resonance Imaging in the Human Brain. (arXiv:2205.10264v1 [cs.LG])
    Deep Neural Networks (DNNs) have become a crucial computational approach to revealing spatial patterns in the human brain; however, there are three major shortcomings in utilizing DNNs to detect spatial patterns in functional Magnetic Resonance signals: 1) the fully connected architecture increases the complexity of the network structure, making it difficult to optimize and vulnerable to overfitting; 2) the requirement of large training samples results in erasing individual/minor patterns in feature extraction; and 3) the hyperparameters must be tuned manually, which is time-consuming. Therefore, in this work we propose a novel deep nonlinear matrix factorization, Deep Matrix Approximately Nonlinear Decomposition (DEMAND), that combines the advantages of shallow linear models, e.g., Sparse Dictionary Learning (SDL), with those of DNNs. First, DEMAND employs a non-fully connected and multilayer-stacked architecture that is easier to optimize than canonical DNNs; furthermore, thanks to this efficient architecture, training DEMAND can avoid overfitting and enables the recognition of individual/minor features from a small dataset, such as data from a single individual; finally, a novel rank estimator technique is introduced to tune all hyperparameters of DEMAND automatically. Moreover, DEMAND is validated against four peer methodologies on real functional Magnetic Resonance Imaging data from the human brain. In short, the validation results demonstrate that DEMAND can reveal reproducible meta, canonical, and sub-spatial features of the human brain more efficiently than the peer methodologies.
    Planning with Diffusion for Flexible Behavior Synthesis. (arXiv:2205.09991v1 [cs.LG])
    Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.
    Leveraging Relational Information for Learning Weakly Disentangled Representations. (arXiv:2205.10056v1 [cs.LG])
    Disentanglement is a difficult property to enforce in neural representations. This might be due, in part, to a formalization of the disentanglement problem that focuses too heavily on separating relevant factors of variation of the data into single, isolated dimensions of the neural representation. We argue that such a definition might be too restrictive and not necessarily beneficial for downstream tasks. In this work, we present an alternative view of learning (weakly) disentangled representations that leverages concepts from relational learning. We identify the regions of the latent space that correspond to specific instances of generative factors, and we learn the relationships among these regions in order to perform controlled changes to the latent codes. We also introduce a compound generative model that implements such a weak disentanglement approach. Our experiments show that the learned representations can separate the relevant factors of variation in the data while preserving the information needed to effectively generate high-quality data samples.
    Explainable Supervised Domain Adaptation. (arXiv:2205.09943v1 [cs.LG])
    Domain adaptation techniques have contributed to the success of deep learning. Leveraging knowledge from an auxiliary source domain for learning in a label-scarce target domain is fundamental to domain adaptation. While these techniques increase accuracy, the adaptation process, particularly the knowledge leveraged from the source domain, remains unclear. This paper proposes an explainable-by-design supervised domain adaptation framework, XSDA-Net. We integrate a case-based reasoning mechanism into XSDA-Net to explain the prediction of a test instance in terms of similar-looking regions in the source and target training images. We empirically demonstrate the utility of the proposed framework by curating domain adaptation settings on datasets popularly known to exhibit part-based explainability.
    Towards Understanding Grokking: An Effective Theory of Representation Learning. (arXiv:2205.10343v1 [cs.LG])
    We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set. We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion. We find representation learning to occur only in a "Goldilocks zone" (including comprehension and grokking) between memorization and confusion. Compared to the comprehension phase, the grokking phase stays closer to the memorization phase, leading to delayed generalization. The Goldilocks phase is reminiscent of "intelligence from starvation" in Darwinian evolution, where resource limitations drive discovery of more efficient solutions. This study not only provides intuitive explanations of the origin of grokking, but also highlights the usefulness of physics-inspired tools, e.g., effective theories and phase diagrams, for understanding deep learning.
    Constructive Interpretability with CoLabel: Corroborative Integration, Complementary Features, and Collaborative Learning. (arXiv:2205.10011v1 [cs.CV])
    Machine learning models with explainable predictions are increasingly sought after, especially for real-world, mission-critical applications that require bias detection and risk mitigation. Inherent interpretability, where a model is designed from the ground-up for interpretability, provides intuitive insights and transparent explanations on model prediction and performance. In this paper, we present CoLabel, an approach to build interpretable models with explanations rooted in the ground truth. We demonstrate CoLabel in a vehicle feature extraction application in the context of vehicle make-model recognition (VMMR). CoLabel performs VMMR with a composite of interpretable features such as vehicle color, type, and make, all based on interpretable annotations of the ground truth labels. First, CoLabel performs corroborative integration to join multiple datasets that each have a subset of desired annotations of color, type, and make. Then, CoLabel uses decomposable branches to extract complementary features corresponding to desired annotations. Finally, CoLabel fuses them together for final predictions. During feature fusion, CoLabel harmonizes complementary branches so that VMMR features are compatible with each other and can be projected to the same semantic space for classification. With inherent interpretability, CoLabel achieves superior performance to the state-of-the-art black-box models, with accuracy of 0.98, 0.95, and 0.94 on CompCars, Cars196, and BoxCars116K, respectively. CoLabel provides intuitive explanations due to constructive interpretability, and subsequently achieves high accuracy and usability in mission-critical situations.
    Characteristic Neural Ordinary Differential Equations. (arXiv:2111.13207v3 [cs.LG] UPDATED)
    We propose Characteristic-Neural Ordinary Differential Equations (C-NODEs), a framework for extending Neural Ordinary Differential Equations (NODEs) beyond ODEs. While NODEs model the evolution of latent variables as the solution to an ODE, C-NODEs model the evolution of the latent variables as the solution of a family of first-order quasi-linear partial differential equations (PDEs) along curves on which the PDEs reduce to ODEs, referred to as characteristic curves. This in turn allows the application of the standard frameworks for solving ODEs, namely the adjoint method. Learning optimal characteristic curves for given tasks improves performance and computational efficiency compared to state-of-the-art NODE models. We prove that the C-NODE framework extends the classical NODEs on classification tasks by demonstrating explicit C-NODE-representable functions that are not expressible by NODEs. Additionally, we present C-NODE-based continuous normalizing flows, which describe the density evolution of latent variables along multiple dimensions. Empirical results demonstrate the improvements provided by the proposed method for classification and density estimation on CIFAR-10, SVHN, and MNIST datasets under a computational budget similar to existing NODE methods. The results also provide empirical evidence that the learned curves improve the efficiency of the system through a lower number of parameters and function evaluations compared with baselines.
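    To make the reduction concrete, recall the textbook method of characteristics for a first-order quasi-linear PDE $u_t + a(x, u)\,\partial_x u = b(x, u)$ (in C-NODEs the characteristic curves are learned rather than fixed):

        % along curves x(t) solving the first equation, the PDE collapses to ODEs
        \frac{dx}{dt} = a(x, u), \qquad \frac{du}{dt} = b(x, u),

    so that along each characteristic the PDE becomes an ODE system, to which standard NODE machinery such as the adjoint method applies.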
    Towards Extremely Fast Bilevel Optimization with Self-governed Convergence Guarantees. (arXiv:2205.10054v1 [math.OC])
    Gradient methods have become mainstream techniques for Bi-Level Optimization (BLO) in learning and vision fields. The validity of existing works heavily relies on solving a series of approximation subproblems with extraordinarily high accuracy. Unfortunately, achieving this approximation accuracy requires executing a large number of time-consuming iterations, which naturally incurs a heavy computational burden. This paper is thus devoted to addressing this critical computational issue. In particular, we propose a single-level formulation to uniformly understand existing explicit and implicit Gradient-based BLOs (GBLOs). This, together with our designed counter-example, clearly illustrates the fundamental numerical and theoretical issues of GBLOs and their naive accelerations. By introducing the dual multipliers as a new variable, we then establish Bilevel Alternating Gradient with Dual Correction (BAGDC), a general framework that significantly accelerates different categories of existing methods under specific settings. A striking feature of our convergence result is that, compared to the original unaccelerated GBLO versions, the fast BAGDC admits a unified non-asymptotic convergence theory towards stationarity. A variety of numerical experiments demonstrate the superiority of the proposed algorithmic framework.
    Minimal Explanations for Neural Network Predictions. (arXiv:2205.09901v1 [cs.LG])
    Explaining neural network predictions is known to be a challenging problem. In this paper, we propose a novel approach which can be effectively exploited, either in isolation or in combination with other methods, to enhance the interpretability of neural model predictions. For a given input to a trained neural model, our aim is to compute a smallest set of input features so that the model prediction changes when these features are disregarded by setting them to an uninformative baseline value. While computing such minimal explanations is computationally intractable in general for fully-connected neural networks, we show that the problem becomes solvable in polynomial time by a greedy algorithm under mild assumptions on the network's activation functions. We then show that our tractability result extends seamlessly to more advanced neural architectures such as convolutional and graph neural networks. We conduct experiments to showcase the capability of our method for identifying the input features that are essential to the model's prediction.
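    A minimal sketch of the greedy masking idea described above, assuming a generic hard-label classifier predict(x) and an uninformative baseline vector (both hypothetical; this illustrates the approach, not the authors' exact algorithm):

        def greedy_explanation(predict, x, baseline):
            # Mask features one by one (set them to the baseline) until the
            # prediction flips, then prune masked features that turn out to
            # be unnecessary, yielding a small candidate explanation set.
            original = predict(x)
            masked = x.copy()
            chosen = []
            for i in range(len(x)):
                masked[i] = baseline[i]
                chosen.append(i)
                if predict(masked) != original:
                    for j in list(chosen):           # pruning pass
                        masked[j] = x[j]             # tentatively unmask j
                        if predict(masked) != original:
                            chosen.remove(j)         # still flips: j not needed
                        else:
                            masked[j] = baseline[j]  # j is needed, re-mask
                    return chosen
            return chosen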
    Evolutionary Multi-Armed Bandits with Genetic Thompson Sampling. (arXiv:2205.10113v1 [cs.NE])
    As two popular schools of machine learning, online learning and evolutionary computation have become two important driving forces behind real-world decision making engines for applications in biomedicine, economics, and engineering. Although there is prior work that utilizes bandits to improve evolutionary algorithms' optimization process, it remains an open question how evolutionary approaches can help improve the sequential decision making of online learning agents such as multi-armed bandits. In this work, we propose Genetic Thompson Sampling, a bandit algorithm that keeps a population of agents and updates them with genetic principles such as elite selection, crossover, and mutation. Empirical results in multi-armed bandit simulation environments and a practical epidemic control problem suggest that by incorporating the genetic algorithm into the bandit algorithm, our method significantly outperforms the baselines in nonstationary settings. Lastly, we introduce EvoBandit, a web-based interactive visualization to guide readers through the entire learning process and perform lightweight evaluations on the fly. We hope to engage researchers in this growing field of research with this investigation.
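    A rough sketch of the core loop under simplifying assumptions (Beta-Bernoulli Thompson Sampling agents; a believed-best-mean proxy for fitness; the paper's exact fitness signal and genetic operators may differ):

        import numpy as np

        rng = np.random.default_rng(0)
        n_arms, pop = 5, 8
        true_p = rng.uniform(0.1, 0.9, n_arms)   # hidden Bernoulli reward rates
        alphas = np.ones((pop, n_arms))          # per-agent Beta posteriors
        betas = np.ones((pop, n_arms))

        def fitness(a, b):
            # proxy fitness: each agent's believed best mean reward
            return (a / (a + b)).max(axis=1)

        for t in range(2000):
            for k in range(pop):                 # Thompson step per agent
                arm = rng.beta(alphas[k], betas[k]).argmax()
                r = rng.random() < true_p[arm]
                alphas[k, arm] += r
                betas[k, arm] += 1 - r
            if t % 200 == 199:                   # periodic genetic update
                order = np.argsort(-fitness(alphas, betas))
                elite = order[: pop // 2]
                for k in order[pop // 2:]:       # replace the weaker half
                    p1, p2 = rng.choice(elite, 2, replace=False)
                    mix = rng.random(n_arms) < 0.5   # uniform crossover
                    alphas[k] = np.where(mix, alphas[p1], alphas[p2])
                    betas[k] = np.where(mix, betas[p1], betas[p2])
                    # small mutation, kept positive for valid Beta parameters
                    alphas[k] = np.maximum(alphas[k] + rng.normal(0, 0.1, n_arms), 1e-3)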
    Instagram Fake and Automated Account Detection. (arXiv:1910.03090v3 [cs.IR] UPDATED)
    Fake engagement is one of the significant problems in Online Social Networks (OSNs), used to increase the popularity of an account in an inorganic manner. The detection of fake engagement is crucial because it leads to loss of money for businesses, wrong audience targeting in advertising, wrong product prediction systems, and an unhealthy social network environment. This study concerns the detection of fake and automated accounts, which lead to fake engagement on Instagram. Prior to this work, there were no publicly available datasets for fake and automated accounts. For this purpose, two datasets have been published for the detection of fake and automated accounts. For the detection of these accounts, machine learning algorithms like Naive Bayes, Logistic Regression, Support Vector Machines, and Neural Networks are applied. Additionally, for the detection of automated accounts, a cost-sensitive genetic algorithm is proposed to handle the unnatural bias in the dataset. To deal with the unevenness problem in the fake dataset, the SMOTE-NC algorithm is implemented. For the automated and fake account detection datasets, 86% and 96% classification accuracies are obtained, respectively.
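    A hedged sketch of the resampling-plus-classification pipeline on synthetic placeholder features (the published datasets and exact feature set are not reproduced here); SMOTENC is the imbalanced-learn implementation of SMOTE-NC for mixed continuous/categorical data:

        import numpy as np
        from imblearn.over_sampling import SMOTENC
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n = 1000
        X = np.column_stack([
            rng.lognormal(5, 1, n),    # follower count (continuous)
            rng.lognormal(4, 1, n),    # media count (continuous)
            rng.integers(0, 2, n),     # has profile picture (categorical)
            rng.integers(0, 3, n),     # privacy level (categorical)
        ])
        # synthetic minority label: accounts with few posts are more often fake
        y = (rng.random(n) < 1 / (1 + np.exp((X[:, 1] - 30) / 10))).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
        X_res, y_res = SMOTENC(categorical_features=[2, 3],
                               random_state=0).fit_resample(X_tr, y_tr)
        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        print("test accuracy:", clf.score(X_te, y_te))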
    A Fully Controllable Agent in the Path Planning using Goal-Conditioned Reinforcement Learning. (arXiv:2205.09967v1 [cs.AI])
    The aim of path planning is to reach the goal from the starting point by searching for the route of an agent. In path planning, the routes may vary depending on a number of variables, so it is important for the agent to be able to reach various goals. Numerous studies, however, have dealt with a single goal that is predefined by the user. In the present study, I propose a novel reinforcement learning framework for a fully controllable agent in path planning. To do this, I propose bi-directional memory editing to obtain various bi-directional trajectories of the agent, in which the behavior of the agent and the sub-goals are trained with goal-conditioned RL. To move the agent in various directions, I utilize a dedicated sub-goal network, separated from the policy network. Lastly, I present reward shaping to reduce the number of steps the agent needs to reach the goal. In the experimental results, the agent was able to reach various goals that it had never visited during training. I confirmed that the agent could perform difficult missions, such as a round trip, and that with reward shaping the agent used shorter routes.
    Generating Semantic Adversarial Examples via Feature Manipulation. (arXiv:2001.02297v2 [cs.LG] UPDATED)
    The vulnerability of deep neural networks to adversarial attacks has been widely demonstrated (e.g., adversarial example attacks). Traditional attacks perform unstructured pixel-wise perturbation to fool the classifier. An alternative approach is to have perturbations in the latent space. However, such perturbations are hard to control due to the lack of interpretability and disentanglement. In this paper, we propose a more practical adversarial attack by designing structured perturbation with semantic meanings. Our proposed technique manipulates the semantic attributes of images via the disentangled latent codes. The intuition behind our technique is that images in similar domains have some commonly shared but theme-independent semantic attributes, e.g., thickness of lines in handwritten digits, that can be bidirectionally mapped to disentangled latent codes. We generate adversarial perturbation by manipulating a single or a combination of these latent codes and propose two unsupervised semantic manipulation approaches: vector-based disentangled representation and feature map-based disentangled representation, in terms of the complexity of the latent codes and smoothness of the reconstructed images. We conduct extensive experimental evaluations on real-world image data to demonstrate the power of our attacks for black-box classifiers. We further demonstrate the existence of a universal, image-agnostic semantic adversarial example.
    Classification of Intra-Pulse Modulation of Radar Signals by Feature Fusion Based Convolutional Neural Networks. (arXiv:2205.09834v1 [cs.LG])
    Detection and classification of radars based on the pulses they transmit is an important application in electronic warfare systems. In this work, we propose a novel deep learning-based technique that automatically recognizes the intra-pulse modulation types of radar signals. The reassigned spectrogram of the measured radar signal and the detected outliers of its instantaneous phases, filtered by a special function, are used for training multiple convolutional neural networks. Automatically extracted features from the networks are fused to distinguish frequency- and phase-modulated signals. Simulation results show that the proposed FF-CNN (Feature Fusion based Convolutional Neural Network) technique outperforms the current state-of-the-art alternatives and is easily scalable across a broad range of modulation types.
    BayesPCN: A Continually Learnable Predictive Coding Associative Memory. (arXiv:2205.09930v1 [cs.LG])
    Associative memory plays an important role in human intelligence and its mechanisms have been linked to attention in machine learning. While the machine learning community's interest in associative memories has recently been rekindled, most work has focused on memory recall ($read$) over memory learning ($write$). In this paper, we present BayesPCN, a hierarchical associative memory capable of performing continual one-shot memory writes without meta-learning. Moreover, BayesPCN is able to gradually forget past observations ($forget$) to free its memory. Experiments show that BayesPCN can recall corrupted i.i.d. high-dimensional data observed hundreds of "timesteps" ago without a significant drop in recall ability compared to the state-of-the-art offline-learned associative memory models.
    EXODUS: Stable and Efficient Training of Spiking Neural Networks. (arXiv:2205.10242v1 [cs.NE])
    Spiking Neural Networks (SNNs) are gaining significant traction in machine learning tasks where energy-efficiency is of utmost importance. Training such networks using the state-of-the-art back-propagation through time (BPTT) is, however, very time-consuming. Previous work by Shrestha and Orchard [2018] employs an efficient GPU-accelerated back-propagation algorithm called SLAYER, which speeds up training considerably. SLAYER, however, does not take into account the neuron reset mechanism while computing the gradients, which we argue to be the source of numerical instability. To counteract this, SLAYER introduces a gradient scale hyperparameter across layers, which needs manual tuning. In this paper, (i) we modify SLAYER and design an algorithm called EXODUS, that accounts for the neuron reset mechanism and applies the Implicit Function Theorem (IFT) to calculate the correct gradients (equivalent to those computed by BPTT), (ii) we eliminate the need for ad-hoc scaling of gradients, thus, reducing the training complexity tremendously, (iii) we demonstrate, via computer simulations, that EXODUS is numerically stable and achieves a comparable or better performance than SLAYER especially in various tasks with SNNs that rely on temporal features. Our code is available at https://github.com/synsense/sinabs-exodus.
    Sample Complexity of Learning Heuristic Functions for Greedy-Best-First and A* Search. (arXiv:2205.09963v1 [cs.LG])
    Greedy best-first search (GBFS) and A* search (A*) are popular algorithms for path-finding on large graphs. Both use so-called heuristic functions, which estimate how close a vertex is to the goal. While heuristic functions have been handcrafted using domain knowledge, recent studies demonstrate that learning heuristic functions from data is effective in many applications. Motivated by this emerging approach, we study the sample complexity of learning heuristic functions for GBFS and A*. We build on a recent framework called \textit{data-driven algorithm design} and evaluate the \textit{pseudo-dimension} of a class of utility functions that measure the performance of parameterized algorithms. Assuming that a vertex set of size $n$ is fixed, we present $\mathrm{O}(n\lg n)$ and $\mathrm{O}(n^2\lg n)$ upper bounds on the pseudo-dimensions for GBFS and A*, respectively, parameterized by heuristic function values. The upper bound for A* can be improved to $\mathrm{O}(n^2\lg d)$ if every vertex has a degree of at most $d$ and to $\mathrm{O}(n \lg n)$ if edge weights are integers bounded by $\mathrm{poly}(n)$. We also give $\Omega(n)$ lower bounds for GBFS and A*, which imply that our bounds for GBFS and A* under the integer-weight condition are tight up to a $\lg n$ factor. Finally, we discuss a case where the performance of A* is measured by the suboptimality and show that we can sometimes obtain a better guarantee by combining a parameter-dependent worst-case bound with a sample complexity bound.
    Towards Explanation for Unsupervised Graph-Level Representation Learning. (arXiv:2205.09934v1 [cs.LG])
    Due to the superior performance of Graph Neural Networks (GNNs) in various domains, there is an increasing interest in the GNN explanation problem "\emph{which fraction of the input graph is the most crucial to decide the model's decision?}" Existing explanation methods focus on supervised settings, e.g., node classification and graph classification, while the explanation for unsupervised graph-level representation learning is still unexplored. The opaqueness of the graph representations may lead to unexpected risks when deployed for high-stakes decision-making scenarios. In this paper, we advance the Information Bottleneck principle (IB) to tackle the proposed explanation problem for unsupervised graph representations, which leads to a novel principle, \textit{Unsupervised Subgraph Information Bottleneck} (USIB). We also theoretically analyze the connection between graph representations and explanatory subgraphs on the label space, which reveals that the expressiveness and robustness of representations benefit the fidelity of explanatory subgraphs. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our developed explainer and the validity of our theoretical analysis.
    Towards Consistency in Adversarial Classification. (arXiv:2205.10022v1 [cs.LG])
    In this paper, we study the problem of consistency in the context of adversarial examples. Specifically, we tackle the following question: can surrogate losses still be used as a proxy for minimizing the $0/1$ loss in the presence of an adversary that alters the inputs at test-time? Different from the standard classification task, this question cannot be reduced to a point-wise minimization problem, and calibration need not be sufficient to ensure consistency. In this paper, we expose some pathological behaviors specific to the adversarial problem, and show that no convex surrogate loss can be consistent or calibrated in this context. It is therefore necessary to design another class of surrogate functions that can be used to solve the adversarial consistency issue. As a first step towards designing such a class, we identify sufficient and necessary conditions for a surrogate loss to be calibrated in both the adversarial and standard settings. Finally, we give some directions for building a class of losses that could be consistent in the adversarial framework.
    Bayesian Active Learning with Fully Bayesian Gaussian Processes. (arXiv:2205.10186v1 [cs.LG])
    The bias-variance trade-off is a well-known problem in machine learning that only gets more pronounced the less data there is available. In active learning, where labeled data is scarce or difficult to obtain, neglecting this trade-off can cause inefficient and non-optimal querying, leading to unnecessary data labeling. In this paper, we focus on active learning with Gaussian Processes (GPs). For the GP, the bias-variance trade-off is made by optimization of two hyperparameters: the length scale and the noise term. Considering that the optimal mode of the joint posterior of the hyperparameters is equivalent to the optimal bias-variance trade-off, we approximate this joint posterior and utilize it to design two new acquisition functions. The first one is a Bayesian variant of Query-by-Committee (B-QBC), and the second is an extension that explicitly minimizes the predictive variance through a Query by Mixture of Gaussian Processes (QB-MGP) formulation. Across six common simulators, we empirically show that B-QBC, on average, achieves the best marginal likelihood, whereas QB-MGP achieves the best predictive performance. We show that incorporating the bias-variance trade-off in the acquisition functions mitigates unnecessary and expensive data labeling.
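    A simplified sketch of the committee-disagreement idea behind B-QBC, with a few fixed kernel hyperparameters standing in for draws from the joint posterior over length scale and noise:

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, WhiteKernel

        rng = np.random.default_rng(0)
        X = rng.uniform(-2, 2, (10, 1))                 # labeled points
        y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(10)
        pool = rng.uniform(-2, 2, (200, 1))             # unlabeled candidates

        # committee: GPs whose fixed hyperparameters stand in for samples
        # from the hyperparameter posterior (optimizer=None keeps them fixed)
        committee = [
            GaussianProcessRegressor(RBF(ls) + WhiteKernel(nv),
                                     optimizer=None).fit(X, y)
            for ls, nv in [(0.3, 0.01), (0.6, 0.05), (1.2, 0.1)]
        ]
        means = np.stack([gp.predict(pool) for gp in committee])
        query = pool[means.var(axis=0).argmax()]        # maximal disagreement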
    SafeNet: Mitigating Data Poisoning Attacks on Private Machine Learning. (arXiv:2205.09986v1 [cs.CR])
    Secure multiparty computation (MPC) has been proposed to allow multiple mutually distrustful data owners to jointly train machine learning (ML) models on their combined data. However, the datasets used for training ML models might be under the control of an adversary mounting a data poisoning attack, and MPC prevents inspecting training sets to detect poisoning. We show that multiple MPC frameworks for private ML training are susceptible to backdoor and targeted poisoning attacks. To mitigate this, we propose SafeNet, a framework for building ensemble models in MPC with formal guarantees of robustness to data poisoning attacks. We extend the security definition of private ML training to account for poisoning and prove that our SafeNet design satisfies the definition. We demonstrate SafeNet's efficiency, accuracy, and resilience to poisoning on several machine learning datasets and models. For instance, SafeNet reduces backdoor attack success from 100% to 0% for a neural network model, while achieving 39x faster training and 36x less communication than the four-party MPC framework of Dalskov et al.
    Transformer with Memory Replay. (arXiv:2205.09869v1 [cs.LG])
    Transformers achieve state-of-the-art performance for natural language processing tasks by pre-training on large-scale text corpora. They are extremely compute-intensive and have very high sample complexity. Memory replay is a mechanism that remembers and reuses past examples by saving to and replaying from a memory buffer. It has been successfully used in reinforcement learning and GANs due to better sample efficiency. In this paper, we propose \emph{Transformer with Memory Replay} (TMR), which integrates memory replay with the transformer, making the transformer more sample-efficient. Experiments on the GLUE and SQuAD benchmark datasets show that Transformer with Memory Replay achieves at least a $1\%$-point increase compared to the baseline transformer model when pretrained with the same number of examples. Further, by adopting a careful design that reduces the wall-clock time overhead of memory replay, we also empirically achieve better runtime efficiency.
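    A minimal sketch of a replay buffer mixed into pretraining batches (a generic mechanism under assumed capacity and replay-fraction settings, not necessarily TMR's exact save/replay policy):

        import random

        class MemoryReplay:
            def __init__(self, capacity=10_000, replay_fraction=0.25):
                self.buf, self.capacity = [], capacity
                self.replay_fraction = replay_fraction

            def augment(self, batch):
                # draw some past examples, then remember the fresh ones
                k = int(len(batch) * self.replay_fraction)
                replayed = random.sample(self.buf, k) if len(self.buf) >= k else []
                for ex in batch:
                    if len(self.buf) < self.capacity:
                        self.buf.append(ex)
                    else:   # reservoir-style overwrite once the buffer is full
                        self.buf[random.randrange(self.capacity)] = ex
                return batch + replayed      # train on fresh + replayed examples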
    Diverse super-resolution with pretrained deep hierarchical VAEs. (arXiv:2205.10347v1 [cs.CV])
    Image super-resolution is a one-to-many problem, but most deep-learning based methods only provide one single solution to this problem. In this work, we tackle the problem of diverse super-resolution by reusing VD-VAE, a state-of-the-art variational autoencoder (VAE). We find that the hierarchical latent representation learned by VD-VAE naturally separates the image low-frequency information, encoded in the latent groups at the top of the hierarchy, from the image high-frequency details, determined by the latent groups at the bottom of the latent hierarchy. Starting from this observation, we design a super-resolution model exploiting the specific structure of the VD-VAE latent space. Specifically, we train an encoder to encode low-resolution images in the subset of the VD-VAE latent space encoding the low-frequency information, and we combine this encoder with the VD-VAE generative model to sample diverse super-resolved versions of a low-resolution input. We demonstrate the ability of our method to generate diverse solutions to the super-resolution problem on face super-resolution with upsampling factors x4, x8, and x16.  ( 2 min )
    On pseudo-absence generation and machine learning for locust breeding ground prediction in Africa. (arXiv:2111.03904v2 [cs.LG] UPDATED)
    Desert locust outbreaks threaten the food security of a large part of Africa and have affected the livelihoods of millions of people over the years. Machine learning (ML) has been demonstrated as an effective approach to locust distribution modelling, which could assist in early warning. ML requires a significant amount of labelled data to train. Most publicly available labelled data on locusts are presence-only data, where only the sightings of locusts being present at a location are recorded. Therefore, prior work using ML has resorted to pseudo-absence generation methods as a way to circumvent this issue. The most commonly used approach is to randomly sample points in a region of interest while ensuring that these sampled pseudo-absence points are at least a specific distance away from true presence points. In this paper, we compare this random sampling approach to more advanced pseudo-absence generation methods, such as environmental profiling and optimal background extent limitation, specifically for predicting desert locust breeding grounds in Africa. Interestingly, we find that for the algorithms we tested, namely logistic regression (LR), gradient boosting, random forests, and maximum entropy, all popular in prior work, the logistic model performed significantly better than the more sophisticated ensemble methods, both in terms of prediction accuracy and F1 score. Although background extent limitation combined with random sampling boosted performance for the ensemble methods, for LR this was not the case, and instead a significant improvement was obtained when using environmental profiling. In light of this, we conclude that a simpler ML approach such as logistic regression combined with more advanced pseudo-absence generation, specifically environmental profiling, can be a sensible and effective approach to predicting locust breeding grounds across Africa.  ( 3 min )
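    A minimal sketch of the random-sampling baseline with a distance buffer (coordinates, region, and the minimum distance are illustrative):

        import numpy as np

        def random_pseudo_absences(presence, n, region, min_dist=0.5, seed=0):
            # sample n points uniformly in a bounding box, rejecting any that
            # fall closer than min_dist to a true presence point
            rng = np.random.default_rng(seed)
            (xmin, xmax), (ymin, ymax) = region
            out = []
            while len(out) < n:
                p = rng.uniform([xmin, ymin], [xmax, ymax])
                if np.min(np.linalg.norm(presence - p, axis=1)) >= min_dist:
                    out.append(p)
            return np.array(out)

        presence = np.array([[1.0, 2.0], [1.5, 2.5], [3.0, 1.0]])  # toy sightings
        absences = random_pseudo_absences(presence, 100, ((0, 5), (0, 5)))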
    Recurrent segmentation meets block models in temporal networks. (arXiv:2205.09862v1 [cs.SI])
    A popular approach to model interactions is to represent them as a network with nodes being the agents and the interactions being the edges. Interactions are often timestamped, which leads to having timestamped edges. Many real-world temporal networks have a recurrent or possibly cyclic behaviour. For example, social network activity may be heightened during certain hours of the day. In this paper, our main interest is to model recurrent activity in such temporal networks. As a starting point we use the stochastic block model, a popular choice for modelling static networks, where nodes are split into $R$ groups. We extend this model to temporal networks by modelling the edges with a Poisson process. We make the parameters of the process dependent on time by segmenting the time line into $K$ segments. To enforce the recurring activity we require that only $H < K$ different sets of parameters can be used, that is, several, not necessarily consecutive, segments must share their parameters. We prove that searching for optimal blocks and segmentation is an NP-hard problem. Consequently, we split the problem into three subproblems where we optimize blocks, model parameters, and segmentation in turn while keeping the remaining structures fixed. We propose an iterative algorithm that requires $O(KHm + Rn + R^2H)$ time per iteration, where $n$ and $m$ are the number of nodes and edges in the network. We demonstrate experimentally that the number of required iterations is typically low, that the algorithm is able to discover the ground truth from synthetic datasets, and show that certain real-world networks exhibit recurrent behaviour as the likelihood does not deteriorate when $H$ is lowered.  ( 2 min )
    Cross Reconstruction Transformer for Self-Supervised Time Series Representation Learning. (arXiv:2205.09928v1 [cs.LG])
    Unsupervised/self-supervised representation learning in time series is critical since labeled samples are usually scarce in real-world scenarios. Existing approaches mainly leverage the contrastive learning framework, which automatically learns to understand similar and dissimilar data pairs. Nevertheless, they are restricted by prior knowledge about constructing pairs, cumbersome sampling policies, and unstable performance when encountering sampling bias. Also, few works have focused on effectively modeling temporal-spectral relations to extend the capacity of representations. In this paper, we aim at learning representations for time series from a new perspective and propose Cross Reconstruction Transformer (CRT) to solve the aforementioned problems in a unified way. CRT achieves time series representation learning through a cross-domain dropping-reconstruction task. Specifically, we transform time series into the frequency domain and randomly drop certain parts in both the time and frequency domains. Dropping can maximally preserve the global context compared to cropping and masking. Then a transformer architecture is utilized to adequately capture the cross-domain correlations between temporal and spectral information through reconstructing data in both domains, which is called Dropped Temporal-Spectral Modeling. To discriminate the representations in the global latent space, we propose an Instance Discrimination Constraint to reduce the mutual information between different time series and sharpen the decision boundaries. Additionally, we propose a specified curriculum learning strategy to optimize the CRT, which progressively increases the dropping ratio in the training process.  ( 2 min )
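    A rough sketch of the cross-domain dropping step on a single 1-D series (a contiguous chunk dropped in time, random bins dropped in frequency); the exact dropping policy and the transformer reconstruction are omitted:

        import numpy as np

        def cross_domain_drop(x, drop_ratio=0.3, seed=0):
            # x: 1-D float array; returns two corrupted views for a model to
            # reconstruct, one dropped in time and one in frequency
            rng = np.random.default_rng(seed)
            n = len(x)
            t = x.copy()
            start = rng.integers(0, n)
            t[start:start + int(drop_ratio * n)] = 0.0   # time-domain drop
            spec = np.fft.rfft(x)
            spec[rng.random(len(spec)) < drop_ratio] = 0.0  # frequency drop
            f = np.fft.irfft(spec, n)
            return t, f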
    ClusterEA: Scalable Entity Alignment with Stochastic Training and Normalized Mini-batch Similarities. (arXiv:2205.10312v1 [cs.DB])
    Entity alignment (EA) aims at finding equivalent entities in different knowledge graphs (KGs). Embedding-based approaches have dominated the EA task in recent years. Those methods face problems that come from the geometric properties of embedding vectors, including hubness and isolation. To solve these geometric problems, many normalization approaches have been adopted for EA. However, the increasing scale of KGs makes it hard for EA models to adopt the normalization processes, thus limiting their usage in real-world applications. To tackle this challenge, we present ClusterEA, a general framework that is capable of scaling up EA models and enhancing their results by leveraging normalization methods on mini-batches with a high entity equivalent rate. ClusterEA contains three components to align entities between large-scale KGs, including stochastic training, ClusterSampler, and SparseFusion. It first trains a large-scale Siamese GNN for EA in a stochastic fashion to produce entity embeddings. Based on the embeddings, a novel ClusterSampler strategy is proposed for sampling highly overlapped mini-batches. Finally, ClusterEA incorporates SparseFusion, which normalizes local and global similarity and then fuses all similarity matrices to obtain the final similarity matrix. Extensive experiments with real-life datasets on EA benchmarks offer insight into the proposed framework, and suggest that it is capable of outperforming the state-of-the-art scalable EA framework by up to 8 times in terms of Hits@1.  ( 2 min )
    On the SDEs and Scaling Rules for Adaptive Gradient Algorithms. (arXiv:2205.10287v1 [cs.LG])
    Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.  ( 2 min )
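    In practice the rule is a one-line adjustment; a sketch with illustrative numbers (only the learning-rate part of the rule is shown):

        base_batch_size, base_lr = 256, 1e-3   # tuned reference configuration
        new_batch_size = 1024                  # scaled-up batch
        kappa = new_batch_size / base_batch_size
        new_lr = base_lr * kappa ** 0.5        # sqrt rule: 2e-3 for a 4x batch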
    HeadText: Exploring Hands-free Text Entry using Head Gestures by Motion Sensing on a Smart Earpiece. (arXiv:2205.09978v1 [cs.HC])
    We present HeadText, a hands-free technique on a smart earpiece for text entry by motion sensing. Users input text utilizing only 7 head gestures for key selection, word selection, word commitment, and word cancelling tasks. Head gesture recognition is supported by motion sensing on a smart earpiece to capture head-moving signals, together with machine learning algorithms (K-Nearest-Neighbor (KNN) with a Dynamic Time Warping (DTW) distance measurement). A 10-participant user study showed that HeadText could recognize the 7 head gestures at an accuracy of 94.29%. A second user study showed that HeadText achieved a maximum text entry speed of 10.65 WPM and an average of 9.84 WPM. Finally, we demonstrate potential applications of HeadText in hands-free scenarios for (a) text entry by people with motor impairments, (b) private text entry, and (c) socially acceptable text entry.  ( 2 min )
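    A minimal sketch of KNN with a DTW distance on 1-D sequences (real head-gesture signals are multi-axis motion streams, for which per-axis DTW distances could be summed):

        import numpy as np

        def dtw(a, b):
            # classic O(len(a) * len(b)) dynamic time warping distance
            D = np.full((len(a) + 1, len(b) + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, len(a) + 1):
                for j in range(1, len(b) + 1):
                    cost = abs(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[-1, -1]

        def knn_classify(query, templates, labels, k=3):
            dists = [dtw(query, t) for t in templates]
            nearest = np.argsort(dists)[:k]
            votes = [labels[i] for i in nearest]
            return max(set(votes), key=votes.count)   # majority vote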
    Interpolating Compressed Parameter Subspaces. (arXiv:2205.09891v1 [cs.LG])
    Inspired by recent work on neural subspaces and mode connectivity, we revisit parameter subspace sampling for shifted and/or interpolatable input distributions (instead of a single, unshifted distribution). We enforce a compressed geometric structure upon a set of trained parameters mapped to a set of train-time distributions, denoting the resulting subspaces as Compressed Parameter Subspaces (CPS). We show the success and failure modes of the types of shifted distributions whose optimal parameters reside in the CPS. We find that ensembling point-estimates within a CPS can yield a high average accuracy across a range of test-time distributions, including backdoor, adversarial, permutation, stylization and rotation perturbations. We also find that the CPS can contain low-loss point-estimates for various task shifts (albeit interpolated, perturbed, unseen or non-identical coarse labels). We further demonstrate this property in a continual learning setting with CIFAR100.  ( 2 min )
    ExMo: Explainable AI Model using Inverse Frequency Decision Rules. (arXiv:2205.10045v1 [cs.AI])
    In this paper, we present a novel method to compute decision rules to build a more accurate interpretable machine learning model, denoted as ExMo. The ExMo interpretable machine learning model consists of a list of IF...THEN... statements with a decision rule in the condition. This way, ExMo naturally provides an explanation for a prediction using the decision rule that was triggered. ExMo uses a new approach to extract decision rules from the training data using term frequency-inverse document frequency (TF-IDF) features. With TF-IDF, decision rules with feature values that are more relevant to each class are extracted. Hence, the decision rules obtained by ExMo can distinguish the positive and negative classes better than the decision rules used in the existing Bayesian Rule List (BRL) algorithm, obtained using the frequent pattern mining approach. The paper also shows that ExMo learns a qualitatively better model than BRL. Furthermore, ExMo demonstrates that the textual explanation can be provided in a human-friendly way so that the explanation can be easily understood by non-expert users. We validate ExMo on several datasets with different sizes to evaluate its efficacy. Experimental validation on a real-world fraud detection application shows that ExMo is 20% more accurate than BRL and that it achieves accuracy similar to those of deep learning models.  ( 2 min )
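    A rough sketch of the TF-IDF intuition on toy discretized transactions: feature=value tokens are grouped into one document per class, and high TF-IDF terms within a class suggest candidate rule conditions (this illustrates the scoring idea only, not the full ExMo rule miner):

        from sklearn.feature_extraction.text import TfidfVectorizer

        rows = [
            {"amount": "high", "country": "new", "hour": "night"},   # fraud
            {"amount": "high", "country": "new", "hour": "day"},     # fraud
            {"amount": "low", "country": "usual", "hour": "day"},    # legit
            {"amount": "low", "country": "usual", "hour": "night"},  # legit
        ]
        labels = [1, 1, 0, 0]
        # one "document" of feature=value tokens per class
        docs_by_class = {
            c: " ".join(f"{k}={v}" for i, r in enumerate(rows)
                        if labels[i] == c for k, v in r.items())
            for c in sorted(set(labels))
        }
        vec = TfidfVectorizer(token_pattern=r"\S+")
        tfidf = vec.fit_transform([docs_by_class[0], docs_by_class[1]])
        terms = vec.get_feature_names_out()
        top_fraud = tfidf[1].toarray().ravel().argsort()[::-1][:3]
        print([terms[i] for i in top_fraud])   # candidate fraud conditions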
    Visualizing and Explaining Language Models. (arXiv:2205.10238v1 [cs.CL])
    During the last decade, Natural Language Processing has become, after Computer Vision, the second field of Artificial Intelligence that was massively changed by the advent of Deep Learning. Regardless of the architecture, the language models of the day need to be able to process or generate text, as well as predict missing words, sentences or relations depending on the task. Due to their black-box nature, such models are difficult to interpret and explain to third parties. Visualization is often the bridge that language model designers use to explain their work, since coloring of salient words and phrases, clustering, or neuron activations can be used to quickly understand the underlying models. This paper showcases the techniques used in some of the most popular Deep Learning for NLP visualizations, with a special focus on interpretability and explainability.  ( 2 min )
    Understanding and Mitigating the Uncertainty in Zero-Shot Translation. (arXiv:2205.10068v1 [cs.CL])
    Zero-shot translation is a promising direction for building a comprehensive multilingual neural machine translation (MNMT) system. However, its quality is still not satisfactory due to off-target issues. In this paper, we aim to understand and alleviate the off-target issues from the perspective of uncertainty in zero-shot translation. By carefully examining the translation output and model confidence, we identify two uncertainties that are responsible for the off-target issues, namely, extrinsic data uncertainty and intrinsic model uncertainty. Based on the observations, we propose two light-weight and complementary approaches to denoise the training data for model training, and mask out the vocabulary of the off-target languages in inference. Extensive experiments on both balanced and unbalanced datasets show that our approaches significantly improve the performance of zero-shot translation over strong MNMT baselines. Qualitative analyses provide insights into where our approaches reduce off-target translations.  ( 2 min )
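    A minimal sketch of the inference-time vocabulary masking, assuming a precomputed set of token ids belonging to the target language:

        import torch

        def mask_off_target(logits, target_ids):
            # keep only logits of the target language's tokens (plus any
            # special tokens included in target_ids); everything else -> -inf
            masked = torch.full_like(logits, float("-inf"))
            masked[..., target_ids] = logits[..., target_ids]
            return masked

        logits = torch.randn(1, 32000)          # one decoding step, toy vocab
        target_ids = torch.arange(500, 9000)    # hypothetical target-language ids
        next_token = mask_off_target(logits, target_ids).argmax(-1)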
    Estimating the frame potential of large-scale quantum circuit sampling using tensor networks up to 50 qubits. (arXiv:2205.09900v1 [quant-ph])
    We develop numerical protocols for estimating the frame potential, the 2-norm distance between a given ensemble and the exact Haar randomness, using the \texttt{QTensor} platform. Our tensor-network-based algorithm has polynomial complexity for shallow circuits and is high performing using CPU and GPU parallelism. We apply the above methods to two problems: the Brown-Susskind conjecture, with local and parallel random circuits in terms of the Haar distance and the approximate $k$-design properties of the hardware efficient ans{\"a}tze in quantum machine learning, which induce the barren plateau problem. We estimate frame potentials with these ensembles up to 50 qubits and $k=5$, examine the Haar distance of the hardware-efficient ans{\"a}tze, and verify the Brown-Susskind conjecture numerically. Our work shows that large-scale tensor network simulations could provide important hints toward open problems in quantum information science.  ( 2 min )
    FairNorm: Fair and Fast Graph Neural Network Training. (arXiv:2205.09977v1 [cs.LG])
    Graph neural networks (GNNs) have been demonstrated to achieve state-of-the-art performance for a number of graph-based learning tasks, which has led to a rise in their employment in various domains. However, it has been shown that GNNs may inherit and even amplify bias within training data, which leads to unfair results towards certain sensitive groups. Meanwhile, training of GNNs introduces additional challenges, such as slow convergence and possible instability. Faced with these limitations, this work proposes FairNorm, a unified normalization framework that reduces the bias in GNN-based learning while also providing provably faster convergence. Specifically, FairNorm employs fairness-aware normalization operators over different sensitive groups with learnable parameters to reduce the bias in GNNs. The design of FairNorm is built upon analyses that illuminate the sources of bias in graph-based learning. Experiments on node classification over real-world networks demonstrate the efficiency of the proposed scheme in improving fairness in terms of statistical parity and equal opportunity compared to fairness-aware baselines. In addition, it is empirically shown that the proposed framework leads to faster convergence compared to the naive baseline where no normalization is employed.  ( 2 min )
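    A hedged sketch of what a group-wise, fairness-aware normalization with learnable parameters can look like (an illustration of the general idea, not the paper's exact FairNorm operator):

        import torch
        import torch.nn as nn

        class GroupwiseNorm(nn.Module):
            # normalize node embeddings separately within each sensitive
            # group, with a learnable scale/shift per group
            def __init__(self, dim, n_groups):
                super().__init__()
                self.gamma = nn.Parameter(torch.ones(n_groups, dim))
                self.beta = nn.Parameter(torch.zeros(n_groups, dim))

            def forward(self, h, group):     # h: [N, dim], group: [N] ints
                out = torch.empty_like(h)
                for g in range(self.gamma.size(0)):
                    idx = group == g
                    if idx.any():
                        z = (h[idx] - h[idx].mean(0)) \
                            / (h[idx].std(0, unbiased=False) + 1e-5)
                        out[idx] = z * self.gamma[g] + self.beta[g]
                return out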
    Swapping Semantic Contents for Mixing Images. (arXiv:2205.10158v1 [cs.CV])
    Deep architectures have proven capable of solving many tasks provided a sufficient amount of labeled data. In fact, the amount of available labeled data has become the principal bottleneck in low-label settings such as Semi-Supervised Learning. Mixing Data Augmentations do not typically yield new labeled samples, as indiscriminately mixing contents creates between-class samples. In this work, we introduce SciMix, a framework that learns a generator to embed a semantic style code into image backgrounds, yielding a new mixing scheme for data augmentation. We then demonstrate that SciMix yields novel mixed samples that inherit many characteristics from their non-semantic parents. Afterwards, we verify that those samples can be used to improve the performance of semi-supervised frameworks like Mean Teacher or FixMatch, and even of fully supervised learning on a small labeled dataset.  ( 2 min )
    On Calibration of Ensemble-Based Credal Predictors. (arXiv:2205.10082v1 [stat.ML])
    In recent years, several classification methods that intend to quantify epistemic uncertainty have been proposed, either by producing predictions in the form of second-order distributions or sets of probability distributions. In this work, we focus on the latter, also called credal predictors, and address the question of how to evaluate them: What does it mean that a credal predictor represents epistemic uncertainty in a faithful manner? To answer this question, we refer to the notion of calibration of probabilistic predictors and extend it to credal predictors. Broadly speaking, we call a credal predictor calibrated if it returns sets that cover the true conditional probability distribution. To verify this property for the important case of ensemble-based credal predictors, we propose a novel nonparametric calibration test that generalizes an existing test for probabilistic predictors to the case of credal predictors. Making use of this test, we empirically show that credal predictors based on deep neural networks are often not well calibrated.  ( 2 min )
    Trend analysis and forecasting air pollution in Rwanda. (arXiv:2205.10024v1 [stat.ML])
    Air pollution is a major public health problem worldwide, although the lack of data is a global issue for most low and middle income countries. Ambient air pollution in the form of fine particulate matter (PM2.5) exceeds the World Health Organization guidelines in Rwanda, with a daily average of around 42.6 micrograms per cubic meter. Monitoring and mitigation strategies require an expensive investment in equipment to collect pollution data. Low-cost sensor technology and machine learning methods have appeared as an alternative solution to get reliable information for decision making. This paper analyzes the trend of air pollution in Rwanda and proposes forecasting models suitable to data collected by a network of low-cost sensors deployed in Rwanda.  ( 2 min )
    Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome. (arXiv:2205.09906v1 [stat.ML])
    Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. In addition, our data augmentations enable us to define a novel contrastive learning model, which improves on previous representation learning approaches for microbiome compositional data. Our code is available at https://github.com/cunningham-lab/AugCoDa.  ( 2 min )
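    A minimal sketch of one simplex-respecting augmentation consistent with the Aitchison geometry: mixing two compositions by componentwise powering and re-closure (zeros would need pseudo-counts first; the paper's augmentation set is broader):

        import numpy as np

        def closure(x):
            return x / x.sum(axis=-1, keepdims=True)   # project onto the simplex

        def aitchison_mixup(x1, x2, lam):
            # componentwise powering and product, then re-closure: the
            # analogue of linear interpolation in the Aitchison geometry
            return closure(x1 ** lam * x2 ** (1.0 - lam))

        p = closure(np.array([0.7, 0.2, 0.1]))   # relative abundances (no zeros;
        q = closure(np.array([0.1, 0.3, 0.6]))   # real data needs pseudo-counts)
        aug = aitchison_mixup(p, q, lam=np.random.default_rng(0).beta(2, 2))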
    Conformal Prediction with Temporal Quantile Adjustments. (arXiv:2205.09940v1 [stat.ML])
    We develop Temporal Quantile Adjustment (TQA), a general method to construct efficient and valid prediction intervals (PIs) for regression on cross-sectional time series data. Such data is common in many domains, including econometrics and healthcare. A canonical example in healthcare is predicting patient outcomes using physiological time-series data, where a population of patients composes a cross-section. Reliable PI estimators in this setting must address two distinct notions of coverage: cross-sectional coverage across a cross-sectional slice, and longitudinal coverage along the temporal dimension for each time series. Recent works have explored adapting Conformal Prediction (CP) to obtain PIs in the time series context. However, none handles both notions of coverage simultaneously. CP methods typically query a pre-specified quantile from the distribution of nonconformity scores on a calibration set. TQA adjusts the quantile to query in CP at each time $t$, accounting for both cross-sectional and longitudinal coverage in a theoretically-grounded manner. The post-hoc nature of TQA facilitates its use as a general wrapper around any time series regression model. We validate TQA's performance through extensive experimentation: TQA generally obtains efficient PIs and improves longitudinal coverage while preserving cross-sectional coverage.  ( 2 min )
    A General Framework for quantifying Aleatoric and Epistemic uncertainty in Graph Neural Networks. (arXiv:2205.09968v1 [cs.LG])
    Graph Neural Networks (GNN) provide a powerful framework that elegantly integrates graph theory with machine learning for modeling and analysis of networked data. We consider the problem of quantifying the uncertainty in predictions of GNN stemming from modeling errors and measurement uncertainty. We consider aleatoric uncertainty in the form of probabilistic links and noise in the feature vector of nodes, while epistemic uncertainty is incorporated via a probability distribution over the model parameters. We propose a unified approach to treat both sources of uncertainty in a Bayesian framework, where Assumed Density Filtering is used to quantify aleatoric uncertainty and Monte Carlo dropout captures uncertainty in model parameters. Finally, the two sources of uncertainty are aggregated to estimate the total uncertainty in predictions of a GNN. Results on real-world datasets demonstrate that the Bayesian model performs on par with a frequentist model and provides additional information about prediction uncertainty that is sensitive to uncertainties in the data and model.  ( 2 min )
    The Unreasonable Effectiveness of Deep Evidential Regression. (arXiv:2205.10060v1 [cs.LG])
    There is a significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware regression-based neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, notably with the capabilities to disentangle aleatoric and epistemic uncertainties. Despite some empirical success of Deep Evidential Regression (DER), there are important gaps in the mathematical foundation that raise the question of why the proposed technique seemingly works. We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification. We go on to propose corrections and redefinitions of how aleatoric and epistemic uncertainties should be extracted from NNs.  ( 2 min )
    Summarization as Indirect Supervision for Relation Extraction. (arXiv:2205.09837v1 [cs.CL])
    Relation extraction (RE) models have been challenged by their reliance on training data with expensive annotations. Considering that summarization tasks aim at acquiring concise expressions of synoptical information from the longer context, these tasks naturally align with the objective of RE, i.e., extracting a kind of synoptical information that describes the relation of entity mentions. We present SuRE, which converts RE into a summarization formulation. SuRE leads to more precise and resource-efficient RE based on indirect supervision from summarization tasks. To achieve this goal, we develop sentence and relation conversion techniques that essentially bridge the formulation of summarization and RE tasks. We also incorporate constraint decoding techniques with Trie scoring to further enhance summarization-based RE with robust inference. Experiments on three RE datasets demonstrate the effectiveness of SuRE in both full-dataset and low-resource settings, showing that summarization is a promising source of indirect supervision to improve RE models.  ( 2 min )
    Self-Supervised Depth Estimation with Isometric-Self-Sample-Based Learning. (arXiv:2205.10006v1 [cs.CV])
    Managing the dynamic regions in the photometric loss formulation has been a main issue for handling the self-supervised depth estimation problem. Most previous methods have alleviated this issue by removing the dynamic regions in the photometric loss formulation based on the masks estimated from another module, making it difficult to fully utilize the training images. In this paper, to handle this problem, we propose an isometric self-sample-based learning (ISSL) method to fully utilize the training images in a simple yet effective way. The proposed method provides additional supervision during training using self-generated images that comply with the pure static scene assumption. Specifically, the isometric self-sample generator synthesizes self-samples for each training image by applying random rigid transformations on the estimated depth. Thus both the generated self-samples and the corresponding training image always follow the static scene assumption. We show that plugging our ISSL module into several existing models consistently improves the performance by a large margin. In addition, it also boosts the depth accuracy over different types of scenes, i.e., outdoor scenes (KITTI and Make3D) and indoor scenes (NYUv2), validating its high effectiveness.  ( 2 min )
    Deep Learning Methods for Proximal Inference via Maximum Moment Restriction. (arXiv:2205.09824v1 [stat.ML])
    The No Unmeasured Confounding Assumption is widely used to identify causal effects in observational studies. Recent work on proximal inference has provided alternative identification results that succeed even in the presence of unobserved confounders, provided that one has measured a sufficiently rich set of proxy variables, satisfying specific structural conditions. However, proximal inference requires solving an ill-posed integral equation. Previous approaches have used a variety of machine learning techniques to estimate a solution to this integral equation, commonly referred to as the bridge function. However, prior work has often been limited by relying on pre-specified kernel functions, which are not data adaptive and struggle to scale to large datasets. In this work, we introduce a flexible and scalable method based on a deep neural network to estimate causal effects in the presence of unmeasured confounding using proximal inference. Our method achieves state-of-the-art performance on two well-established proximal inference benchmarks. Finally, we provide theoretical consistency guarantees for our method.  ( 2 min )
    Survey on Fair Reinforcement Learning: Theory and Practice. (arXiv:2205.10032v1 [cs.LG])
    Fairness-aware learning aims at satisfying various fairness constraints in addition to the usual performance criteria via data-driven machine learning techniques. Most of the research in fairness-aware learning employs the setting of fair-supervised learning. However, many dynamic real-world applications can be better modeled using sequential decision-making problems and fair reinforcement learning provides a more suitable alternative for addressing these problems. In this article, we provide an extensive overview of fairness approaches that have been implemented via a reinforcement learning (RL) framework. We discuss various practical applications in which RL methods have been applied to achieve a fair solution with high accuracy. We further include various facets of the theory of fair reinforcement learning, organizing them into single-agent RL, multi-agent RL, long-term fairness via RL, and offline learning. Moreover, we highlight a few major issues to explore in order to advance the field of fair-RL, namely - i) correcting societal biases, ii) feasibility of group fairness or individual fairness, and iii) explainability in RL. Our work is beneficial for both researchers and practitioners as we discuss articles providing mathematical guarantees as well as articles with empirical studies on real-world problems.  ( 2 min )
    FedNoiL: A Simple Two-Level Sampling Method for Federated Learning with Noisy Labels. (arXiv:2205.10110v1 [cs.LG])
    Federated learning (FL) aims at training a global model on the server side while the training data are collected and located at the local devices. Hence, the labels in practice are usually annotated by clients of varying expertise or criteria and thus contain different amounts of noise. Local training on noisy labels can easily result in overfitting to noisy labels, which is devastating to the global model through aggregation. Although recent robust FL methods take malicious clients into account, they have not addressed local noisy labels on each device and their impact on the global model. In this paper, we develop a simple two-level sampling method "FedNoiL" that (1) selects clients for more robust global aggregation on the server; and (2) selects clean labels and correct pseudo-labels at the client end for more robust local training. The sampling probabilities are built upon clean label detection by the global model. Moreover, we investigate different schedules changing the local epochs between aggregations over the course of FL, which notably improves the communication and computation efficiency in the noisy-label setting. In experiments with homogeneous/heterogeneous data distributions and noise ratios, we observed that direct combinations of SOTA FL methods with SOTA noisy-label learning methods can easily fail but our method consistently achieves better and robust performance.  ( 2 min )
    Exploring Extreme Parameter Compression for Pre-trained Language Models. (arXiv:2205.10036v1 [cs.CL])
    Recent work explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs) in natural language processing. This raises many concerns from various perspectives, e.g., financial costs and carbon emissions. Compressing PLMs like BERT with negligible performance loss for faster inference and cheaper deployment has attracted much attention. In this work, we aim to explore larger compression ratios for PLMs, among which tensor decomposition is a potential but under-investigated one. Two decomposition and reconstruction protocols are further proposed to improve the effectiveness and efficiency during compression. Our compressed BERT with ${1}/{7}$ parameters in Transformer layers performs on-par with, sometimes slightly better than the original BERT in GLUE benchmark. A tiny version achieves $96.7\%$ performance of BERT-base with $ {1}/{48} $ encoder parameters (i.e., less than 2M parameters excluding the embedding layer) and $2.7 \times$ faster on inference. To show that the proposed method is orthogonal to existing compression methods like knowledge distillation, we also explore the benefit of the proposed method on a distilled BERT.  ( 2 min )
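    Plain truncated SVD of a single weight matrix is the simplest member of this decomposition family and shows where the parameter savings come from (the paper's tensor decompositions of Transformer layers generalize this idea):

        import numpy as np

        def low_rank_factorize(W, rank):
            # truncated SVD: W ≈ A @ B with rank*(out+in) parameters
            # instead of out*in
            U, s, Vh = np.linalg.svd(W, full_matrices=False)
            return U[:, :rank] * s[:rank], Vh[:rank]

        rng = np.random.default_rng(0)
        W = rng.standard_normal((768, 3072))   # e.g., a BERT feed-forward weight
        A, B = low_rank_factorize(W, rank=64)
        print(W.size, A.size + B.size)          # 2359296 vs 245760 (~10x fewer)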
    MiDAS: Multi-integrated Domain Adaptive Supervision for Fake News Detection. (arXiv:2205.09817v1 [cs.LG])
    COVID-19 related misinformation and fake news, coined an 'infodemic', has dramatically increased over the past few years. This misinformation exhibits concept drift, where the distribution of fake news changes over time, reducing the effectiveness of previously trained models for fake news detection. Given a set of fake news models trained on multiple domains, we propose an adaptive decision module to select the best-fit model for a new sample. We propose MiDAS, a multi-domain adaptive approach for fake news detection that ranks the relevancy of existing models to new samples. MiDAS contains two components: a domain-invariant encoder and an adaptive model selector. MiDAS integrates multiple pre-trained and fine-tuned models with their training data to create a domain-invariant representation. Then, MiDAS uses local Lipschitz smoothness of the invariant embedding space to estimate each model's relevance to a new sample. Higher ranked models provide predictions, and lower ranked models abstain. We evaluate MiDAS on generalization to drifted data with 9 fake news datasets, each obtained from different domains and modalities. MiDAS achieves new state-of-the-art performance on multi-domain adaptation for out-of-distribution fake news classification.  ( 2 min )
    Neural Additive Models for Nowcasting. (arXiv:2205.10020v1 [cs.LG])
    Deep neural networks (DNNs) are one of the most highlighted methods in machine learning. However, as DNNs are black-box models, they lack explanatory power for their predictions. Recently, neural additive models (NAMs) have been proposed to provide this power while maintaining high prediction performance. In this paper, we propose a novel NAM approach for multivariate nowcasting (NC) problems, which comprise an important focus area of machine learning. For the multivariate time-series data used in NC problems, explanations should be considered for every input value to the variables at distinguishable time steps. By employing generalized additive models, the proposed NAM-NC successfully explains each input value's importance for multiple variables and time steps. Experimental results involving a toy example and two real-world datasets show that NAM-NC predicts multivariate time-series data as accurately as state-of-the-art neural networks, while also providing the explanatory importance of each input value. We also examine parameter-sharing networks using NAM-NC to decrease their complexity, and NAM-NC's hard-tied feature net extracts explanations with good performance.  ( 2 min )
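    A minimal sketch of the additive structure that makes NAMs inspectable, with one small subnetwork per input feature (NAM-NC extends this to every variable at every time step):

        import torch
        import torch.nn as nn

        class NAM(nn.Module):
            # one small MLP per feature; the prediction is the sum of
            # per-feature contributions, so each input's effect is visible
            def __init__(self, n_features, hidden=32):
                super().__init__()
                self.feature_nets = nn.ModuleList(
                    nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))
                    for _ in range(n_features))
                self.bias = nn.Parameter(torch.zeros(1))

            def forward(self, x):               # x: [batch, n_features]
                contribs = torch.cat(
                    [net(x[:, i:i + 1])
                     for i, net in enumerate(self.feature_nets)], dim=1)
                return contribs.sum(dim=1) + self.bias, contribs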
    Service Delay Minimization for Federated Learning over Mobile Devices. (arXiv:2205.09868v1 [cs.LG])
    Federated learning (FL) over mobile devices has fostered numerous intriguing applications/services, many of which are delay-sensitive. In this paper, we propose a service delay efficient FL (SDEFL) scheme over mobile devices. Unlike traditional communication-efficient FL, which regards wireless communications as the bottleneck, we find that under many situations, the local computing delay is comparable to the communication delay during the FL training process, given the development of high-speed wireless transmission techniques. Thus, the service delay in FL should be computing delay + communication delay over training rounds. To minimize the service delay of FL, simply reducing local computing/communication delay independently is not enough. The delay trade-off between local computing and wireless communications must be considered. Besides, we empirically study the impacts of local computing control and compression strategies (i.e., the number of local updates, weight quantization, and gradient quantization) on computing, communication and service delays. Based on those trade-off observations and empirical studies, we develop an optimization scheme to minimize the service delay of FL over heterogeneous devices. We establish testbeds and conduct extensive emulations/experiments to verify our theoretical analysis. The results show that SDEFL notably reduces service delay with a small accuracy drop compared to peer designs.  ( 2 min )
    Residual Dynamic Mode Decomposition: Robust and verified Koopmanism. (arXiv:2205.09779v1 [physics.flu-dyn])
    Dynamic Mode Decomposition (DMD) describes complex dynamic processes through a hierarchy of simpler coherent features. DMD is regularly used to understand the fundamental characteristics of turbulence and is closely related to Koopman operators. However, verifying the decomposition, equivalently the computed spectral features of Koopman operators, remains a major challenge due to the infinite-dimensional nature of Koopman operators. Challenges include spurious (unphysical) modes, and dealing with continuous spectra, both of which occur regularly in turbulent flows. Residual Dynamic Mode Decomposition (ResDMD), introduced by (Colbrook & Townsend 2021), overcomes some of these challenges through the data-driven computation of residuals associated with the full infinite-dimensional Koopman operator. ResDMD computes spectra and pseudospectra of general Koopman operators with error control, and computes smoothed approximations of spectral measures (including continuous spectra) with explicit high-order convergence theorems. ResDMD thus provides robust and verified Koopmanism. We implement ResDMD and demonstrate its application in a variety of fluid dynamic situations, at varying Reynolds numbers, arising from both numerical and experimental data. Examples include: vortex shedding behind a cylinder; hot-wire data acquired in a turbulent boundary layer; particle image velocimetry data focusing on a wall-jet flow; and acoustic pressure signals of laser-induced plasma. We present some advantages of ResDMD, namely, the ability to verifiably resolve non-linear, transient modes, and spectral calculation with reduced broadening effects. We also discuss how a new modal ordering based on residuals enables greater accuracy with a smaller dictionary than the traditional modulus ordering. This paves the way for greater dynamic compression of large datasets without sacrificing accuracy.  ( 2 min )
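    A simplified sketch of attaching a data-driven residual to each DMD eigenpair, the quantity ResDMD uses to flag spurious modes (the paper's treatment of the infinite-dimensional operator is more general):

        import numpy as np

        def dmd_with_residuals(X, Y, rank=10):
            # X, Y: snapshot matrices with Y[:, j] the time-advanced X[:, j];
            # rank must not exceed min(X.shape)
            U, s, Vh = np.linalg.svd(X, full_matrices=False)
            U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
            K = U.conj().T @ Y @ Vh.conj().T / s       # reduced operator
            eigvals, W = np.linalg.eig(K)
            modes = U @ W                              # candidate modes (= X g)
            Yg = Y @ (Vh.conj().T / s) @ W             # time-advanced candidates
            res = (np.linalg.norm(Yg - modes * eigvals, axis=0)
                   / np.linalg.norm(modes, axis=0))    # relative residual
            return eigvals, modes, res                 # large res: likely spurious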
    Why GANs are overkill for NLP. (arXiv:2205.09838v1 [cs.LG])
    This work offers a novel theoretical perspective on why, despite numerous attempts, adversarial approaches to generative modeling (e.g., GANs) have not been as popular for certain generation tasks, particularly sequential tasks such as Natural Language Generation, as they have in others, such as Computer Vision. In particular, on sequential data such as text, maximum-likelihood approaches are significantly more utilized than GANs. We show that, while it may seem that maximizing likelihood is inherently different than minimizing distinguishability, this distinction is largely artificial and only holds for limited models. We argue that minimizing KL-divergence (i.e., maximizing likelihood) is a more efficient approach to effectively minimizing the same distinguishability criteria that adversarial models seek to optimize. Reductions show that minimizing distinguishability can be seen as simply boosting likelihood for certain families of models including n-gram models and neural networks with a softmax output layer. To achieve a full polynomial-time reduction, a novel next-token distinguishability model is considered.  ( 2 min )
    Discrete-Convex-Analysis-Based Framework for Warm-Starting Algorithms with Predictions. (arXiv:2205.09961v1 [cs.LG])
    Augmenting algorithms with learned predictions is a promising approach for going beyond worst-case bounds. Dinitz, Im, Lavastida, Moseley, and Vassilvitskii~(2021) have demonstrated that a warm start with learned dual solutions can improve the time complexity of the Hungarian method for weighted perfect bipartite matching. We extend and improve their framework in a principled manner via \textit{discrete convex analysis} (DCA), a discrete analog of convex analysis. We show the usefulness of our DCA-based framework by applying it to weighted perfect bipartite matching, weighted matroid intersection, and discrete energy minimization for computer vision. Our DCA-based framework yields time complexity bounds that depend on the $\ell_\infty$-distance from a predicted solution to an optimal solution, which has two advantages relative to the previous $\ell_1$-distance-dependent bounds: time complexity bounds are smaller, and learning of predictions is more sample efficient. We also discuss whether to learn primal or dual solutions from the DCA perspective.  ( 2 min )
    A Learning-Based Approach to Approximate Coded Computation. (arXiv:2205.09818v1 [cs.IT])
    Lagrange coded computation (LCC) is essential for solving problems involving matrix polynomials in a coded distributed fashion; nevertheless, it can only solve problems that are representable as matrix polynomials. In this paper, we propose AICC, an AI-aided learning approach that is inspired by LCC but also uses deep neural networks (DNNs). It is appropriate for coded computation of more general functions. Numerical simulations demonstrate the suitability of the proposed approach for the coded computation of different matrix functions that are often utilized in digital signal processing.  ( 2 min )
    Calibration Matters: Tackling Maximization Bias in Large-scale Advertising Recommendation Systems. (arXiv:2205.09809v1 [cs.LG])
    Calibration is defined as the ratio of the average predicted click rate to the true click rate. The optimization of calibration is essential to many online advertising recommendation systems because it directly affects the downstream bids in ads auctions and the amount of money charged to advertisers. Despite its importance, calibration optimization often suffers from a problem called "maximization bias". Maximization bias refers to the phenomenon that the maximum of predicted values overestimates the true maximum. The problem is introduced because the calibration is computed on the set selected by the prediction model itself. It persists even if unbiased predictions can be achieved on every datapoint and worsens when covariate shifts exist between the training and test sets. To mitigate this problem, we theoretically quantify maximization bias and propose a variance-adjusting debiasing (VAD) meta-algorithm. The algorithm is efficient, robust, and practical as it is able to mitigate maximization bias problems under covariate shifts, neither incurring additional online serving costs nor compromising the ranking performance. We demonstrate the effectiveness of the proposed algorithm using a state-of-the-art recommendation neural network model on a large-scale real-world dataset.  ( 2 min )
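    The selection effect behind maximization bias is easy to reproduce. A minimal sketch (all numbers hypothetical, not from the paper): predictions are unbiased on every datapoint, yet the calibration computed on the items the model itself ranks highest is inflated, because ranking by prediction preferentially selects positive noise.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: ground-truth click rates and unbiased, noisy predictions.
    n_ads = 100_000
    true_ctr = rng.beta(2, 50, size=n_ads)
    pred_ctr = true_ctr + rng.normal(0.0, 0.01, size=n_ads)

    # Global calibration is fine: average prediction ~ average truth.
    print(pred_ctr.mean() / true_ctr.mean())                 # ~ 1.00

    # Calibration on the set selected by the model itself is biased upward,
    # because picking the top-k predictions preferentially picks positive noise.
    top_k = np.argsort(pred_ctr)[-1000:]
    print(pred_ctr[top_k].mean() / true_ctr[top_k].mean())   # > 1.00
    ```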
    Estimation of Entropy in Constant Space with Improved Sample Complexity. (arXiv:2205.09804v1 [cs.DS])
    Recent work of Acharya et al. (NeurIPS 2019) showed how to estimate the entropy of a distribution $\mathcal D$ over an alphabet of size $k$ up to $\pm\epsilon$ additive error by streaming over $(k/\epsilon^3) \cdot \text{polylog}(1/\epsilon)$ i.i.d. samples and using only $O(1)$ words of memory. In this work, we give a new constant memory scheme that reduces the sample complexity to $(k/\epsilon^2)\cdot \text{polylog}(1/\epsilon)$. We conjecture that this is optimal up to $\text{polylog}(1/\epsilon)$ factors.  ( 2 min )
    Causal Discovery and Injection for Feed-Forward Neural Networks. (arXiv:2205.09787v1 [cs.LG])
    Neural networks have proven to be effective at solving a wide range of problems, but it is often unclear whether they learn any meaningful causal relationship: this poses a problem for the robustness of neural network models and their use for high-stakes decisions. We propose a novel method overcoming this issue by injecting knowledge in the form of (possibly partial) causal graphs into feed-forward neural networks, so that the learnt model is guaranteed to conform to the graph, hence adhering to expert knowledge. This knowledge may be given up-front or during the learning process, to improve the model through human-AI collaboration. We apply our method to synthetic and real (tabular) data, showing that it is robust against noise and can improve causal discovery and prediction performance in low-data regimes.  ( 2 min )
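    One natural way to guarantee that a learnt model conforms to a causal graph is structural: mask connections so each output can only see its parents in the graph. A minimal sketch of that idea (the architecture and names here are illustrative, not necessarily the paper's exact injection mechanism):

    ```python
    import numpy as np

    def masked_forward(x, A, W1, W2):
        """Sketch: a two-layer network predicting every variable from the others,
        constrained so that output j depends only on the parents of j in the
        causal graph A (A[i, j] = 1 iff i -> j). One hidden block per output,
        with its input weights masked by column j of A."""
        d = A.shape[0]
        outputs = []
        for j in range(d):
            mask = A[:, j]                    # parents of variable j
            h = np.tanh((x * mask) @ W1[j])   # hidden layer sees only parents
            outputs.append(h @ W2[j])
            # non-parents have exactly zero influence on output j by construction
        return np.array(outputs)

    d, hidden = 4, 8
    A = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]])              # a small causal DAG
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(d, d, hidden))
    W2 = rng.normal(size=(d, hidden))
    print(masked_forward(rng.normal(size=d), A, W1, W2))
    ```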
    Label-invariant Augmentation for Semi-Supervised Graph Classification. (arXiv:2205.09802v1 [cs.CV])
    Recently, contrastive-learning-based augmentation has surged in the computer vision domain, where operations such as rotation, cropping, and flipping, combined with dedicated algorithms, dramatically increase model generalization and robustness. Following this trend, some pioneering attempts apply a similar idea to graph data. Nevertheless, unlike images, it is much more difficult to design reasonable augmentations for graphs without changing their nature. Despite this excitement, current graph contrastive learning does not perform as well as visual contrastive learning. We conjecture that its performance might be limited by violations of the label-invariant augmentation assumption. In light of this, we propose a label-invariant augmentation for graph-structured data to address this challenge. Different from node/edge modification and subgraph extraction, we conduct the augmentation in the representation space and generate the augmented samples in the most difficult direction while keeping the labels of the augmented data the same as those of the original samples. In the semi-supervised scenario, we demonstrate that our proposed method outperforms classical graph neural network based methods and recent graph contrastive learning on eight benchmark graph-structured datasets, followed by several in-depth experiments to further explore label-invariant augmentation from several aspects.  ( 2 min )
    Graph Neural Networks Are More Powerful Than we Think. (arXiv:2205.09801v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful convolutional architectures that have shown remarkable performance in various node-level and graph-level tasks. Despite their success, the common belief is that the expressive power of GNNs is limited and that they are at most as discriminative as the Weisfeiler-Lehman (WL) algorithm. In this paper we argue the opposite and show that the WL algorithm is the upper bound only when the input to the GNN is the vector of all ones. In this direction, we derive an alternative analysis that employs linear algebraic tools and characterize the representational power of GNNs with respect to the eigenvalue decomposition of the graph operators. We show that GNNs can distinguish between any graphs that differ in at least one eigenvalue and design simple GNN architectures that are provably more expressive than the WL algorithm. Thorough experimental analysis on graph isomorphism and graph classification datasets corroborates our theoretical results and demonstrates the effectiveness of the proposed architectures.  ( 2 min )
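    A classic pair illustrates the eigenvalue-based separation: a 6-cycle and two disjoint triangles are both 2-regular, so 1-WL with uniform initial colours (the all-ones input of the abstract) never tells them apart, yet their adjacency spectra differ. A quick check with numpy:

    ```python
    import numpy as np

    def cycle(n):
        A = np.zeros((n, n))
        for i in range(n):
            A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
        return A

    # C6 (one 6-cycle) vs 2xC3 (two disjoint triangles): both 2-regular,
    # so 1-WL with uniform initial colours never separates them ...
    A1 = cycle(6)
    A2 = np.block([[cycle(3), np.zeros((3, 3))],
                   [np.zeros((3, 3)), cycle(3)]])

    # ... but their adjacency spectra differ, so a spectrum-aware GNN can.
    print(np.round(np.linalg.eigvalsh(A1), 3))  # contains -2 and +/-1
    print(np.round(np.linalg.eigvalsh(A2), 3))  # eigenvalue 2 with multiplicity 2
    ```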
    HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene Encoding. (arXiv:2205.09753v1 [cs.AI])
    One essential task for autonomous driving is to encode the information of a driving scene into vector representations so that downstream tasks such as trajectory prediction can perform well. The driving scene is complicated: its elements are heterogeneous, carrying diverse types of information, e.g., agent dynamics, map routing, and road lines. Meanwhile, elements are also relative to one another - they have spatial relations with each other - and such relations should be represented canonically in relative terms, since absolute coordinate values are meaningless. Taking these two observations into consideration, we propose a novel backbone, namely the Heterogeneous Driving Graph Transformer (HDGT), which models the driving scene as a heterogeneous graph with different types of nodes and edges. For graph construction, each node represents either an agent or a road element, and each edge represents a semantic relation between them, such as Pedestrian-To-Crosswalk or Lane-To-Left-Lane. As for spatial relation encoding, instead of setting a fixed global reference, the coordinate information of each node and its in-edges is transformed to a local node-centric coordinate system. For the aggregation module in the graph neural network (GNN), we adopt the transformer structure in a hierarchical way to fit the heterogeneous nature of the inputs. Experimental results show that the proposed method achieves new state-of-the-art performance on the INTERACTION Prediction Challenge and the Waymo Open Motion Challenge, in which we rank 1st and 2nd respectively regarding the minADE/minFDE metric.  ( 2 min )
  • Open

    Long-Range Transformers for Dynamic Spatiotemporal Forecasting. (arXiv:2109.12218v2 [cs.LG] UPDATED)
    Multivariate Time Series Forecasting focuses on the prediction of future values based on historical context. State-of-the-art sequence-to-sequence models rely on neural attention between timesteps, which allows for temporal learning but fails to consider distinct spatial relationships between variables. In contrast, methods based on graph neural networks explicitly model variable relationships. However, these methods often rely on predefined graphs and perform separate spatial and temporal updates without establishing direct connections between each variable at every timestep. This paper addresses these problems by translating multivariate forecasting into a spatiotemporal sequence formulation where each Transformer input token represents the value of a single variable at a given time. Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence. Our method, which we call Spacetimeformer, achieves competitive results on benchmarks from traffic forecasting to electricity demand and weather prediction while learning fully-connected spatiotemporal relationships purely from data.
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v2 [cs.LG] UPDATED)
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation and supervised learning with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.
    Trend analysis and forecasting air pollution in Rwanda. (arXiv:2205.10024v1 [stat.ML])
    Air pollution is a major public health problem worldwide, although a lack of data remains an issue for most low- and middle-income countries. Ambient air pollution in the form of fine particulate matter (PM2.5) exceeds the World Health Organization guidelines in Rwanda, with a daily average of around 42.6 micrograms per cubic meter. Monitoring and mitigation strategies require an expensive investment in equipment to collect pollution data. Low-cost sensor technology and machine learning methods have emerged as an alternative solution for obtaining reliable information for decision making. This paper analyzes the trend of air pollution in Rwanda and proposes forecasting models suited to data collected by a network of low-cost sensors deployed in Rwanda.
    An alternative proof of the vulnerability of retrieval in high intrinsic dimensionality neighborhood. (arXiv:2010.00990v2 [cs.LG] UPDATED)
    This paper investigates the vulnerability of nearest-neighbor search, a pivotal tool in data analysis and machine learning. The vulnerability is gauged as the relative amount of perturbation that an attacker needs to add to a dataset point in order to modify its neighbor rank w.r.t. a query. The statistical distribution of this quantity is derived from simple assumptions. Experiments on six large-scale datasets validate this model up to some outliers, which are explained in terms of violations of the assumptions.
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v3 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift, where the training and the test distributions are different. A suite of approaches, such as importance weighting and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to those obtained by ERM. We also show that adding a small amount of regularization that does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.
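    For concreteness, here is a minimal sketch of one member of the GRW family as the abstract defines it: model parameters are updated iteratively under sample weights that are themselves recomputed each round. The softmax-of-losses weighting rule and hyperparameters below are illustrative (in the spirit of DRO), not the paper's specific instance.

    ```python
    import numpy as np

    def grw_logistic_regression(X, y, rounds=200, lr=0.1, temp=1.0):
        """Schematic Generalized Reweighting (GRW) loop: alternate between
        reweighting samples from their current losses and taking a gradient
        step on the weighted loss."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(rounds):
            p = 1.0 / (1.0 + np.exp(-X @ w))
            losses = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
            q = np.exp(temp * (losses - losses.max()))  # upweight high-loss samples
            q /= q.sum()                                # weights live on the simplex
            grad = X.T @ (q * (p - y))                  # weighted logistic gradient
            w -= lr * grad
        return w
    ```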
    Robust Expected Information Gain for Optimal Bayesian Experimental Design Using Ambiguity Sets. (arXiv:2205.09914v1 [stat.ML])
    The ranking of experiments by expected information gain (EIG) in Bayesian experimental design is sensitive to changes in the model's prior distribution, and the approximation of EIG yielded by sampling will have errors similar to the use of a perturbed prior. We define and analyze \emph{robust expected information gain} (REIG), a modification of the objective in EIG maximization by minimizing an affine relaxation of EIG over an ambiguity set of distributions that are close to the original prior in KL-divergence. We show that, when combined with a sampling-based approach to estimating EIG, REIG corresponds to a `log-sum-exp' stabilization of the samples used to estimate EIG, meaning that it can be efficiently implemented in practice. Numerical tests combining REIG with variational nested Monte Carlo (VNMC), adaptive contrastive estimation (ACE) and mutual information neural estimation (MINE) suggest that in practice REIG also compensates for the variability of under-sampled estimators.
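    The "log-sum-exp stabilization" can be read through the standard dual of KL-constrained distributional robustness: the worst case of a sample mean over a KL ball is a soft-min computed via logsumexp. A sketch under that reading (the paper's exact REIG estimator may differ in details):

    ```python
    import numpy as np
    from scipy.special import logsumexp
    from scipy.optimize import minimize_scalar

    def robust_eig(eig_samples, delta):
        """Soft-min of per-sample EIG contributions over a KL ball of radius
        delta around the empirical distribution, via the KL-DRO dual:
            min_{KL(Q||P)<=delta} E_Q[f] = sup_{t>0} -(1/t) log E_P[e^{-t f}] - delta/t.
        A sketch of the 'log-sum-exp stabilization' the abstract refers to."""
        n = len(eig_samples)

        def neg_dual(t):
            return (logsumexp(-t * eig_samples) - np.log(n)) / t + delta / t

        res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
        return -res.fun

    samples = np.random.default_rng(1).normal(1.0, 0.5, size=500)  # toy EIG samples
    print(robust_eig(samples, delta=0.1))  # <= the plain sample mean
    ```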
    Deep electric field predictions by drift-reduced Braginskii theory with plasma-neutral interactions based upon experimental images of boundary turbulence. (arXiv:2204.11689v1 [physics.plasm-ph] CROSS LISTED)
    We present 2-dimensional turbulent electric field calculations via physics-informed deep learning consistent with (i) drift-reduced Braginskii theory under the framework of an axisymmetric fusion plasma with purely toroidal field and (ii) experimental estimates of the fluctuating electron density and temperature obtained from analysis of gas puff imaging of a discharge on the Alcator C-Mod tokamak. The inclusion of effects from the locally puffed atomic helium on particle and energy sources within the reduced plasma turbulence model are found to strengthen correlations between the electric field and electron pressure. The neutrals are also directly associated with an observed broadening in the distribution of turbulent field amplitudes and increased ${\bf E \times B}$ shearing rates.
    Posterior Refinement Improves Sample Efficiency in Bayesian Neural Networks. (arXiv:2205.10041v1 [cs.LG])
    Monte Carlo (MC) integration is the de facto method for approximating the predictive distribution of Bayesian neural networks (BNNs). But, even with many MC samples, Gaussian-based BNNs could still yield bad predictive performance due to the posterior approximation's error. Meanwhile, alternatives to MC integration tend to be more expensive or biased. In this work, we experimentally show that the key to good MC-approximated predictive distributions is the quality of the approximate posterior itself. However, previous methods for obtaining accurate posterior approximations are expensive and non-trivial to implement. We, therefore, propose to refine Gaussian approximate posteriors with normalizing flows. When applied to last-layer BNNs, it yields a simple \emph{post hoc} method for improving pre-existing parametric approximations. We show that the resulting posterior approximation is competitive with even the gold-standard full-batch Hamiltonian Monte Carlo.
    A Case of Exponential Convergence Rates for SVM. (arXiv:2205.10055v1 [stat.ML])
    Classification is often the first problem described in introductory machine learning classes. Generalization guarantees for classification have historically been offered by Vapnik-Chervonenkis theory. Yet those guarantees are based on intractable algorithms, which has led to the theory of surrogate methods in classification. Guarantees offered by surrogate methods are based on calibration inequalities, which have been shown to be highly sub-optimal under some margin conditions, falling short of capturing exponential convergence phenomena. Those "super-fast" rates are becoming well understood for smooth surrogates, but the picture remains blurry for non-smooth losses such as the hinge loss associated with the renowned support vector machines. In this paper, we present a simple mechanism to obtain fast convergence rates and we investigate its usage for SVM. In particular, we show that SVM can exhibit exponential convergence rates even without assuming the hard Tsybakov margin condition.
    Triangulation candidates for Bayesian optimization. (arXiv:2112.07457v2 [stat.CO] UPDATED)
    Bayesian optimization involves "inner optimization" over a new-data acquisition criterion which is non-convex/highly multi-modal, may be non-differentiable, or may otherwise thwart local numerical optimizers. In such cases it is common to replace continuous search with a discrete one over random candidates. Here we propose using candidates based on a Delaunay triangulation of the existing input design. We detail the construction of these "tricands" and demonstrate empirically how they outperform both numerically optimized acquisitions and random candidate-based alternatives, and are well-suited for hybrid schemes, on benchmark synthetic and real simulation experiments.
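    One simple construction in the spirit of these candidates places them at the centroids of the Delaunay simplices of the current design; the paper's full "tricands" recipe also handles the boundary, so treat this as an illustrative sketch:

    ```python
    import numpy as np
    from scipy.spatial import Delaunay
    from scipy.stats import qmc

    def triangulation_candidates(X):
        """Candidates for a discrete acquisition search, placed at the centroids
        of the Delaunay simplices of the existing design X -- one simple
        construction in the spirit of 'tricands'."""
        tri = Delaunay(X)
        return X[tri.simplices].mean(axis=1)   # centroid of each simplex

    X = qmc.LatinHypercube(d=2, seed=0).random(20)   # existing BO design
    cands = triangulation_candidates(X)
    print(cands.shape)   # interior candidates, one per simplex
    ```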
    Speeding up PCA with priming. (arXiv:2109.03709v3 [cs.LG] UPDATED)
    We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of the priming is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon the priming algorithm under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we prime pPCA by several approximate algorithms and report an average speedup by a factor of 7.2 over Oja's rule, and a factor of 10.5 over EigenGame.  ( 2 min )
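    The two-step structure is easy to sketch: prime with any cheap approximate-PCA routine, then do the exact eigendecomposition inside the small primed subspace. A minimal numpy sketch (the priming routine below is a hypothetical stand-in for Oja's rule or EigenGame; assumes centered data):

    ```python
    import numpy as np

    def primed_pca(X, prime_fn, k):
        """pPCA sketch: approximate subspace first, exact PCA inside it second."""
        V0 = prime_fn(X)                       # (d, k') approximate components
        Q, _ = np.linalg.qr(V0)                # orthonormal basis of the subspace
        C = np.cov(X @ Q, rowvar=False)        # tiny k' x k' covariance
        vals, vecs = np.linalg.eigh(C)         # exact, and very cheap
        top = vecs[:, np.argsort(vals)[::-1][:k]]
        return Q @ top                         # lift back to the original space

    def cheap_prime(X, k=10, iters=2, seed=0):
        """Hypothetical priming step: a couple of randomized subspace iterations."""
        V = np.random.default_rng(seed).normal(size=(X.shape[1], k))
        for _ in range(iters):
            V, _ = np.linalg.qr(X.T @ (X @ V))
        return V

    X = np.random.default_rng(1).normal(size=(2000, 100))
    X -= X.mean(axis=0)                        # PCA assumes centered data
    components = primed_pca(X, cheap_prime, k=5)   # (100, 5)
    ```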
    Interpretable Personalization via Policy Learning with Linear Decision Boundaries. (arXiv:2003.07545v3 [cs.LG] UPDATED)
    With the rise of the digital economy and an explosion of available information on consumers, effective personalization of offers, goods, and services has become a core business focus for companies to improve revenues and maintain competitive edge. This paper studies the personalization problem through the lens of policy learning, where the goal is to learn a decision-making rule (a policy) that maps from consumer and product characteristics (features) to recommendations (actions) in order to optimize outcomes (rewards). We focus on using available historical data for offline learning with unknown data collection procedure. Importantly, in many business and medical settings, interpretability of a policy is essential. To address these challenges, we study the class of policies with linear decision boundaries and propose learning algorithms using tools from causal inference. We propose several optimization schemes to solve the associated non-convex, non-smooth optimization problem, and find that an adapted Bayesian optimization algorithm is fast and effective. We test our algorithm with extensive simulation studies and apply it to an online marketplace customer purchase dataset, where the learned policy outputs a personalized discount recommendation based on customer and product features in order to maximize gross merchandise value (GMV) for sellers. Our learned policy improves upon the platform's baseline by 88.2\% in net sales revenue, while also providing informative insights on which features are important for the decision-making process, e.g. when "Attribute 2" is large, marginal increase in GMV is low for discounts higher than 10\%. Our findings suggest that the proposed policy learning algorithm provides a promising practical approach for interpretable personalization across a wide range of applications.  ( 2 min )
    Algorithms for Weak Optimal Transport with an Application to Economics. (arXiv:2205.09825v1 [stat.ML])
    The theory of weak optimal transport (WOT), introduced by [Gozlan et al., 2017], generalizes the classic Monge-Kantorovich framework by allowing the transport cost between one point and the points it is matched with to be nonlinear. In the so-called barycentric version of WOT, the cost for transporting a point $x$ only depends on $x$ and on the barycenter of the points it is matched with. This aggregation property of WOT is appealing in machine learning, economics and finance. Yet algorithms to compute WOT have only been developed for the special case of quadratic barycentric WOT, or depend on neural networks with no guarantee on the computed value and matching. The main difficulty lies in the transportation constraints, which are costly to project onto. In this paper, we propose to use mirror descent algorithms to solve the primal and dual versions of the WOT problem. We also apply our algorithms to the variant of WOT introduced by [Chon\'e et al., 2022] where mass is distributed from one space to another through unnormalized kernels (WOTUK). We empirically compare the solutions of WOT and WOTUK with classical OT. Finally, we apply our numerical methods to the economic framework of [Chon\'e and Kramarz, 2021], namely the matching between workers and firms on labor markets.
    Invariance principle of random projection for the norm. (arXiv:2112.00300v2 [math.PR] UPDATED)
    The Johnson-Lindenstrauss lemma guarantees that certain geometric structure of high-dimensional deterministic vectors is preserved when they are projected to low-dimensional vectors. In this work, we seek to understand how random projections affect the norms of random vectors. In particular, we prove that the distribution of the norm of a random vector $X \in \mathbb{R}^n$, whose entries are i.i.d. random variables, is preserved by a random projection $S:\mathbb{R}^n \to \mathbb{R}^m$. More precisely, \[ \frac{X^TS^TSX - mn}{\sqrt{\sigma^2 m^2n+2mn^2}} \xrightarrow[\quad m/n\to 0 \quad ]{ m,n\to \infty } \mathcal{N}(0,1). \] We also prove a concentration result for the random norm transformed by either a random projection or a random embedding. Overall, our results show that random projections have low distortion for the norms of random vectors with i.i.d. entries.
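    The displayed limit is easy to check numerically. A Monte Carlo sketch assuming standard normal entries for both $X$ and $S$ (so that $\sigma^2 = \mathrm{Var}(X_i^2) = 2$; the theorem allows general i.i.d. entries):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    m, n, trials = 20, 2000, 1000        # m/n small, both moderately large
    sigma2 = 2.0                         # Var(X_i^2) for N(0,1) entries

    stats = np.empty(trials)
    for t in range(trials):
        X = rng.normal(size=n)           # random vector with i.i.d. entries
        S = rng.normal(size=(m, n))      # Gaussian random projection
        z = S @ X
        stats[t] = (z @ z - m * n) / np.sqrt(sigma2 * m**2 * n + 2 * m * n**2)

    print(stats.mean(), stats.std())     # should be close to 0 and 1
    ```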
    Nonlinear Initialization Methods for Low-Rank Neural Networks. (arXiv:2202.00834v3 [cs.LG] UPDATED)
    We propose a novel low-rank initialization framework for training low-rank deep neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. The most successful prior existing approach, spectral initialization, draws a sample from the initialization distribution for the full-rank setting and then optimally approximates the full-rank initialization parameters in the Frobenius norm with a pair of low-rank initialization matrices via singular value decomposition. Our method is inspired by the insight that approximating the function corresponding to each layer is more important than approximating the parameter values. We provably demonstrate that there is a significant gap between these two approaches for ReLU networks, particularly as the desired rank of the approximating weights decreases, or as the dimension of the inputs to the layer increases (the latter point holds when the network width is super-linear in dimension). Along the way, we provide the first provably efficient algorithm for solving the ReLU low-rank approximation problem for fixed parameter rank $r$ -- previously, it was unknown that the problem was computationally tractable to solve even for rank $1$. We also provide a practical algorithm to solve this problem which is no more expensive than the existing spectral initialization approach, and validate our theory by training ResNet and EfficientNet models (He et al., 2016; Tan & Le, 2019) on ImageNet (Russakovsky et al., 2015).
    Semi-self-supervised Automated ICD Coding. (arXiv:2205.10088v1 [cs.CL])
    Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free-text format, as they examine and interview patients. In recent years, several studies have been published that provide evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time-consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs with a machine-learned imputation in a semi-self-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of un-annotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might find the answers to during a patient consultation. The features are then used to train a classifier for the diagnosis of certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of data availability to the physician. Our data augmentation method shows a significant positive effect, which is diminished when clinical features from the examination of the patient and diagnostics are made available. We recommend our method for augmenting scarce datasets for systems that make decisions based on clinical features that do not include examinations or tests.
    A New Central Limit Theorem for the Augmented IPW Estimator: Variance Inflation, Cross-Fit Covariance and Beyond. (arXiv:2205.10198v1 [math.ST])
    Estimation of the average treatment effect (ATE) is a central problem in causal inference. In recent times, inference for the ATE in the presence of high-dimensional covariates has been extensively studied. Among the diverse approaches that have been proposed, augmented inverse probability weighting (AIPW) with cross-fitting has emerged as a popular choice in practice. In this work, we study this cross-fit AIPW estimator under well-specified outcome regression and propensity score models in a high-dimensional regime where the number of features and samples are both large and comparable. Under assumptions on the covariate distribution, we establish a new CLT for the suitably scaled cross-fit AIPW that applies without any sparsity assumptions on the underlying high-dimensional parameters. Our CLT uncovers two crucial phenomena among others: (i) the AIPW exhibits a substantial variance inflation that can be precisely quantified in terms of the signal-to-noise ratio and other problem parameters, (ii) the asymptotic covariance between the pre-cross-fit estimates is non-negligible even on the root-n scale. In fact, these cross-covariances turn out to be negative in our setting. These findings are strikingly different from their classical counterparts. On the technical front, our work utilizes a novel interplay between three distinct tools--approximate message passing theory, the theory of deterministic equivalents, and the leave-one-out approach. We believe our proof techniques should be useful for analyzing other two-stage estimators in this high-dimensional regime. Finally, we complement our theoretical results with simulations that demonstrate both the finite sample efficacy of our CLT and its robustness to our assumptions.
    Instagram Fake and Automated Account Detection. (arXiv:1910.03090v3 [cs.IR] UPDATED)
    Fake engagement is one of the significant problems in Online Social Networks (OSNs); it is used to increase the popularity of an account in an inorganic manner. The detection of fake engagement is crucial because it leads to loss of money for businesses, wrong audience targeting in advertising, wrong product prediction systems, and an unhealthy social network environment. This study addresses the detection of fake and automated accounts, which lead to fake engagement on Instagram. Prior to this work, there was no publicly available dataset of fake and automated accounts. For this purpose, two datasets have been published for the detection of fake and automated accounts. For the detection of these accounts, machine learning algorithms such as Naive Bayes, Logistic Regression, Support Vector Machines, and Neural Networks are applied. Additionally, for the detection of automated accounts, a cost-sensitive genetic algorithm is proposed to handle the unnatural bias in the dataset. To deal with the class imbalance in the fake-account dataset, the SMOTE-NC algorithm is implemented. For the automated and fake account detection datasets, classification accuracies of 86% and 96% are obtained, respectively.
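    For the class-imbalance step, SMOTE-NC (available in imbalanced-learn) interpolates numeric features and takes a majority vote on categorical ones, so oversampling never invents invalid category values. A sketch on synthetic account features (the feature set and distributions are hypothetical, not the paper's datasets):

    ```python
    import numpy as np
    from imblearn.over_sampling import SMOTENC
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_real, n_fake = 300, 30                       # imbalanced toy data
    make = lambda mu_f, mu_g, lam, n: np.column_stack([
        rng.lognormal(mu_f, 1, n),                 # followers
        rng.lognormal(mu_g, 1, n),                 # following
        rng.poisson(lam, n),                       # media count
        rng.integers(0, 2, n)])                    # is_private (categorical)
    X = np.vstack([make(6, 5, 80, n_real), make(3, 6, 2, n_fake)])
    y = np.array([0] * n_real + [1] * n_fake)      # 1 = fake account

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

    # Oversample the minority (fake) class; column 3 is treated as categorical.
    sm = SMOTENC(categorical_features=[3], random_state=0)
    X_bal, y_bal = sm.fit_resample(X_tr, y_tr)

    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    print(clf.score(X_te, y_te))
    ```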
    Remember and Forget Experience Replay for Multi-Agent Reinforcement Learning. (arXiv:2203.13319v2 [cs.LG] UPDATED)
    We present the extension of the Remember and Forget for Experience Replay (ReF-ER) algorithm to Multi-Agent Reinforcement Learning (MARL). ReF-ER was shown to outperform state of the art algorithms for continuous control in problems ranging from the OpenAI Gym to complex fluid flows. In MARL, the dependencies between the agents are included in the state-value estimator and the environment dynamics are modeled via the importance weights used by ReF-ER. In collaborative environments, we find the best performance when the value is estimated using individual rewards and we ignore the effects of other actions on the transition map. We benchmark the performance of ReF-ER MARL on the Stanford Intelligent Systems Laboratory (SISL) environments. We find that employing a single feed-forward neural network for the policy and the value function in ReF-ER MARL outperforms state of the art algorithms that rely on complex neural network architectures.
    On Calibration of Ensemble-Based Credal Predictors. (arXiv:2205.10082v1 [stat.ML])
    In recent years, several classification methods that intend to quantify epistemic uncertainty have been proposed, either by producing predictions in the form of second-order distributions or sets of probability distributions. In this work, we focus on the latter, also called credal predictors, and address the question of how to evaluate them: What does it mean that a credal predictor represents epistemic uncertainty in a faithful manner? To answer this question, we refer to the notion of calibration of probabilistic predictors and extend it to credal predictors. Broadly speaking, we call a credal predictor calibrated if it returns sets that cover the true conditional probability distribution. To verify this property for the important case of ensemble-based credal predictors, we propose a novel nonparametric calibration test that generalizes an existing test for probabilistic predictors to the case of credal predictors. Making use of this test, we empirically show that credal predictors based on deep neural networks are often not well calibrated.
    Conformal Prediction with Temporal Quantile Adjustments. (arXiv:2205.09940v1 [stat.ML])
    We develop Temporal Quantile Adjustment (TQA), a general method to construct efficient and valid prediction intervals (PIs) for regression on cross-sectional time series data. Such data is common in many domains, including econometrics and healthcare. A canonical example in healthcare is predicting patient outcomes using physiological time-series data, where a population of patients composes a cross-section. Reliable PI estimators in this setting must address two distinct notions of coverage: cross-sectional coverage across a cross-sectional slice, and longitudinal coverage along the temporal dimension for each time series. Recent works have explored adapting Conformal Prediction (CP) to obtain PIs in the time series context. However, none handles both notions of coverage simultaneously. CP methods typically query a pre-specified quantile from the distribution of nonconformity scores on a calibration set. TQA adjusts the quantile to query in CP at each time $t$, accounting for both cross-sectional and longitudinal coverage in a theoretically-grounded manner. The post-hoc nature of TQA facilitates its use as a general wrapper around any time series regression model. We validate TQA's performance through extensive experimentation: TQA generally obtains efficient PIs and improves longitudinal coverage while preserving cross-sectional coverage.
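    The mechanism TQA wraps is standard split conformal prediction; what the paper contributes is the rule for adjusting the queried quantile at each time $t$. A minimal sketch of the wrapped primitive, with the per-time miscoverage level left as an input (the adjustment rule itself is not reproduced here):

    ```python
    import numpy as np

    def conformal_pi(residuals_cal, y_pred, alpha_t):
        """Split-conformal prediction interval with an adjustable miscoverage
        level alpha_t. TQA's contribution is the rule for choosing alpha_t per
        time step (balancing cross-sectional and longitudinal coverage);
        here alpha_t is simply an input."""
        n = len(residuals_cal)
        # conformal quantile with the usual finite-sample correction
        q_level = min(1.0, np.ceil((n + 1) * (1 - alpha_t)) / n)
        q = np.quantile(np.abs(residuals_cal), q_level)
        return y_pred - q, y_pred + q

    cal_res = np.random.default_rng(0).normal(0, 1, 200)   # calibration residuals
    lo, hi = conformal_pi(cal_res, y_pred=5.0, alpha_t=0.1)
    print(lo, hi)
    ```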
    Bounding the Effects of Continuous Treatments for Hidden Confounders. (arXiv:2204.11206v2 [stat.ME] UPDATED)
    Observational studies often seek to infer the causal effect of a treatment even though both the assigned treatment and the outcome depend on other confounding variables. An effective strategy for dealing with confounders is to estimate a propensity model that corrects for the relationship between covariates and assigned treatment. Unfortunately, the confounding variables themselves are not always observed, in which case we can only bound the propensity, and therefore bound the magnitude of causal effects. In many important cases, like administering a dose of some medicine, the possible treatments belong to a continuum. Sensitivity models, which are required to tie the true propensity to something that can be estimated, have been explored for binary treatments. We propose one for continuous treatments. We develop a framework to compute ignorance intervals on the partially identified dose-response curves, enabling us to quantify the susceptibility of an inference to hidden confounders. We show with simulations and three real-world observational studies that our approach can give non-trivial bounds on causal effects from continuous treatments in the presence of hidden confounders.
    Almost exact recovery in noisy semi-supervised learning. (arXiv:2007.14717v3 [cs.LG] UPDATED)
    Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the Maximum A Posteriori (MAP) estimator for clustering a Degree Corrected Stochastic Block Model (DC-SBM) when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.
    Bootstrapping the error of Oja's algorithm. (arXiv:2106.14857v2 [math.ST] UPDATED)
    We consider the problem of quantifying uncertainty for the estimation error of the leading eigenvector from Oja's algorithm for streaming principal component analysis, where the data are generated IID from some unknown distribution. By combining classical tools from the U-statistics literature with recent results on high-dimensional central limit theorems for quadratic forms of random vectors and concentration of matrix products, we establish a weighted $\chi^2$ approximation result for the $\sin^2$ error between the population eigenvector and the output of Oja's algorithm. Since estimating the covariance matrix associated with the approximating distribution requires knowledge of unknown model parameters, we propose a multiplier bootstrap algorithm that may be updated in an online manner. We establish conditions under which the bootstrap distribution is close to the corresponding sampling distribution with high probability, thereby establishing the bootstrap as a consistent inferential method in an appropriate asymptotic regime.
    Deep Learning Methods for Proximal Inference via Maximum Moment Restriction. (arXiv:2205.09824v1 [stat.ML])
    The No Unmeasured Confounding Assumption is widely used to identify causal effects in observational studies. Recent work on proximal inference has provided alternative identification results that succeed even in the presence of unobserved confounders, provided that one has measured a sufficiently rich set of proxy variables, satisfying specific structural conditions. However, proximal inference requires solving an ill-posed integral equation. Previous approaches have used a variety of machine learning techniques to estimate a solution to this integral equation, commonly referred to as the bridge function. However, prior work has often been limited by relying on pre-specified kernel functions, which are not data adaptive and struggle to scale to large datasets. In this work, we introduce a flexible and scalable method based on a deep neural network to estimate causal effects in the presence of unmeasured confounding using proximal inference. Our method achieves state of the art performance on two well-established proximal inference benchmarks. Finally, we provide theoretical consistency guarantees for our method.
    The Unreasonable Effectiveness of Deep Evidential Regression. (arXiv:2205.10060v1 [cs.LG])
    There is a significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware regression-based neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, notably with the capabilities to disentangle aleatoric and epistemic uncertainties. Despite some empirical success of Deep Evidential Regression (DER), there are important gaps in the mathematical foundation that raise the question of why the proposed technique seemingly works. We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification. We go on to propose corrections and redefinitions of how aleatoric and epistemic uncertainties should be extracted from NNs.
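    For context, the recipe under scrutiny (Deep Evidential Regression, Amini et al., 2020) has the network emit the four parameters of a Normal-Inverse-Gamma distribution per input and reads the uncertainties off its moments; the paper argues this decomposition is a heuristic. A sketch of that standard extraction:

    ```python
    def der_uncertainties(gamma, nu, alpha, beta):
        """Standard Deep Evidential Regression decomposition (Amini et al., 2020),
        the recipe this paper scrutinizes. The network emits Normal-Inverse-Gamma
        parameters per input; uncertainties are the distribution's moments
        (requires alpha > 1)."""
        prediction = gamma                     # E[mu]
        aleatoric = beta / (alpha - 1)         # E[sigma^2]
        epistemic = beta / (nu * (alpha - 1))  # Var[mu]
        return prediction, aleatoric, epistemic

    print(der_uncertainties(gamma=0.3, nu=2.0, alpha=3.0, beta=1.5))
    ```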
    A General Framework for quantifying Aleatoric and Epistemic uncertainty in Graph Neural Networks. (arXiv:2205.09968v1 [cs.LG])
    Graph Neural Networks (GNN) provide a powerful framework that elegantly integrates Graph theory with Machine learning for modeling and analysis of networked data. We consider the problem of quantifying the uncertainty in predictions of GNN stemming from modeling errors and measurement uncertainty. We consider aleatoric uncertainty in the form of probabilistic links and noise in the feature vectors of nodes, while epistemic uncertainty is incorporated via a probability distribution over the model parameters. We propose a unified approach to treat both sources of uncertainty in a Bayesian framework, where Assumed Density Filtering is used to quantify aleatoric uncertainty and Monte Carlo dropout captures uncertainty in model parameters. Finally, the two sources of uncertainty are aggregated to estimate the total uncertainty in predictions of a GNN. Results on real-world datasets demonstrate that the Bayesian model performs on par with a frequentist model and provides additional information about prediction uncertainty that is sensitive to uncertainties in the data and the model.
    Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization. (arXiv:2205.10217v1 [stat.ML])
    The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least a layer with $\Omega(N)$ neurons, $N$ being the number of training samples. Furthermore, there is increasing evidence suggesting that deep networks with sub-linear layer widths are powerful memorizers and optimizers, as long as the number of parameters exceeds the number of samples. Thus, a natural open question is whether the NTK is well conditioned in such a challenging sub-linear setup. In this paper, we answer this question in the affirmative. Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks with the minimum possible over-parameterization: the number of parameters is roughly $\Omega(N)$ and, hence, the number of neurons is as little as $\Omega(\sqrt{N})$. To showcase the applicability of our NTK bounds, we provide two results concerning memorization capacity and optimization guarantees for gradient descent training.
    Generating Semantic Adversarial Examples via Feature Manipulation. (arXiv:2001.02297v2 [cs.LG] UPDATED)
    The vulnerability of deep neural networks to adversarial attacks has been widely demonstrated (e.g., adversarial example attacks). Traditional attacks perform unstructured pixel-wise perturbation to fool the classifier. An alternative approach is to have perturbations in the latent space. However, such perturbations are hard to control due to the lack of interpretability and disentanglement. In this paper, we propose a more practical adversarial attack by designing structured perturbation with semantic meanings. Our proposed technique manipulates the semantic attributes of images via the disentangled latent codes. The intuition behind our technique is that images in similar domains have some commonly shared but theme-independent semantic attributes, e.g. thickness of lines in handwritten digits, that can be bidirectionally mapped to disentangled latent codes. We generate adversarial perturbation by manipulating a single or a combination of these latent codes and propose two unsupervised semantic manipulation approaches: vector-based disentangled representation and feature map-based disentangled representation, in terms of the complexity of the latent codes and smoothness of the reconstructed images. We conduct extensive experimental evaluations on real-world image data to demonstrate the power of our attacks for black-box classifiers. We further demonstrate the existence of a universal, image-agnostic semantic adversarial example.  ( 2 min )
    Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation. (arXiv:2201.08504v3 [stat.ML] UPDATED)
    Deep reinforcement learning (DRL) has attracted much attention as an approach to solving sequential decision-making problems without mathematical models of systems or environments. In general, constraints may be imposed on the decision making. In this study, we consider optimal decision-making problems with constraints for completing temporal high-level tasks in the continuous state-action domain. We describe the constraints using signal temporal logic (STL), which is useful for time-sensitive control tasks since it can specify continuous signals within a bounded time interval. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), called a $\tau$-CMDP. We formulate the STL-constrained optimal decision-making problem as a $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we also demonstrate the learning performance of the proposed algorithm.  ( 2 min )
    Mitigating Statistical Bias within Differentially Private Synthetic Data. (arXiv:2108.10934v3 [stat.ML] UPDATED)
    Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.  ( 2 min )
    EXODUS: Stable and Efficient Training of Spiking Neural Networks. (arXiv:2205.10242v1 [cs.NE])
    Spiking Neural Networks (SNNs) are gaining significant traction in machine learning tasks where energy-efficiency is of utmost importance. Training such networks using the state-of-the-art back-propagation through time (BPTT) is, however, very time-consuming. Previous work by Shrestha and Orchard [2018] employs an efficient GPU-accelerated back-propagation algorithm called SLAYER, which speeds up training considerably. SLAYER, however, does not take into account the neuron reset mechanism while computing the gradients, which we argue to be the source of numerical instability. To counteract this, SLAYER introduces a gradient scale hyperparameter across layers, which needs manual tuning. In this paper, (i) we modify SLAYER and design an algorithm called EXODUS, that accounts for the neuron reset mechanism and applies the Implicit Function Theorem (IFT) to calculate the correct gradients (equivalent to those computed by BPTT), (ii) we eliminate the need for ad-hoc scaling of gradients, thus, reducing the training complexity tremendously, (iii) we demonstrate, via computer simulations, that EXODUS is numerically stable and achieves a comparable or better performance than SLAYER especially in various tasks with SNNs that rely on temporal features. Our code is available at https://github.com/synsense/sinabs-exodus.  ( 2 min )
    B-cos Networks: Alignment is All We Need for Interpretability. (arXiv:2205.10268v1 [cs.CV])
    We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training. For this, we propose to replace the linear transforms in DNNs by our B-cos transform. As we show, a sequence (network) of such transforms induces a single linear transform that faithfully summarises the full model computations. Moreover, the B-cos transform introduces alignment pressure on the weights during optimisation. As a result, those induced linear transforms become highly interpretable and align with task-relevant features. Importantly, the B-cos transform is designed to be compatible with existing architectures and we show that it can easily be integrated into common models such as VGGs, ResNets, InceptionNets, and DenseNets, whilst maintaining similar performance on ImageNet. The resulting explanations are of high visual quality and perform well under quantitative metrics for interpretability. Code available at https://www.github.com/moboehle/B-cos.
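    A rough sketch of the transform as we read it: the linear response is modulated by a power of the cosine similarity between input and (unit-norm) weight, so large outputs require alignment. This is an illustrative reading, not a verified reference implementation (see the linked repository for the latter):

    ```python
    import numpy as np

    def bcos_transform(x, W, B=2.0, eps=1e-9):
        """Sketch of a B-cos layer: each unit's linear response w^T x is scaled
        by |cos(x, w)|^(B-1), with unit-norm weights. For B = 1 this reduces to
        an ordinary linear layer; for B > 1, outputs are large only when the
        input aligns with the weight vector -- the 'alignment pressure' the
        abstract describes. (A reading of the paper, not its reference code.)"""
        W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
        lin = W_hat @ x                                 # w_hat^T x per unit
        cos = lin / (np.linalg.norm(x) + eps)           # cosine similarity
        return np.abs(cos) ** (B - 1) * lin

    rng = np.random.default_rng(0)
    print(bcos_transform(rng.normal(size=8), rng.normal(size=(4, 8))))
    ```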
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v3 [cs.LG] UPDATED)
    A significant bottleneck in federated learning (FL) is the network communication cost of sending model updates from client devices to the central server. We present a comprehensive empirical study of the statistics of model updates in FL, as well as the role and benefits of various compression techniques. Motivated by these observations, we propose a novel method to reduce the average communication cost, which is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on Stack Overflow next-word prediction, a realistic and challenging FL benchmark. This is achieved by examining the problem using rate-distortion theory, and proposing distortion as a reliable proxy for model accuracy. Distortion can be more effectively used for optimizing the trade-off between model performance and communication cost across clients. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds.
    Sparse Infinite Random Feature Latent Variable Modeling. (arXiv:2205.09909v1 [stat.ML])
    We propose a non-linear, Bayesian non-parametric latent variable model where the latent space is assumed to be sparse and infinite dimensional a priori using an Indian buffet process prior. A posteriori, the number of instantiated dimensions in the latent space is guaranteed to be finite. The purpose of placing the Indian buffet process on the latent variables is to: 1.) Automatically and probabilistically select the number of latent dimensions. 2.) Impose sparsity in the latent space, where the Indian buffet process will select which elements are exactly zero. Our proposed model allows for sparse, non-linear latent variable modeling where the number of latent dimensions is selected automatically. Inference is made tractable using the random Fourier approximation and we can easily implement posterior inference through Markov chain Monte Carlo sampling. This approach is amenable to many observation models beyond the Gaussian setting. We demonstrate the utility of our method on a variety of synthetic, biological and text datasets and show that we can obtain superior test set performance compared to previous latent variable models.
    Lifelong Neural Predictive Coding: Learning Cumulatively Online without Forgetting. (arXiv:1905.10696v3 [cs.LG] UPDATED)
    In lifelong learning systems based on artificial neural networks, one of the biggest obstacles is the inability to retain old knowledge as new information is encountered. This phenomenon is known as catastrophic forgetting. In this paper, we propose a new kind of connectionist architecture, the Sequential Neural Coding Network, that is robust to forgetting when learning from streams of data points and, unlike networks of today, does not learn via the popular back-propagation of errors. Grounded in the neurocognitive theory of predictive processing, our model adapts synapses in a biologically-plausible fashion while another neural system learns to direct and control this cortex-like structure by mimicking some of task-executive control functionality of the basal ganglia. In our experiments, we demonstrate that our self-organizing system experiences significantly less forgetting compared to standard neural models, outperforming a swath of previously proposed methods, including rehearsal/data buffer-based methods, on both standard (SplitMNIST, Split Fashion MNIST, etc.) and custom benchmarks even though it is trained in a stream-like fashion. Our work offers evidence that emulating mechanisms in real neuronal systems, e.g., local learning, lateral competition, can yield new directions for tackling the grand challenge of lifelong machine learning.
    On Algorithmic Stability in Unsupervised Representation Learning. (arXiv:2106.05238v3 [cs.LG] UPDATED)
    In this paper, we investigate the algorithmic stability of unsupervised representation learning with deep generative models, as a function of repeated re-training on the same input data. Algorithms for learning low dimensional linear representations -- for example principal components analysis (PCA), or linear independent components analysis (ICA) -- come with guarantees that they will always reveal the same latent representations (perhaps up to an arbitrary rotation or permutation). Unfortunately, for non-linear representation learning, such as in a variational auto-encoder (VAE) model trained by stochastic gradient descent, we have no such guarantees. Recent work on identifiability in non-linear ICA have introduced a family of deep generative models that have identifiable latent representations, achieved by conditioning on side information (e.g. informative labels). We empirically evaluate the stability of these models under repeated re-estimation of parameters, and compare them to both standard VAEs and deep generative models which learn to cluster in their latent space. Surprisingly, we discover side information is not necessary for algorithmic stability: using standard quantitative measures of identifiability, we find deep generative models with latent clusterings are empirically identifiable to the same degree as models which rely on auxiliary labels. We relate these results to the possibility of identifiable non-linear ICA.
    Sequentially learning the topological ordering of causal directed acyclic graphs with likelihood ratio scores. (arXiv:2202.01748v2 [stat.ME] UPDATED)
Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions and a proper learning algorithm, it is sometimes possible to identify and accurately estimate a causal directed acyclic graph (DAG), as opposed to a Markov equivalence class of graphs that leaves the causal directions ambiguous. The focus of this paper is on highlighting the identifiability and estimation of DAGs with general error distributions through a general sequential sorting procedure that orders variables one at a time, starting at root nodes, followed by children of the root nodes, and so on until completion. We demonstrate a novel application of this general approach to estimate the topological ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. The computational complexity of our algorithm on a p-node problem is O(pd), where d is the maximum neighborhood size. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying DAG. We provide extensive numerical evidence to demonstrate that this sequential procedure scales to possibly thousands of nodes and works well for high-dimensional data. We accompany these numerical experiments with an application to a single-cell gene expression dataset.
    The Fairness of Credit Scoring Models. (arXiv:2205.10200v1 [stat.ML])
    In credit markets, screening algorithms aim to discriminate between good-type and bad-type borrowers. However, when doing so, they also often discriminate between individuals sharing a protected attribute (e.g. gender, age, racial origin) and the rest of the population. In this paper, we show how (1) to test whether there exists a statistically significant difference between protected and unprotected groups, which we call lack of fairness and (2) to identify the variables that cause the lack of fairness. We then use these variables to optimize the fairness-performance trade-off. Our framework provides guidance on how algorithmic fairness can be monitored by lenders, controlled by their regulators, and improved for the benefit of protected groups.
    Counterfactual Temporal Point Processes. (arXiv:2111.07603v2 [cs.LG] UPDATED)
    Machine learning models based on temporal point processes are the state of the art in a wide variety of applications involving discrete events in continuous time. However, these models lack the ability to answer counterfactual questions, which are increasingly relevant as these models are being used to inform targeted interventions. In this work, our goal is to fill this gap. To this end, we first develop a causal model of thinning for temporal point processes that builds upon the Gumbel-Max structural causal model. This model satisfies a desirable counterfactual monotonicity condition, which is sufficient to identify counterfactual dynamics in the process of thinning. Then, given an observed realization of a temporal point process with a given intensity function, we develop a sampling algorithm that uses the above causal model of thinning and the superposition theorem to simulate counterfactual realizations of the temporal point process under a given alternative intensity function. Simulation experiments using synthetic and real epidemiological data show that the counterfactual realizations provided by our algorithm may give valuable insights to enhance targeted interventions.
    The Bayesian Context Trees State Space Model: Interpretable mixture models for time series. (arXiv:2106.03023v3 [stat.ME] UPDATED)
    A general hierarchical Bayesian framework is introduced for mixture modelling of real-valued time series, including a collection of effective tools for learning and inference. At the top level, a discrete context (or `state') is extracted for each sample, consisting of a discretised version of some of the most recent observations preceding it. The set of all relevant contexts are represented as a discrete context tree. At the bottom level, a different real-valued time series model is associated with each context (i.e., with each state). This defines a very general framework that can be used in conjunction with any existing model class to build flexible and interpretable mixture models. We introduce algorithms that allow for efficient, exact Bayesian inference; in particular, the maximum a posteriori probability (MAP) model, including the relevant MAP context tree, can be identified exactly. These algorithms can be updated sequentially, facilitating efficient online forecasting. The utility of the general framework is illustrated in detail when autoregressive (AR) models are used at the bottom level, resulting in a nonlinear AR mixture model. Our methods are found to outperform several state-of-the-art techniques on both simulated and real-world data from economics and finance, both in terms of forecasting accuracy and computational requirements.
    Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome. (arXiv:2205.09906v1 [stat.ML])
    Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. In addition, our data augmentations enable us to define a novel contrastive learning model, which improves on previous representation learning approaches for microbiome compositional data. Our code is available at https://github.com/cunningham-lab/AugCoDa.
    Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates. (arXiv:2205.09860v1 [cs.LG])
We consider optimizing two-layer neural networks in the mean-field regime where the learning dynamics of network weights can be approximated by the evolution in the space of probability measures over the weight parameters associated with the neurons. The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime, which is restricted to a local region of the so-called neural tangent kernel space around specialized initializations. Several prior works (\cite{mei2018mean, chizat2018global}) establish the asymptotic global optimality of the mean-field regime, but it is still challenging to obtain a quantitative convergence rate due to the complicated nonlinearity of the training dynamics. This work establishes a new linear convergence result for two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime. Our result relies on a novel logarithmic Sobolev inequality for two-layer neural networks, and uniform upper bounds on the logarithmic Sobolev constants for a family of measures determined by the evolving distribution of hidden neurons.
Graph Neural Networks Are More Powerful Than We Think. (arXiv:2205.09801v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful convolutional architectures that have shown remarkable performance in various node-level and graph-level tasks. Despite their success, the common belief is that the expressive power of GNNs is limited and that they are at most as discriminative as the Weisfeiler-Lehman (WL) algorithm. In this paper we argue the opposite and show that the WL algorithm is the upper bound only when the input to the GNN is the vector of all ones. In this direction, we derive an alternative analysis that employs linear algebraic tools and characterize the representational power of GNNs with respect to the eigenvalue decomposition of the graph operators. We show that GNNs can distinguish between any graphs that differ in at least one eigenvalue and design simple GNN architectures that are provably more expressive than the WL algorithm. Thorough experimental analysis on graph isomorphism and graph classification datasets corroborates our theoretical results and demonstrates the effectiveness of the proposed architectures.
    What's the Harm? Sharp Bounds on the Fraction Negatively Affected by Treatment. (arXiv:2205.10327v1 [stat.ME])
    The fundamental problem of causal inference -- that we never observe counterfactuals -- prevents us from identifying how many might be negatively affected by a proposed intervention. If, in an A/B test, half of users click (or buy, or watch, or renew, etc.), whether exposed to the standard experience A or a new one B, hypothetically it could be because the change affects no one, because the change positively affects half the user population to go from no-click to click while negatively affecting the other half, or something in between. While unknowable, this impact is clearly of material importance to the decision to implement a change or not, whether due to fairness, long-term, systemic, or operational considerations. We therefore derive the tightest-possible (i.e., sharp) bounds on the fraction negatively affected (and other related estimands) given data with only factual observations, whether experimental or observational. Naturally, the more we can stratify individuals by observable covariates, the tighter the sharp bounds. Since these bounds involve unknown functions that must be learned from data, we develop a robust inference algorithm that is efficient almost regardless of how and how fast these functions are learned, remains consistent when some are mislearned, and still gives valid conservative bounds when most are mislearned. Our methodology altogether therefore strongly supports credible conclusions: it avoids spuriously point-identifying this unknowable impact, focusing on the best bounds instead, and it permits exceedingly robust inference on these. We demonstrate our method in simulation studies and in a case study of career counseling for the unemployed.
    Causal Discovery and Injection for Feed-Forward Neural Networks. (arXiv:2205.09787v1 [cs.LG])
    Neural networks have proven to be effective at solving a wide range of problems but it is often unclear whether they learn any meaningful causal relationship: this poses a problem for the robustness of neural network models and their use for high-stakes decisions. We propose a novel method overcoming this issue by injecting knowledge in the form of (possibly partial) causal graphs into feed-forward neural networks, so that the learnt model is guaranteed to conform to the graph, hence adhering to expert knowledge. This knowledge may be given up-front or during the learning process, to improve the model through human-AI collaboration. We apply our method to synthetic and real (tabular) data showing that it is robust against noise and can improve causal discovery and prediction performance in low data regimes.
    Breaking the $\sqrt{T}$ Barrier: Instance-Independent Logarithmic Regret in Stochastic Contextual Linear Bandits. (arXiv:2205.09899v1 [stat.ML])
We prove an instance-independent (poly)logarithmic regret for stochastic contextual bandits with linear payoff. Previously, in \cite{chu2011contextual}, a lower bound of $\mathcal{O}(\sqrt{T})$ was shown for the contextual linear bandit problem with arbitrary (adversarially chosen) contexts. In this paper, we show that stochastic contexts indeed help to reduce the regret from $\sqrt{T}$ to $\polylog(T)$. We propose Low Regret Stochastic Contextual Bandits (\texttt{LR-SCB}), which takes advantage of the stochastic contexts and performs parameter estimation (in $\ell_2$ norm) and regret minimization simultaneously. \texttt{LR-SCB} works in epochs, where the parameter estimation of the previous epoch is used to reduce the regret of the current epoch. The (poly)logarithmic regret of \texttt{LR-SCB} stems from two crucial facts: (a) the application of a norm-adaptive algorithm to exploit the parameter estimation and (b) an analysis of the shifted linear contextual bandit algorithm, showing that shifting results in increasing regret. We have also shown experimentally that stochastic contexts indeed incur a regret that scales with $\polylog(T)$.  ( 2 min )

  • Open

    What do you think of this approach to help avoid overfitting? (Playing with episode length, start and end)
Hi all, I'm working on applying RL to a time series environment. I have a limited (and relatively small, only ~30K rows) dataset for the agent to work through, and the data cannot be simulated. So basically, the agent takes a step, and the new observation/state is determined by the next row of the tabular dataset. I have reached a point of overfitting - in-sample performance of the RL agent continues increasing, but OOS performance is decreasing. I assume a big part of the reason is that the agent is somehow "memorizing" its training data. Here's the thing. At the moment, I am using ALL of the training data as one episode (so once it completes one episode having gone through ALL of the data, it restarts). What if I made the episode length equal to, say, 1/10th of my data and then started each episode at a random point in the time series? I am thinking that this way, I can have some kind of "randomness" in my otherwise deterministic environment, and perhaps force the agent not to "memorize" the training data. I am new to RL, so any feedback on this general idea would be greatly appreciated! EDIT: I have just tried this out, and while it's hard to tell (because of random fluctuations), using this technique DOES appear to have a positive effect on the out-of-sample performance! Again though, thoughts, ideas and feedback greatly appreciated. submitted by /u/VladimirB-98 [link] [comments]  ( 1 min )
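For concreteness, here is a minimal sketch of that random-window idea (my code, not the poster's; the data is assumed to be a 2-D NumPy array of rows, and the reward is a problem-specific placeholder):

```python
import numpy as np

class RandomWindowSeries:
    """Serve a tabular time series in short, randomly placed episode
    windows instead of one long fixed-order pass over all rows."""

    def __init__(self, data, window_len, seed=None):
        self.data = data                      # 2-D array: rows = time steps
        self.window_len = window_len          # e.g. len(data) // 10
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Each episode starts at a random offset into the series.
        self.t = int(self.rng.integers(0, len(self.data) - self.window_len))
        self.end = self.t + self.window_len
        return self.data[self.t]

    def step(self, action):
        reward = 0.0                          # placeholder: problem-specific
        self.t += 1
        done = self.t >= self.end
        return self.data[min(self.t, len(self.data) - 1)], reward, done, {}
```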
    Need help with PettingZoo
Hello everyone, I am working on a custom environment using PettingZoo for multi-agent reinforcement learning. I managed to finish the environment, however now I don't know how to create the model. I also tried to create a model using stable_baselines3 PPO, however I get the error "AttributeError: 'functools._lru_cache_wrapper' object has no attribute 'shape'". I tried to follow the example from here, but I cannot install SuperSuit with pip on Windows 10 and I am not sure if it is needed. Any help, be it examples or advice, will be very much appreciated. submitted by /u/Iltavil [link] [comments]  ( 1 min )
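That AttributeError usually means stable_baselines3 is reading PettingZoo's observation_space(agent) method (an lru_cache-wrapped function) where it expects a gym Space attribute, i.e. the environment has not been converted to a gym-style vectorized env. A hedged sketch of the usual conversion (assuming a parallel-API PettingZoo env and that SuperSuit installs on the machine; the env class name is hypothetical):

```python
import supersuit as ss
from stable_baselines3 import PPO
from my_project.envs import MyParallelEnv  # hypothetical custom parallel env

env = MyParallelEnv()
env = ss.pettingzoo_env_to_vec_env_v1(env)   # one vec-env slot per agent
env = ss.concat_vec_envs_v1(env, 4, base_class="stable_baselines3")

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```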
Can reinforcement learning be used for unlabeled time series data clustering?
    submitted by /u/Affectionate_Worth43 [link] [comments]  ( 1 min )
    How should one interpret these PPO diagnostic training plots?
https://preview.redd.it/9in8206mu0191.png?width=1453&format=png&auto=webp&s=e221eead27f4e781a13e34586bc3ad87d13e7810 So here I have four diagnostic plots for PPO training on a Gym CustomEnv. I have many questions about how to interpret them, although if there is anything you think is interesting/insightful about these graphs, I would love to hear it. It might also be useful to know that training was indeed successful for this run, and the mean episode reward was (more or less) consistently improving. (1) What does an increasing clip fraction indicate? (2) What does an increasing KL divergence indicate? (3) Why does the policy gradient loss go above 0? Wouldn't this mean that the policy should be getting worse? In this case the policy continues to improve even after getting this positive loss. (4) Same as question 3, but for entropy loss. Any help whatsoever would be great; I'm quite at a loss. Thanks. submitted by /u/C_BearHill [link] [comments]  ( 1 min )
  • Open

    [D] matrix profile distance measure characterization
    If there are various types of distances measures for time series, such as Euclidean, DTW, and shape-based ones, how can we characterize the matrix profile distance measure? Profiling one? submitted by /u/jiii95 [link] [comments]
    How to create a basic version of DALL-E from scratch? [P]
I've been searching online for tutorials on how to create a text-to-image generator algorithm (a very basic version of DALL-E). I'd like to create the algorithm line by line rather than just copy some code from online and train it on my images. Basically, I'm trying to learn how these sorts of algorithms work by coding one from scratch. So, does anyone know of any video or online article tutorials for creating and training a basic text-to-image generator with Python? Thanks! submitted by /u/Special_Treacle4452 [link] [comments]  ( 1 min )
    [P] Gradio Demo for "Story and Video Generation" using GPT-J, Latent Diffusion, and FILM
    submitted by /u/Shikanomiya [link] [comments]
    This is how you can turn yourself into an immortal machine. [N]
    submitted by /u/Defiant_Swann [link] [comments]
    [r] How to train a neural network with a loss function based on cumulative AUC of multiple inputs
Hello, I have almost no experience with neural networks, besides doing a few examples in books. Basically the problem is I have 1000 cases, of which 100 are true and 900 are false. Each case has a score associated with it composed of multiple weighted sub-scores (Overall Score = W1 * Score 1 + W2 * Score 2 + ... + WN * Score N). The cases are then sorted on their overall score and a ROC curve is generated by tracking the running percent of positives identified vs. percent of negatives identified as you follow the rank-ordered list of scores. I want to find a single set of weights for the sub-scores which, when applied to all test cases, produces the maximal AUC. Any input is appreciated as I have no idea where to start. Thanks! submitted by /u/Fckcensorship25 [link] [comments]  ( 1 min )
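AUC itself is not differentiable, but a standard workaround (an assumption on my part, not the only option) is its pairwise surrogate: AUC is the probability that a random positive outscores a random negative, so applying a sigmoid to every (positive, negative) score difference gives a trainable loss. A PyTorch sketch with synthetic stand-ins for the 1000 cases:

```python
import torch

def pairwise_auc_loss(scores_pos, scores_neg):
    # Smooth surrogate for 1 - AUC: penalize each (positive, negative)
    # pair in which the negative outscores the positive.
    diff = scores_neg.unsqueeze(0) - scores_pos.unsqueeze(1)  # shape (P, N)
    return torch.sigmoid(diff).mean()

X = torch.randn(1000, 8)                     # 8 hypothetical sub-scores
y = torch.zeros(1000); y[:100] = 1.0         # 100 true, 900 false cases
w = torch.zeros(8, requires_grad=True)       # one weight per sub-score
opt = torch.optim.Adam([w], lr=0.01)

for _ in range(500):
    scores = X @ w                           # Overall = sum_i W_i * Score_i
    loss = pairwise_auc_loss(scores[y == 1], scores[y == 0])
    opt.zero_grad(); loss.backward(); opt.step()
```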
    [D] Cheaper gpu cloud alternative to gradient (paperspace)
Are there any cheaper subscription models like Gradient? Currently I pay 10 dollars a month for access to an RTX 5000 or P5000 for 6 hours at a time, after which you have to restart the instance, but the data is saved. Are there any better alternatives to this? Also, I'm a student, in case there is a student discount. submitted by /u/Knightron2525 [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 138
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 131-140 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 131 Week 2 Week 1…  ( 1 min )
    [P] PyTorch M1 GPU benchmark update including M1 Pro, M1 Max, and M1 Ultra after fixing the memory leak
    If someone is curious, I updated the benchmarks after the PyTorch team fixed the memory leak in the latest nightly release May 21->22. The results are quite improved: ​ https://preview.redd.it/5dkat9hoi3191.png?width=2637&format=png&auto=webp&s=dc42ee03167dd3aefbd0319061994bfc2ff24dab For a more detailed write-up please see https://sebastianraschka.com/blog/2022/pytorch-m1-gpu.html submitted by /u/seraschka [link] [comments]  ( 1 min )
    [D] Generating fake research papers with GPT-3
Hi! I wanted to generate fake research papers with GPT-3. I'm not sure how to do it in a cost-efficient way: should I fine-tune the model with custom examples? Since I am looking to generate the entire paper, I'm not sure how to maintain the context across different parts of the research paper (abstract, introduction, ..., conclusion). I was thinking of two paths: (1) Take an existing research paper and feed it into GPT-3 to reword the entire paper. (2) Take the title of an existing research paper and make GPT-3 generate the abstract; use the abstract to generate the introduction, the introduction to generate the description, and so on. I was planning to fine-tune the GPT-3 model by breaking down existing research papers into (prompt: title, completion: rest of the paper) pairs. Is there a better way to generate entire fake research papers? submitted by /u/mimeticaware [link] [comments]  ( 1 min )
    [D] Explainable AI
Does anyone know of papers where experts are asked to interpret the results of explanation algorithms like SHAP and LIME? Ideally on time-series data, where the explanation algorithm highlights which time points led to the model's decisions. submitted by /u/Bananymous97 [link] [comments]  ( 1 min )
    [R] HYDRA is the first spatial perception engine that builds a 3D scene graph (geometry and semantics) from sensor data in realtime
    submitted by /u/SpatialComputing [link] [comments]  ( 2 min )
    [D] EMNLP 2022 and ARR June 1
    Hi there! We are planning to submit a paper to the June 1 ARR deadline and commit to EMNLP 2022 later. I wonder if reviews will be out before July 24, the ARR-to-EMNLP commitment deadline. Are you submitting to EMNLP directly or via ARR this year? ​ Thanks in advance! Wishing you all the luck with paper submissions! submitted by /u/bunsenfeng [link] [comments]  ( 1 min )
[D] Which of these is a better ML strategy?
Using R here. Strategy (1): set a seed and divide the data into train and test only once; use repeated cross-validation on the train data to select the best model out of all the algorithms you're interested in; finally, use the best-performing model on the test data and report the performance metrics. Strategy (2): divide the data into train and test, say, 10 times by changing the seed; train a model on each train set and report the performance metrics via the predict function on each test set, then use the test-set performance metrics to choose the model. I am trying these strategies because I have only 2 very limited sets of biological data with approximate n:p ratios of 1:7 and 2:7, and when changing the seeds the performance on the test set changes drastically - R2 values vary from negative to 0.50. submitted by /u/triary95 [link] [comments]  ( 2 min )
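The post works in R, but strategy (1) is easy to sketch in Python with scikit-learn for comparison (synthetic data at roughly the stated 1:7 n-to-p ratio): repeated CV on the training split selects the model, and the held-out test set is touched exactly once.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split

X, y = make_regression(n_samples=70, n_features=490, random_state=0)  # ~1:7 n:p
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
candidates = {"ridge": Ridge(), "rf": RandomForestRegressor(random_state=0)}
cv_scores = {name: cross_val_score(est, X_tr, y_tr, cv=cv, scoring="r2").mean()
             for name, est in candidates.items()}

best = max(cv_scores, key=cv_scores.get)      # model selection on train data only
final = candidates[best].fit(X_tr, y_tr)
print(best, final.score(X_te, y_te))          # report test R^2 exactly once
```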
    [D] GNN Architecture that inputs and outputs both edge and node features?
I am looking for a GNN architecture/layer that takes in both node and edge features and outputs both node and edge features too. The only one I could find is the "MetaLayer" in PyTorch Geometric, from the 2018 paper "Relational inductive biases, deep learning, and graph networks". There must be something newer out there. Any links to papers and/or implementations would be appreciated! submitted by /u/Hobo-Wizzard [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
[R][P] A Python package for unsupervised clustering of mixed data types
Problem statement: Clustering is an unsupervised machine learning approach for grouping unlabeled data. Unfortunately, most clustering methods can only operate on data that has either numeric or categorical properties. This is a significant issue since most real-world datasets contain multiple types of features. Solution: Today I am releasing the first version of the mixclu package for mixed-data clustering. Please check it out and give it a star if you like it. Mixclu (Mix Clustering) is an open-source Python package for unsupervised clustering of mixed data types. This package puts together a variety of combination models, including kmeans-one-hot, Gower distance, UMAP, etc. The goal is to provide an easy-to-use implementation of each algorithm. The package is still in progress, and I plan to add deep learning-based methods such as autoencoders and other techniques. I also plan to implement a few papers on mixed clustering problems. Package link: https://github.com/monk1337/Mixclu Actively looking for contributions on the following tasks: Implement the paper "Affinity Learning for Mixed Data Clustering" (I have Matlab code & data from the article's author; if anyone wants to contribute by converting the Matlab code to Python, please feel free to contact me). Implement the paper "A Multi-View Clustering for Mixed Data" (I have Java code & data from the author; please feel free to contact me to contribute). More features and suggestions are welcome :) submitted by /u/aadityaura [link] [comments]  ( 1 min )
    [Discussion] Name of a (possibly non-existent) paper stating vision-language pretraining can improve performance on text-only tasks?
    I'm trying to remember the name of a paper I think I read, though there's a chance this paper doesn't exist and I'm just mixing it up with something else. In the off-chance I'm remembering it correctly, I think it trained a vision-language model and showed that it outperformed a unimodal text-only model on language tasks? So the idea of the paper was vision-language pretraining doesn't just help in vision-language tasks, but actually improves performance on text-only tasks. If this paper is real, does anyone remember what it was called or how I can find it? submitted by /u/BurnerAccount1100 [link] [comments]  ( 1 min )
    [D] machine learning on sequential data with gaps by design
hi, in my research, we're doing binary classification on sequential data. some of the sequences have gaps, or start and end at different points, and this is intentional, since the values are normalized flux at different wavelengths and many points are missing due to absorption features. it is important to know where the missingness occurs in each sequence. essentially, the gaps are intentional and this has created a lot of inconsistency and variance in the dataset, but the points can't be interpolated. i've kind of hit a wall after attempting to do manual feature extraction (how scattered the non-missing data are at different wavelengths, whether there is a large gap in the middle, etc.). how should i handle this missing-by-design data? is it possible to pad the sequences or fill in the gaps with -99? will a standard RNN still work? thank you in advance. submitted by /u/queenofthekyriarchy [link] [comments]  ( 1 min )
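One alternative to a -99 fill value (a suggestion, hedged): zero-fill the gaps and stack an explicit 0/1 missingness mask as a second input channel, so the network is told where the gaps are rather than asked to infer a magic number. A PyTorch sketch:

```python
import torch
import torch.nn as nn

def with_mask(seq, length):
    """Pad a 1-D flux sequence to a fixed length and stack a 0/1 mask
    channel; intentional gaps (NaN) are zeroed in the values and
    marked 0 in the mask."""
    x = torch.full((length,), 0.0)
    m = torch.zeros(length)
    n = min(len(seq), length)
    x[:n] = torch.as_tensor(seq[:n], dtype=torch.float32)
    m[:n] = 1.0
    nan = torch.isnan(x)        # gaps inside the observed range
    x[nan] = 0.0
    m[nan] = 0.0
    return torch.stack([x, m], dim=-1)        # shape (length, 2)

rnn = nn.GRU(input_size=2, hidden_size=64, batch_first=True)
batch = torch.stack([with_mask([1.0, float("nan"), 0.8], 16)])  # (1, 16, 2)
out, h = rnn(batch)
```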
[D] Clustering high-dimensional data
I have a dataset containing multiple cancer cell lines (rows). The dataset features/columns are different genes, where each gene has a value based on its PDUI (polyadenylation site usage index) score (between 0 and 1). The dataset has 53 rows (cell lines) and 12,500 columns (genes). For each cell line, I also have an IC50 value (taken from a different dataset) which shows resistance/sensitivity to an anti-cancer drug being developed. I would like to cluster the cell lines considering all their genes' PDUI scores and then create a plot against IC50, to see if there are any relationships. A problem is that there are some genes where the majority of the cancer cell lines (in some cases 99%) have missing/NA values (as they do not have that gene). This leads to a large portion of the dataset being filled with NA values. To deal with that, I drop every column where more than 20% of the samples do not have a value. For the remaining columns (20% or fewer NA values), I use a KNN imputer to fill in their missing values based on their 5 nearest neighbours. This results in approximately 6,000 columns/features remaining, with no missing values. What algorithm do you think will perform best for clustering such a high-dimensional dataset? How would you approach such a task? I have of course tried to reduce the dimensions using PCA, reducing the dataset to just 50 features (whilst keeping a 98.8% explained variance ratio). I then tried using HDBSCAN, but the algorithm failed to find any clusters. I am open to any suggestions and would greatly appreciate any discussion! submitted by /u/Rafaelkoll [link] [comments]  ( 5 min )
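For what it's worth, the described pipeline is compact in scikit-learn (the file name below is hypothetical); with only 53 samples, HDBSCAN's min_cluster_size is the first knob to lower, since the default of 5 is already ~10% of the data, which may explain the empty clustering:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
import hdbscan  # pip install hdbscan

df = pd.read_csv("pdui_scores.csv", index_col=0)  # hypothetical: 53 x 12500

# Drop genes missing in >20% of cell lines, impute the rest from 5 neighbours.
keep = df.columns[df.isna().mean() <= 0.20]
X = KNNImputer(n_neighbors=5).fit_transform(df[keep])

Z = PCA(n_components=50, random_state=0).fit_transform(X)

# With 53 samples, a small min_cluster_size keeps HDBSCAN from
# labelling everything as noise.
labels = hdbscan.HDBSCAN(min_cluster_size=3, min_samples=1).fit_predict(Z)
print(np.unique(labels, return_counts=True))
```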
    [P] GPT-NeoX Playground
We deployed the 20B-parameter GPT-NeoX model by #EleutherAI for anyone who wants to try it out (some features are disabled on mobile). GPT-NeoX is a 20-billion-parameter language model, open-sourced by EleutherAI. Link: https://neox.labml.ai/main Features I'd like to highlight: You can pick random sampling, nucleus sampling, greedy sampling or a beam search. Hover over tokens to see alternatives and the predicted probabilities. Outputs are shown in the UI as they are received. You can find our simple annotated PyTorch re-implementation of GPT-NeoX here. We'd love to hear your feedback and suggestions. Thank you all, and I appreciate the support. submitted by /u/hnipun [link] [comments]  ( 1 min )
[P] TF/Keras + GA: is there a "best" one?
I use TF/Keras + "KerasGA (PyGAD)" for 10-30-parameter (climate) models, so far with not much success. So I want to try other (Python-based) GA packages, like "Evolutionary Keras" and "Keras+DEAP"... My questions: Are there other powerful GA packages for Keras which I should try? Is there something like a "best" package for complex, "difficult" problems? submitted by /u/rudel_s [link] [comments]  ( 1 min )
    [D] Why is the skip-gram model used in DeepWalk and Node2vec rather than CBOW?
    I am curious to understand the reason behind the decision to use Skip-gram rather than CBOW for these two models. According to the original Word2vec paper, CBOW is faster to train and captures syntactic similarities better whereas the skip-gram is slower at training but captures more robust semantic similarities and is also better at handling infrequent words. How does this apply to graph theory and what motivated this decision? DeepWalk: https://dl.acm.org/doi/abs/10.1145/2623330.2623732 Node2vec: https://dl.acm.org/doi/abs/10.1145/2939672.2939754 submitted by /u/LanverYT [link] [comments]  ( 1 min )
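Regardless of the why, the skip-gram/CBOW choice is a one-flag ablation in practice. A minimal DeepWalk-style sketch (assuming networkx and gensim are installed), with sg=1 selecting skip-gram:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()

def random_walk(g, start, length=40):
    # Uniform random walk; node ids are stringified to act as "words".
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return [str(n) for n in walk]

walks = [random_walk(G, n) for n in G.nodes() for _ in range(10)]
model = Word2Vec(walks, vector_size=64, window=5, sg=1, min_count=0, epochs=5)
# sg=0 would switch to CBOW: an easy ablation for the question above.
print(model.wv.most_similar("0", topn=3))
```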
[R] [P] Train dataset with similar objects too close. Do I have to keep only one to detect with bounding boxes?
Hi guys, I have training data with many similar objects close together in each scene, and the objects are very wide; if I remove the others from the training image, keeping only one, the training image then becomes too wide. I'm asking because I will use bounding boxes to recognize the objects for counting purposes. Can I keep only one object and fill the rest of the image with black to maintain a square aspect ratio? I will probably have to do some data augmentation too (rotating). Thanks! submitted by /u/MrWrodgy [link] [comments]  ( 2 min )
    [R] Happy to share our latest Research paper : MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering
I am pleased to share the news with the ML community that the team I work for recently released a new benchmark: MedMCQA, a new large-scale Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. Our paper was accepted at the Conference on Health, Inference, and Learning (CHIL) 2022 and published in the Proceedings of Machine Learning Research (PMLR). MedMCQA sample questions. The main contributions: MedMCQA has more than 194k high-quality medical entrance exam MCQs. The dataset requires deeper domain and language understanding, as it tests 10+ reasoning abilities of a model across a wide range of medical subjects & topics. It covers 2.4k healthcare topics and 21 medical subjects with an average token length of 12.77, the …  ( 2 min )
  • Open

    Apple’s former ML director reportedly joins Google DeepMind
    submitted by /u/KendraMontgomery [link] [comments]
    Hack3: The Leading Online Hackathon for High Schoolers!
https://preview.redd.it/lmy70wdji3191.png?width=1270&format=png&auto=webp&s=a7737c6828ea94682d0dcef58f7afb391bcba0d0 Attention, curious high schoolers! Hack3 is hosting a 24-hour online hackathon for high schoolers on June 25-26. In 2020, we connected nearly 300 students of all skill levels to learn to build innovative projects that positively impacted the world; over 100 attended our free classes led by industry professionals, and over twenty mentors staffed our help desk to assist participants when they needed help. In 2021, we connected over 350 students of all skill levels; over 150 attended our free classes, and over 30 mentors staffed our help desk. Our judges, mentors, and workshop instructors have been affiliated with the likes of Stanford, Harvard, Amazon, NetApp, Wikipedia, Balsamiq, Nexus Bytes, Replit, Postman, and Wolfram Language. This year, with the lessons learned from 2021, we aim to host a competition of over 500 participants while reaching underprivileged communities around the world. To help achieve our goal of providing a learning opportunity for everyone, we will be sponsoring internet access for those who need it, to truly level the playing field for all. Are you down? Register on DevPost here. submitted by /u/thegreatestgemini [link] [comments]  ( 1 min )
    General AI through scaling? Meta's AI chief Yann LeCun speaks out: "We have a number of obstacles to clear, and we don’t know how."
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    AI book reading tool
Greetings folks, in the past year we released a book-reading AI tool for searching content within files using natural language search, and we received constructive feedback from the community. Today we are releasing a new updated version with a fresh UI overhaul (desktop support): https://rastero.io/intro Glad to share it with you all. Here's an account to try it out! username: reddit password: reddit2022 https://reddit.com/link/uvhgx5/video/7b49kj1jq2191/player submitted by /u/deep_ak [link] [comments]  ( 1 min )
    HYDRA is the first spatial perception engine that builds a 3D scene graph (geometry and semantics) from sensor data in realtime
    submitted by /u/SpatialComputing [link] [comments]  ( 1 min )
    10 Ways AI Will Change The World By 2050
By 2050, AI will reach remarkable advancements that will be beyond many people's wildest dreams. Robots will not only be able to perform tasks, but also generate them, in a cost-effective, timely, and meticulous manner, hence increasing efficiency. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Should AI Be Democratized?
AI is considered a "general-purpose technology" due to its pervasive role in various industries. Many researchers argue that AI should be employed only in high-impact domains, while others advocate democratizing AI so that its benefits are not restricted to a small group of people. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    AI Dream 53 - Cosmic Creation | MASTERPIECE TEASER
    submitted by /u/LordPewPew777 [link] [comments]
Is there an AI which is able to combine 2 or more images?
Or is there an AI which I can use to combine 2 images: like a dragon from one image and a background from another image, etc.? submitted by /u/xXNOdrugsForMEXx [link] [comments]
    How to Control an AGI via Motivation Selection
    My Dear AI Fellows, Please check out my latest video about how to control an AGI via Motivation Selection: https://youtu.be/rLB4xkwgEAw I also have a lot of great content on the channel regarding life 3.0, building an AGI, AGI Safety, etc. Please check them out and subscribe to my channel! submitted by /u/billgggggg [link] [comments]
    Greg Brockman on Twitter - I also really like this one, created by the prompt "DALL-E dreaming of becoming an AGI"
    submitted by /u/mofosyne [link] [comments]  ( 1 min )
    Building an ACOG (part 1) (artificial cognitive entity)
    submitted by /u/DavidKShapiro [link] [comments]
    2029 AGI really tho?
Tell me, is that far-fetched? I had an argument with a family member about when AGI arrives, and whether it arrives at all. Is 2029 a reasonable date? I think so, but I need some backup. submitted by /u/Ashamed-Asparagus-93 [link] [comments]  ( 2 min )
    For games - A lot of Artificial Intelligences are just really deep exploration trees. Not learning anything intuitive like a human
    This has been bugging me for a bit. When it comes to Chess - Stockfish is powerful because it can search incredibly deep. It doesn't 'look' at a position and 'think' a move is good. Maybe Alphazero is slightly different - as that has the deep search plus the evaluation? I can't remember exactly tbh. Poker AIs are just very deep brute force approaches too. Getting experience through tonnes of simulations. AlphaStar has to be different - this does learn general principles as the games are far too different each time. AlphaGo - I don't think I have the knowledge to comment on this one, but from the Chess and Poker ones, I would assume it's just a very deep search, but not 100% sure if anyone can clarify. Been a while since I watched the doc / read about it. I also experimented with a NN that 'solves' tic tac toe. It was a hello world level tutorial. But it didn't do anything special - the neural network was just memorizing the optimal solution for every position. When I expanded the game from TTT to connect 4, I realised it wasn't 'learning' anything from new positions. It was just memorizing every position - not impressive. Is this basically where we are at with AIs? They just have to play / simulate millions - trillions of times to outperform humans? submitted by /u/Cwlrs [link] [comments]  ( 1 min )
  • Open

    What strategies do you use to normalize data like distances to objectives?
    Hey, I am exploring neural networks and wondering how people handle unknown data ranges like distance. I'm not sure if I want to decide on an arbitrary maximum distance and normalize for that or if there's a standard way to deal with these kinds of things. One idea I had is to not use the actual distance but instead give a normalized value of the significance. If something is beyond a certain distance then it has an influence of 0 and if it's something like the creature is about to walk into a fire then the significance would be close to 1. Does that make sense? I've been seeing interesting evolution simulators (no training, occasional random changes and connections between neurons) and wanted to try my own. I made a creature with an x coordinate and known distance to a trap. I wasn't sure how to normalize the x coordinate or distance at large values so I thought of my 2nd idea there instead. Maybe that makes more sense for a sensory input anyway? Open to ideas or lexicon that will help me find an answer, thanks! submitted by /u/Tomnnn [link] [comments]  ( 1 min )
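One common alternative to picking a hard maximum distance (a suggestion, not the only standard) is a smooth decay that maps any distance into (0, 1], which is exactly the "significance" the post describes:

```python
import numpy as np

def significance(distance, scale):
    """Map an unbounded distance to a bounded significance in (0, 1]:
    ~1 right on top of the object (about to walk into the fire),
    ~0 far away. `scale` tunes how fast influence decays; it is a
    soft tuning choice rather than a hard maximum distance."""
    return float(np.exp(-distance / scale))

print(significance(0.0, 10.0))    # 1.0
print(significance(50.0, 10.0))   # ~0.0067: effectively no influence
```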
    Overfitting in first epochs
Hey everyone, I am training a CNN on the DVS Gesture dataset using PyTorch. However, training is not progressing smoothly: the training and validation accuracies fluctuate a lot. Both are improving, but there is a big difference between them (5-6%, up to 10%), as if overfitting sets in around epoch 3 or 4. I have tried L2 regularization as well as dropout with high values; the difference disappears in the first iterations but reappears strongly afterward, and I am sure the datasets are properly merged and split randomly. PS: Could this be underfitting? How would I identify underfitting? Thanks in advance! submitted by /u/StartFinancial5917 [link] [comments]  ( 2 min )

  • Open

    Can I somehow enhance the quality of this image
I don't know if this is the right subreddit to post on, but I want to enhance this image: https://preview.redd.it/pq435gmd3x091.jpg?width=626&format=pjpg&auto=webp&s=5a01c6de7402dcb8d83e069cc5410a6c9c9d9017 submitted by /u/AdnanZXgamer [link] [comments]
    Without fail 😁🔥
    submitted by /u/p0goniphaft111 [link] [comments]
Who is the godfather of AI?
How come Krishnamurti said in 1981 that 'the computer can do ANYTHING a human being can do', at a time when personal computers were very slow compared to today? This might be a longer rant about these questions; it offers a perspective where I suggest answers, as well as regaining control of a territory once lost: the mind. Let's start with the almost three-hour-long documentary by Adam Curtis called Hypernormalisation, in which he describes how governments no longer attempt to deal with the complexity of the real world, but only with the models created. What should have been a map describing the world assumed a primary role for actions and decisions. Reality becomes 'something you handle' and perception management of the masses became important. The process described by Ada…  ( 2 min )
    Can we train AI to capture the process of scientific evolution, and then fast-forward it to obtain future technology?
    Partly inspired by this article: https://www.quantamagazine.org/machine-scientists-distill-the-laws-of-physics-from-raw-data-20220510/, which describes AI that discovers new biology/physics equations from raw data. My question is: humans have come a long way from throwing stones to having all the technologies today. This thousands of years of evolution is a process in which new knowledge is developed from existing knowledge–countless cycles of observation, experimentation, and conclusion. Is it possible then, to train an AI to capture this process of generating new knowledge from existing knowledge, and use this AI to fast-forward scientific evolution, thus quickly obtaining future technology that would otherwise take decades to develop? submitted by /u/Independent_Ant_2027 [link] [comments]  ( 1 min )
    Pretty roses.
    submitted by /u/cookingandcraft [link] [comments]
    GATO outperformed experts on 450 tasks out of 604. A bold step towards an AGI
    submitted by /u/imapurplemango [link] [comments]  ( 1 min )
What website can make an image like this with just a prompt?
    submitted by /u/Due-Ad9795 [link] [comments]
    How Uber uses AI to serve you better
    submitted by /u/OnlyProggingForFun [link] [comments]
    Apple Executive Who Left Over Return-to-Office Policy Joins Google AI Unit
    submitted by /u/bartturner [link] [comments]
    A Look At 3 Big Risks With AI
    https://www.youtube.com/watch?v=0kEqqP8PlUw submitted by /u/kbf_ [link] [comments]
    Best Natural Language Processing Books for Beginners to read in 2022
    submitted by /u/maneesh123456 [link] [comments]
    I'm looking for AI image generators that make art based on the image(s) you provide
    Hi, instead of text-to-image AI generators, I'm wondering if there are any alternatives to The Looking Glass AI. The Looking Glass AI asks you for an image, to generate images based on them and a theme input. I tried looking for alternatives myself but I was only able to come across text-to-image generators so far. submitted by /u/DTanya [link] [comments]  ( 1 min )
  • Open

    Research that may interest those working on the development of AGI (it identifies how consciousness/cognition evolved in living processes)
    A number of key challenges stand in the way of the development of HLAI and AGI. However, overcoming them has been handicapped by a lack of relevant progress in the human sciences: they have failed to produce clear understandings of how some key functions that are absent in current AI are enabled in humans. A paper just published in the journal BioSystems deals with some of these key issues: for example, it identifies at a cybernetic/systems level how consciousness, 'real-time' sensorimotor coordination, and mental modelling/cognition arose and developed in living organisms. These insights might prove useful in identifying how these functions could be instantiated and developed in AI. Titled 'The evolution and development of consciousness: the subject-object emergence hypothesis', the fu…  ( 2 min )
What is the AlphaGo training time measured in human time?
I was listening to a talk where it was mentioned that the AlphaGo training time in human time is 56 years. But there was no reference for that information. I'm wondering whether it's true, and how this was calculated. Any leads are appreciated, thanks! submitted by /u/exenson [link] [comments]  ( 1 min )
    Is there a way to sample a constant action for the whole episode?
    Hi, Usually actions are sampled for each step and applied to the agent. However, is there a way to sample an action that remains constant until the episode ends? I am using Isaac Gym. submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
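In a gym-style API (Isaac Gym's task interface differs, so treat this as a sketch of the pattern rather than Isaac Gym code), a thin wrapper does this: sample once at reset, replay it every step.

```python
import gym

class ConstantActionWrapper(gym.Wrapper):
    """Sample one action at reset and apply it unchanged until done."""

    def reset(self, **kwargs):
        self.frozen_action = self.action_space.sample()
        return self.env.reset(**kwargs)

    def step(self, action):
        # Ignore the incoming action; apply the frozen one instead.
        return self.env.step(self.frozen_action)
```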
    PPO learns how to play browser-based game
Hi everyone, a few years back I wrote a JavaScript game as a fun side project. Now that I'm studying reinforcement learning, I wanted to see if I could code PPO into the game so that it could train *directly in the browser* without needing Python or any other dependencies... Well, I couldn't find any implementations of PPO in JavaScript, so I decided to code it into the game myself. I'm really happy with the end result - I've even included a pretrained agent that can autoplay the game live! (it beats the game ~85% of the time 😅) Source code: https://github.com/hmomin/ppo-winter-run Game: https://winter-run.com Hope you enjoy! I'm wondering if you all have any interest in a PPO implementation for JavaScript/TypeScript? If so, I may rework the source code a bit to play more easily with gym-like environments and release it as a separate project in a future post... submitted by /u/hmomin [link] [comments]  ( 1 min )
    Recommendation system with limited data?
Hi! I'm attempting to build a recommendation system using reinforcement learning for a side project. I was wondering if there are algorithms that work well with limited data? Thanks in advance! Edit: also wondering if there are any systems that are trained by reaching a goal (beating an opponent in games) and are used in a different field (not gaming). submitted by /u/brioche789 [link] [comments]  ( 1 min )
    12 Best Courses to Learn Deep Learning
    submitted by /u/MlTut [link] [comments]
  • Open

    [R][P] Gradio Web demo for StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators (SIGGRAPH 2022)
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    Deepfakes - How Long Do They Take to Make? [D]
How long do deepfakes take to make? Example scenario: I make a 1-minute video where I'm talking to the camera, I want to swap out my face for someone else's, and I want it to be realistic (so people wouldn't be able to tell it's not my face). Questions: What is the best way to achieve this? How long does it take? What is the easiest way to achieve this? Is DeepFaceLab the best option? submitted by /u/AviatorPrints [link] [comments]  ( 1 min )
    [N] Surgical Tracking Challenge
    We, at the Hamlyn Centre - Imperial College London, will be hosting a first of its kind Tracking Challenge for Surgery at MICCAI 2022. If you are interested, please visit: https://surgt.grand-challenge.org submitted by /u/aweld20 [link] [comments]
    [D] Looking for examples of failures of correlation-based NLP
I am looking for examples (e.g. screenshots of dialogues) of using recent NLP models and getting unintuitive results because the model infers a wrong causal relationship between two words that often occur together. submitted by /u/dr_cosmicomical [link] [comments]
    [D] Character-level vs. word-level tokenization
Hi all, I'm relatively new to the field of NLP and while reading a blog post from 2015, The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy, I was wondering about this part of the "Further Reading" section: Currently it seems that word-level models work better than character-level models, but this is surely a temporary thing. Aren't most state-of-the-art models these days using some kind of vocabulary, i.e. whole words or at least sub-words? Text in the wild can be full of typos, emojis or other Unicode craziness, so wouldn't training all these LLMs at the character level make them more flexible and better applicable to real-life problems? I'd love to hear your opinions about this and to be pointed towards good resources to learn more about different tokenization methods and their limitations and performance implications. Cheers. submitted by /u/CodeAllDay1337 [link] [comments]  ( 2 min )
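As a toy illustration of the robustness point (pure Python, nothing from the post): a character vocabulary built from a tiny corpus survives typos and rare symbols that a word vocabulary cannot, at the cost of much longer input sequences.

```python
corpus = ["the cat sat", "the dog ran"]
word_vocab = {w for s in corpus for w in s.split()}  # whole words
char_vocab = {c for s in corpus for c in s}          # single characters

test = "the cat rann 🤖"
print([w for w in test.split() if w not in word_vocab])  # ['rann', '🤖']
print([c for c in test if c not in char_vocab])          # ['🤖']
print(len(word_vocab), len(char_vocab))                  # 5 vs 12
```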
    [Project] How to create visualizations for "complex" networks?
    For a presentation I need to make my own visualization of a graph neural network with encoders and other additional parts. I would like to get something like this: ​ https://preview.redd.it/yi1erwam1t091.png?width=868&format=png&auto=webp&s=adabbe066182b2065b835ea50e8fc83c3574910a Does anyone know a convenient way to do this? submitted by /u/mr_birrd [link] [comments]  ( 2 min )
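One lightweight option (a suggestion on my part, not necessarily the tool behind the figure above) is the graphviz Python package, which suits block-and-arrow architecture diagrams; it assumes the Graphviz system binaries are installed:

```python
from graphviz import Digraph

# Block-level diagram of a hypothetical GNN pipeline, laid out left-to-right.
g = Digraph("gnn", graph_attr={"rankdir": "LR"})
for name in ["Input graph", "Encoder", "GNN layers", "Readout", "MLP head"]:
    g.node(name, shape="box")
g.edges([("Input graph", "Encoder"), ("Encoder", "GNN layers"),
         ("GNN layers", "Readout"), ("Readout", "MLP head")])
g.render("gnn_architecture", format="png", cleanup=True)
```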
    [N] Introducing PeerXiv - A modern platform for peer-review of preprints
(Check out the Twitter thread) What would a peer review process look like if it were designed today? Peer review is one of the cornerstones of the research community, and yet while our community keeps advancing and growing, the reviewing process remains almost unchanged. We strongly believe that peer review can be so much better for both authors and reviewers, and we are excited to share PeerXiv, our proposal to do just that. Check out the PeerXiv Mock: https://peerxiv.web.app/about https://i.redd.it/g0nv9iz94s091.gif PeerXiv is a modern platform for the peer review of preprints. Authors can submit their preprints and get feedback directly from a set of anonymous PeerXiv reviewers who earn reputation points for their effort 📈 https://preview.redd.it/5rfck4kb4s091.jpg?width=1022&…  ( 2 min )
  • Open

    Decoding a grid square
    I saw a reference last night to the grid square EL29fx and wanted to figure out where that is. There are many programs that will do this for you, but I wanted to do it by hand. I wrote about how grid squares work a year ago, but I was rusty on the details, so […] Decoding a grid square first appeared on John D. Cook.  ( 2 min )
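For anyone who wants to check the hand computation, the standard Maidenhead arithmetic fits in a few lines of Python (a sketch handling the 6-character case only; it returns the subsquare center):

```python
def maidenhead_to_latlon(locator):
    """Decode a 6-character Maidenhead locator (e.g. 'EL29fx') to the
    latitude/longitude of the subsquare's center."""
    loc = locator.strip()
    lon = (ord(loc[0].upper()) - ord("A")) * 20 - 180   # 20-degree fields
    lat = (ord(loc[1].upper()) - ord("A")) * 10 - 90    # 10-degree fields
    lon += int(loc[2]) * 2                              # 2-degree squares
    lat += int(loc[3]) * 1                              # 1-degree squares
    lon += (ord(loc[4].lower()) - ord("a")) * (2 / 24)  # 1/12-degree subsquares
    lat += (ord(loc[5].lower()) - ord("a")) * (1 / 24)  # 1/24-degree subsquares
    return lat + 1 / 48, lon + 1 / 24                   # shift to center

print(maidenhead_to_latlon("EL29fx"))  # ~(29.979, -95.542): near Houston, TX
```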
    Exponential of a line
    Draw a line in the complex plane. What is the image of that line when you apply the exponential function? A line through w with direction z is the set of points w + tz where w and z are complex and t ranges over the real numbers. The image of this line is exp(w+ […] Exponential of a line first appeared on John D. Cook.  ( 1 min )
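For readers who want the computation behind this: a sketch, writing z = a + bi and expanding, consistent with the parametrization above.

```latex
% Image of the line t \mapsto w + tz under exp, with z = a + bi:
e^{w + tz} = e^{w}\, e^{ta}\bigl(\cos(tb) + i\,\sin(tb)\bigr).
% The modulus grows like e^{ta} while the argument advances linearly in t:
% a logarithmic spiral, scaled and rotated by the constant e^{w}
% (degenerating to a ray when b = 0 and to a circle when a = 0).
```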
    Discrete derivatives
    If you’ve taken calculus, and someone asks you what the derivative of x5 is, you can say without hesitation that it’s 5x4. Now suppose they come back and say, “I’m sorry. I forgot to give you any context. Here x5 is a polynomial in the field of 343 elements.” It turns out that this additional […] Discrete derivatives first appeared on John D. Cook.  ( 2 min )
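The point of the example, sketched as a formula: the formal derivative rule is unchanged, but the coefficients now live in characteristic 7, so some familiar calculus facts fail.

```latex
% Formal derivative over F_{343} = F_{7^3} (characteristic 7):
\frac{d}{dx}\, x^{5} = 5x^{4}, \qquad \frac{d}{dx}\, x^{7} = 7x^{6} = 0,
% so a nonconstant polynomial such as x^7 has an identically zero derivative.
```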
  • Open

    Monkey Patching Python Code
    Python is a dynamic scripting language. Not only does it have a dynamic type system where a variable can be assigned to one type first and changed later, but its object model is also dynamic. This allows us to modify its behavior at run time. A consequence of this is the possibility of monkey patching. […] The post Monkey Patching Python Code appeared first on Machine Learning Mastery.  ( 13 min )
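A minimal sketch of the idea (all names illustrative, not from the post): replace a method on a class at run time while keeping a handle to the original.

```python
class Greeter:
    def greet(self):
        return "hello"

def shout(self):
    # Call the saved original, then modify its behavior.
    return self._original_greet().upper() + "!"

Greeter._original_greet = Greeter.greet  # keep a reference to the original
Greeter.greet = shout                    # monkey patch: affects all instances

print(Greeter().greet())  # prints "HELLO!"
```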
  • Open

    Unleash Your Dragon (Remastered 8K 60FPS)
    submitted by /u/stepanmetior [link] [comments]
  • Open

    Keeping Your Company’s Data Model IP
Robotic process automation (RPA) disappointed me when it first gained attention five or so years ago. Here was yet another stopgap measure, a digital form of tape and baling wire to temporarily join scripts of oft-repeated tasks in a specified workflow sequence across commonly used applications. RPA seemed quite brittle: wouldn't you have to regenerate the… Read More »Keeping Your Company's Data Model IP The post Keeping Your Company's Data Model IP appeared first on Data Science Central.  ( 5 min )

  • Open

    [R] A paper that contributes to the development of AGI by identifying how consciousness/cognition evolved in living processes
    A number of key challenges stand in the way of the development of HLAI and AGI. However, overcoming them has been handicapped by a lack of relevant progress in the human sciences: they have failed to produce clear understandings of how some key functions that are absent in current AI are enabled in humans. A paper just published in the journal BioSystems deals with some of these key issues: for example, it identifies at a cybernetic/systems level how consciousness, 'real-time' sensorimotor coordination, and mental modelling/cognition arose and developed in living organisms. These insights might prove useful in identifying how these functions could be instantiated and developed in AI. Titled 'The evolution and development of consciousness: the subject-object emergence hypothesis', the fu…  ( 2 min )
    [N] Introducing NGC-Learn: Predictive Coding and Neurobiologically-Motivated Learning in Python
    Interested in doing research in neurobiologically-inspired artificial neural networks? Need an open-source, actively maintained tool for reproducing the latest paper on predictive coding or building your own more biologically-faithful neural system? ngc-learn is a recently-released Python library designed in response to these questions. The ngc-learn dynamics simulator is specifically meant for building, simulating, and analyzing arbitrary predictive coding models based on the neural generative coding (NGC) computational framework and theoretically guided by the free energy principle. This toolkit, distributed under the 3-Clause BSD license, is built on top of Tensorflow 2. Notably, ngc-learn's extensible nodes-and-cables system is general and can even be used to build non-predictive cod…  ( 1 min )
    [D] How would one even maintain a generalist agent like Gato?
DeepMind just published a paper on Gato, a generalist agent that can perform 600+ tasks. I get that there's still a huge debate/discussion about AGI, HLAI, ethical concerns, etc., so such an agent is unlikely to be deployed in production soon, but let's entertain that idea for a second: how would one even maintain a model like that? If it performs well for all tasks except one, how would you retrain only for the one task where it underperformed? The model uses the same weights and biases for all the tasks, so even if you choose to freeze certain nodes, how would you go about doing that? submitted by /u/Lexayne [link] [comments]  ( 1 min )
    "[Project]" Brainchop: In-browser deep learning framework for volumetric Segmentation
https://preview.redd.it/udqgrwbzlo091.png?width=570&format=png&auto=webp&s=d856ab9334e6a0c5ae052c762fdfb8d6e22cfb61 Live demo: brainchop.org Brainchop is a client-side web application for automatic segmentation of MRI volumes that brings automatic volumetric segmentation capability to neuroimaging by running a robustly pre-trained deep learning model. The app does not require technical sophistication from the user and is designed for locally and privately segmenting the user's T1 volumes. Results of the segmentation may easily be saved locally after the computation. An intuitive interactive interface, requiring no special training or specific instructions to run, gives anyone with a modern browser (e.g. Firefox, Chrome) and commonly available hardware access to state-of-the-art deep learning brain segmentation. Additionally, we make the implementation of Brainchop freely available, releasing its pure JavaScript code as open source. https://preview.redd.it/q9q0zq3wlo091.png?width=3333&format=png&auto=webp&s=0110de38b24890373f4640166c53de508969ded0 submitted by /u/Character-Rip-5824 [link] [comments]  ( 1 min )
What libs/boilerplate/platforms do you use to abstract and optimize your workflow when starting a new project? [D]
E.g., new data project, currently investigating the data. I'm joining multiple data sources to create a standard data object represented by X number of features. Now I want to get a sense of the distribution and skew of each feature (basic stat analysis), do some elementary clustering, and be able to repeat the process as I add, change, or remove features. I can write abstractions in code to do these things, such as a function that plots the frequency of values for a specific feature and gives mean, mode, median, etc., but I'm wondering if there are libraries or platforms you all use to avoid this tediousness. I understand that may sacrifice the customizability of writing your own code, but I'm interested in such things to get a quick sense of the metadata. submitted by /u/gravbeamemitter [link] [comments]  ( 1 min )
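As a baseline before reaching for a platform, the per-feature summary described above is only a few lines of pandas; the column names below are hypothetical. Libraries such as pandas-profiling (now ydata-profiling) automate a fuller version of this report.

```python
import pandas as pd

# Toy stand-in for the joined data object described in the post.
df = pd.DataFrame({"feature_a": [1, 2, 2, 3, 10],
                   "feature_b": [0.1, 0.2, 0.2, 0.5, 0.9]})

# Basic stats (mean, std, quartiles) plus skew and mode, per feature.
summary = df.describe().T
summary["skew"] = df.skew()
summary["mode"] = df.mode().iloc[0]
print(summary)

# Value frequencies for a single feature; trivially re-runnable as
# features are added, changed, or removed.
print(df["feature_a"].value_counts())
```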
    [P] SSO (Single Sign-On) for CVAT, the computer vision annotation tool
For those who are interested in using CVAT with SSO: previously I made a proof-of-concept video to demonstrate my SSO implementation for CVAT: https://www.youtube.com/watch?v=R7hBBLG5Fdc Now I'm happy to announce that I have submitted my code changes: https://github.com/AlexGaoDW/cvat/tree/feature/datawiza-sso And I've created a PR to get it into the official repo. You can try it out yourself following the documentation here: https://docs.datawiza.com/guides/cvat.html I also set up an instance using Google as the identity provider so that you can try the SSO functionality with your Google account: https://cvat-sso.datawiza.net/ Enjoy! submitted by /u/alexcgg1 [link] [comments]  ( 1 min )
    [R] Proof Of Useful Work For AI Training On The Blockchain
    submitted by /u/EducationalCicada [link] [comments]  ( 2 min )
    [D] IEEE Access Article accepted as the first author, what's next?
Dear members, this is my very first experience submitting to IEEE Access as a first author. My article was accepted after I submitted minor edits with a rebuttal. However, in my acceptance email, the reviewer asks me to incorporate his/her comments, including revamping the sections, adding elements to figures, giving more explanations, and adding references. The same email also states that "you are not permitted to add or remove authors or references post-acceptance" and that I should "take this opportunity to improve the English grammar and check spelling, as the article is only lightly edited before publication". This has become a very confusing situation for me. Anyone with experience publishing in this journal, please offer guidance. submitted by /u/HQ2020 [link] [comments]  ( 2 min )
    [P] Add conditional node to random forest?
Hi all, I should preface this by saying that I am pretty new to ML, so pardon me if I say anything wrong. I am wondering if the following would be possible: a modified semi-supervised random forest, with a conditional node added as a positive branch to all decision trees, which controls for certain features? Basically, I would like the algorithm to check the quantitative level of a feature (dependent variable) in samples positive for the said feature (independent variable). The algorithm has to set a cutoff for the quantitative variable only if the categorical variable is TRUE (as is the case when the quantitative variable is != 0). Does that make sense? How can I add such a conditional statement? I mainly work in R using randomForest, but I am also a Python user, so I'm open to trying scikit-learn. Thanks submitted by /u/ElMochiKris [link] [comments]  ( 1 min )
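One hedged suggestion, rather than modifying the trees themselves: encode the condition as features and let the forest learn the conditional split on its own. The trees naturally split on the categorical flag first and then find a cutoff within the positive branch. A toy scikit-learn sketch with made-up data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200

# Quantitative measurement that is zero in "negative" samples.
quant = rng.exponential(1.0, n) * rng.integers(0, 2, n)
cat = (quant != 0).astype(int)  # categorical flag: TRUE iff quant != 0

# Mask the quantitative value with a sentinel when the flag is FALSE,
# so a cutoff is only meaningful inside the positive branch.
X = np.column_stack([cat, np.where(cat == 1, quant, -1.0)])
y = (quant > 1.0).astype(int)  # toy target for illustration only

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```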
    [D] Mining restaurant reviews for vibe
tl;dr: I recently procured 1M+ restaurant reviews. Now I'd like to see if I can mine any hidden and interesting information from them to build a search system. Looking for advice. If you were doing this project - "determine the vibe of a restaurant based on the reviews; allow a user to search based on expected loudness at the venue" - how would you approach it? Boring text about what I've tried: so far, I've tried using top2vec on different granularities of the reviews. For example: (1) each document is all reviews for a restaurant concatenated together; (2) each document is a complete review; (3) each document is a paragraph; (4) each document is a single sentence. If I search for topics, I can certainly find some interesting ones. Here are the top 10 from the "complete review" model, for example: …  ( 3 min )
    [P] Extract relevant text from error logs
I'm trying to extract relevant text from error logs (example below), then classify them by error reason. I've been doing (but mostly failing at) this using the following process: (1) regex to filter out text between brackets and to remove dates, times, IPs, file paths, etc.; (2) remove stop words and keep only real words; (3) put the resulting string of words through a sentence transformer (all-mpnet-base-v2 in this case) to get embeddings; (4) reduce dimensionality and cluster; (5) within clusters, use the most frequent words (up to 5) as a label for the "error reason". This process is really brittle because of the regex and because there is so much variability in log structures. Furthermore, the "error reason" labels are usually nonsensical because the process doesn't properly extract the relevant text. For the example below, I get "failed_run_coverage_statements_panic", when I'd hope to get something relating to timing out. Does anyone have any ideas of how I might approach this? Example error log: Failed === RUN GoJobSubscriber {"level":"info","ts":1649273105.2566006,"caller":"rabbitmq/connection.go:20","msg":"client successfully connected to RMQ"} {"level":"info","ts":1649273105.2577667,"caller":"rabbitmq/connection.go:28","msg":"client successfully created channel"} {"level":"info","ts":1649273105.258319,"caller":"rabbitmq/connection.go:34","msg":"client successfully configured QOS"} {"level":"info","ts":1649273105.2584598,"caller":"rabbitmq/connection.go:45","msg":"client successfully assigned dependencies"} {"level":"info","ts":1649273105.2894936,"caller":"rabbitmq/connection.go:52","msg":"client successfully ensured topology"} {"level":"info","ts":1649273105.2906847,"caller":"rabbitmq/subscription.go:46","msg":"Starting event subscriber...."} coverage: 55.8% of statements panic: test timed out after 2m0s goroutine 11 [running]: testing.(*M).startAlarm.func1() submitted by /u/iblis3 [link] [comments]  ( 1 min )
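For what it's worth, a minimal sketch of the embed-and-cluster portion of such a pipeline is below; the normalization rules are illustrative stand-ins for the regex stage, and the toy logs are made up. One way to simplify the "relevant text" problem is to keep only lines matching error keywords (e.g. `panic:`, `error`, `fatal`) before embedding, rather than cleaning the whole log.

```python
import re
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

logs = [
    "panic: test timed out after 2m0s",
    "client failed to connect to RMQ: connection refused",
    "panic: test timed out after 5m0s",
]

def normalize(line: str) -> str:
    """Replace volatile tokens so structurally identical errors embed alike."""
    line = re.sub(r"\d+(\.\d+)?", "<NUM>", line)  # numbers, timestamps
    line = re.sub(r"\S+/\S+", "<PATH>", line)     # file paths
    return line

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode([normalize(l) for l in logs])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # timeouts should land in one cluster, connection errors in the other
```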
    [D] Do you use any tools to automate the data cleaning process?
Data cleaning is very tedious. Are there solutions that actually solve the data cleaning problem, or do we have to go through it manually because that helps us learn about the data? submitted by /u/Speech_titan [link] [comments]  ( 1 min )
    [2205.08957] Meta-Learning Sparse Compression Networks
    submitted by /u/Forsaken_Scientist [link] [comments]
    [D] Thinking about buying a Jetson Nano and Coral Stick for ML
Currently I use Google Colab and Kaggle kernels several hours a day to train models. But given the limitations of the free tiers, I'm asking myself whether it's worth buying a Jetson Nano 4GB and using it together with a Coral stick as an alternative, or whether buying Colab Pro is better. Most of the time I use TensorFlow, mostly for image/video classification and object detection. Is it worth buying, or has anyone had similar suggestions or experiences? submitted by /u/Stevy1981 [link] [comments]  ( 1 min )
    [D] Google Colab equivalent alternative to R / Rstudio
I want to use a GPU to run some R code. I know that Google Colab is a good option when using Python and allows access to a GPU - is there an equivalent alternative for R/RStudio? What about RStudio Server? What are the advantages of using RStudio Server when it runs on the machine's own CPU anyway? submitted by /u/triary95 [link] [comments]  ( 1 min )
    [P] Reviewing Marketing Mix Model - concerned about a few issues...
Hi, everyone. I am reviewing a marketing mix model built by someone before I started the job. They have a model which has been claimed to give accurate attribution across different media channels and other variables. It gives a pretty good result (R2 = 0.95), I admit. But I was concerned by the way the long-term effect of advertising has been worked into the model. The current model uses the sum of brand investment with a 95% carryover effect to account for the 'long-term effect', while also using the same brand investment figure to account for the 'short-term effect'. I pointed out that this would raise multicollinearity issues, but was told that this isn't a problem because the "modeling was done with an RF". Is there a way to better explain my concern? submitted by /u/SuperSodori [link] [comments]  ( 1 min )
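One concrete way to make the concern tangible is to compute variance inflation factors for the two brand terms. With simulated, persistent spend (media plans are rarely white noise), the VIF blows up. This is a toy illustration with made-up data, not the reviewed model:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 104  # two years of weekly data, purely illustrative

# Simulated, persistent brand spend (AR(1) process).
brand = np.zeros(n)
for t in range(1, n):
    brand[t] = 0.9 * brand[t - 1] + rng.normal()

# Adstock with 95% carryover, built from the *same* raw investment series.
adstock = np.zeros(n)
adstock[0] = brand[0]
for t in range(1, n):
    adstock[t] = brand[t] + 0.95 * adstock[t - 1]

X = pd.DataFrame({"brand_short": brand, "brand_long": adstock})
for i, col in enumerate(X.columns):
    print(col, round(variance_inflation_factor(X.values, i), 1))
```

Note also that a random forest does not remove collinearity, it only hides it: predictions may be fine, but attribution across near-duplicate predictors, which is the whole point of an MMM, remains unstable.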
    [D] Current use of word2vec in 2022
We can all agree that word2vec was revolutionary in the field of NLP. My question is: what is the current state of this technique? Is it mainly used for educational purposes, to teach students about the building blocks of word embeddings? Is it used in other fields, such as studying the evolution of languages through historical texts? Is it used for specific tasks, such as only syntactic similarities, or only semantic similarities? ... submitted by /u/LanverYT [link] [comments]  ( 2 min )
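For reference, word2vec is still a few lines away via gensim, which is one reason it persists for lightweight, domain-specific embeddings where transformer-scale models are overkill. A minimal sketch on a toy corpus:

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real use cases train on millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["dogs", "bark", "at", "strangers"],
]

# Skip-gram (sg=1) with small vectors; epochs inflated for the toy corpus.
model = Word2Vec(sentences, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("king", topn=3))
```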
[D] Is Marketing Mix Modelling full of crap?
I have spent the last year working as an analyst for a large media company and have been primarily occupied with MMM projects. I should clarify that I had never previously worked in or studied economics; I come from a natural sciences background with a focus on statistics. Since I have been at this company, I have been struggling so much, for several different reasons. One of these reasons is that I feel like the work we're doing is complete and utter bullshit! I am trying to figure out if the problem is me (which I am sure some of it is) or if this is a common phenomenon. Every time I start modelling I am super enthusiastic about it and determined that this time it'll all make sense, and every time I end up exasperated and ready to give up, quit my job and never come back. I feel like the models we are building are so full of crap, and simply aimed at justifying our client's expectations in order to make them happy - we always end up using mad coefficients and breakdowns of variables just so that they are what they "should" be. I consider myself to be very analytical and to have good problem-solving skills - I have been praised by my managers and given really good feedback for exceeding expectations and all that jazz. HOWEVER, I constantly feel like an utter failure, because I spend so much time trying to make sense of these models, to the point where I exhaust myself, give up, and then just do what I feel everyone else does - manipulate the metrics so that it works our way - and this kills my soul every single time. Is this something that is widely happening in the industry, and am I being too idealistic/perfectionist? Or am I seriously lacking some training (which I wouldn't say I got much of), and what can I do to improve? P.S. I am at the point where I have almost given up and I am close to leaving my job. I have run my mental health down so much, to the point of burnout, so any advice would be very much appreciated! submitted by /u/necplorer [link] [comments]  ( 6 min )
    [P] Training to read PDF documents. Any ideas?
Currently I use regex to identify patterns, coupled with an OCR library (Tesseract), to import PDF data into dataframes. It's a bit of a hit-and-miss process: every new exception has to be built into the regex. I was wondering if machine learning could take over the job? One way I can think of going about it is marking the positions of data in the OCR output to be read into their respective columns, but there might be a better way. Has anybody done this? submitted by /u/card_chase [link] [comments]  ( 4 min )
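A hedged sketch of a middle step between regex and a full ML model: OCR with word-level positions, which yields exactly the labeled-region data an ML approach (e.g. a LayoutLM-style token classifier) would train on. The file name is hypothetical, and this assumes pytesseract and pdf2image are installed:

```python
import pytesseract
from pdf2image import convert_from_path

# Convert PDF pages to images, then OCR with word-level bounding boxes.
pages = convert_from_path("invoice.pdf", dpi=300)  # hypothetical file
words = pytesseract.image_to_data(
    pages[0], output_type=pytesseract.Output.DATAFRAME
)
words = words[words.conf.astype(float) > 50]  # drop low-confidence OCR noise

# With text plus positions available, regions can be labeled once and a
# token-classification model trained, instead of growing the regex forever.
print(words[["text", "left", "top", "width", "height"]].head())
```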
    [R] AI Research Rankings 2022: Sputnik Moment for China?
    Hey Reddit! A heated debate is going on today on the state of the strategic race between the United States and China to dominate in AI. I decided to gather some facts and analyzed publications at ICML 2021 and NeurIPS 2021. Here are the findings -- would love to hear what you think! 🤝❤️🤖 https://thundermark.medium.com/ai-research-rankings-2022-sputnik-moment-for-china-64b693386a4 submitted by /u/chersonesus [link] [comments]  ( 2 min )
  • Open

    Sim-2-real problem regarding system delay
Suppose the goal is to train an agent for a robot control policy, where the actions are current values that control the robot's joints. In the real system, however, there are system delays and communication delays, so applying the actions to the robot does not immediately result in motion - which it does in simulation (for instance in Isaac Gym, which I am using). As I have measured, the real system takes 250~300 ms to react to a given input and rotate its joints. Therefore, the control policy trained in the simulator, where the system delay is almost 0~15 ms, is no longer usable. What would be the approaches to overcome this sim-2-real problem in this case without identifying a model of the system? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
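One common approach (short of system identification) is to inject the measured latency into the simulator by buffering actions, ideally randomizing the delay across episodes as a form of domain randomization. A minimal gym-style wrapper sketch, assuming e.g. a 60 Hz control step where 250-300 ms is roughly 15-18 steps:

```python
from collections import deque
import gym

class ActionDelayWrapper(gym.Wrapper):
    """Applies actions only after `delay_steps` simulator steps,
    mimicking a measured actuation/communication latency."""

    def __init__(self, env, delay_steps: int):
        super().__init__(env)
        self.delay_steps = delay_steps
        self.buffer = deque(maxlen=delay_steps + 1)

    def reset(self, **kwargs):
        self.buffer.clear()
        # Pre-fill with neutral actions so early steps apply "no command".
        for _ in range(self.delay_steps):
            self.buffer.append(self.env.action_space.sample() * 0)
        return self.env.reset(**kwargs)

    def step(self, action):
        self.buffer.append(action)          # enqueue the new command
        delayed_action = self.buffer.popleft()  # apply the one from the past
        return self.env.step(delayed_action)
```

Appending the last few commanded actions to the observation is a complementary trick, so the policy can in principle learn to compensate for the in-flight commands.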
    Hierarchical Reinforcement Learning (HRL)
Hello, I intend to apply HRL, more specifically the option-critic architecture, with three options. However, I don't see much documentation on the internet. Do you have any implementation code I could take inspiration from (to see how it works)? submitted by /u/GuavaAgreeable208 [link] [comments]  ( 1 min )
    Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments
    submitted by /u/jkterry1 [link] [comments]
    Ideas to start writing papers
Hey there! I'm eager to start writing a paper. I have a background in AI, but I can't think of ideas for a topic to write about. How do you guys get ideas for papers? Is it a good idea to start by writing surveys? submitted by /u/ZazieIsInTheHouse [link] [comments]  ( 1 min )
    Channel policy on RL-related job opportunities
Hi all, I am aware of two job opportunities at my company in the area of experimentation and RL (mainly MAB- and cMAB-style research and application). I was curious what this channel's policy is on making people aware of these job openings, or if there is a better venue for that (Discord channels, etc.). Let me know! Thanks! submitted by /u/Helga-Helga [link] [comments]  ( 1 min )
    Let's build an Autonomous Taxi 🚖 using Q-Learning (Deep Reinforcement Learning Free Class by Hugging Face 🤗)
Hey there! I'm happy to announce that we just published the second unit of the Deep Reinforcement Learning Class 🥳 In this unit, we're going to dive deeper into one family of reinforcement learning methods, value-based methods, and study our first RL algorithm: Q-Learning. We'll also implement our first RL agent from scratch, a Q-Learning agent, train it in two environments, and share it with the community: Frozen-Lake-v1 ⛄ (non-slippery version), where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H); and an autonomous taxi 🚕, which will need to learn to navigate a city to transport its passengers from point A to point B. You'll be able to compare the results of your Q-Learning agent using the leaderboard 🏆 1️⃣ The introduction to Q-learning part 1 article 👉 https://huggingface.co/blog/deep-rl-q-part1 2️⃣ The introduction to Q-learning part 2 article 👉 https://huggingface.co/blog/deep-rl-q-part2 3️⃣ The hands-on 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit2/unit2.ipynb 4️⃣ The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard If you have questions and feedback, I would love to answer! submitted by /u/cranthir_ [link] [comments]  ( 1 min )
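For readers skimming the digest, the heart of the unit is the tabular Q-learning update. A minimal sketch on Taxi-v3 is below, written against the classic Gym API (newer Gym/Gymnasium versions return `(obs, info)` from `reset` and a 5-tuple from `step`):

```python
import numpy as np
import gym

env = gym.make("Taxi-v3")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, info = env.step(action)
        # Q-learning TD update: bootstrap on the greedy next-state value.
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
```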
    Any recommended paper on using Deep RL on Drones
Hi, I am a beginner in applying deep RL to robots (especially drones). I have tried RL on OpenAI Gym environments like CartPole, Breakout and BipedalWalker, but I have never applied it to an actual robot. Now I want to apply DDPG to a drone (in simulation), and I have failed miserably several times. Is there any paper that explains this in detail? Any other resources are also welcome. submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
    What is offline reinforcement learning?
Offline Reinforcement Learning (RL), also known as Batch Reinforcement Learning, is a variant of RL that effectively leverages large, previously collected datasets for large-scale real-world applications. The use of static datasets means that during the training process the agent does not perform any form of online interaction and exploration, which is the most significant difference from online reinforcement learning methods. For convenience, we refer to non-offline reinforcement learning, including both on-policy and off-policy RL, as online reinforcement learning (Online RL) in the following sections. [Figure: illustration of three classic online reinforcement learning modes.] In the figure, (a) stands for on-policy RL, where the agent uses the current policy πk to interact…  ( 8 min )
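As a concrete anchor, the simplest offline baseline is behavior cloning: supervised learning on logged (state, action) pairs with no environment interaction at all. It ignores rewards and suffers from distribution shift, which is what dedicated offline RL methods (e.g. conservative approaches like CQL) are designed to address; the shapes below are illustrative.

```python
import torch
import torch.nn as nn

# A static dataset of (state, action) pairs logged by some behavior policy.
states = torch.randn(10000, 8)            # 8-dim states, illustrative
actions = torch.randint(0, 4, (10000,))   # 4 discrete actions, illustrative

# Behavior cloning: pure supervised learning on logged actions; the agent
# never touches the environment during training.
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    loss = loss_fn(policy(states), actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
```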
    Looking for project mates
    I'm trying to reimplement the paper High-Throughput Synchronous Deep RL using Ray. Is anyone interested in working on this together? submitted by /u/SirRantcelot [link] [comments]  ( 1 min )
  • Open

    Excuse the language, but why the fuck is there nothing accessible to the public right now that can generate genuinely comprehensible images like Dalle 2?
Forgive me for sounding irritated, but is there absolutely nothing I can use right now, be it a website, program, mobile app, whatever, that I can just pump some cool stuff out of? Why are we all being shown this amazing technology only to be told "Yo this sick tech exists, but you ain't fuckin using it lol", except for a few select people (e.g. MKBHD's access to DALL-E 2 for a day)? I'm not talking about the likes of NightCafe or Wombo Dream - I've used them to death and had nothing but, quite frankly, terrible results. I want something where I can pop in terms such as, I don't know, "drift car Nissan Silvia" and get a picture of an actual car for album art etc., or something along those lines, if not just to admire how crazy AI is becoming. Look, if this stuff existed but was kept completely secret, then the whole 'what you don't know can't hurt you' idea would apply and I would not be pissed off. What is frustrating is that all this wizardry is being flaunted and dangled in our faces whilst also being kept out of reach. So my initial question still applies: is there anything remotely available right now that can generate somewhat comparable images? Thanks! submitted by /u/Michael_Goodwin [link] [comments]  ( 3 min )
    TPI - an open-source tool that is the first to train machine learning models on any cloud using Terraform solution
Terraform Provider Iterative (TPI) is an open-source tool that simplifies ML training on any cloud using Terraform, helping infrastructure and ML team members save significant time and money on maintaining and configuring training infrastructure. Terraform provides a flexible CLI for managing hundreds of cloud services, and TPI enables data scientists to delegate infrastructure responsibilities without having to learn new software. submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 1 min )
Rephrasing an English paragraph using AI - Writing is all about culture
    submitted by /u/data-gig [link] [comments]
    OpenAI's DALL-E 2 is pretty compliant - but who is responsible anyway?
    submitted by /u/much_successes [link] [comments]  ( 1 min )
A Roadmap Overview for a Career in AI
According to the World Economic Forum, over 97 million new jobs may be created by 2025, largely due to the application of AI in different fields. AI is one of the century's most important and disruptive technological achievements. Here is a roadmap overview for a career in AI. submitted by /u/artiba-AI [link] [comments]
This article showcases a tutorial on fine-tuning LayoutLMv2 for invoice recognition, covering everything from annotation to training and inference.
    submitted by /u/UBIAI [link] [comments]
    Automated CV Pipelines | Box to Polygon
    The 5th episode of the webinar series on Automated CV Pipelines is coming up! It will be covering automatic instance segmentation and methods to streamline the annotation process. If you're interested, you can register here! submitted by /u/WeekendClassic [link] [comments]
    Google AI Plans for 2030
When someone runs a search query on Google, it leaves a carbon footprint in the atmosphere. Google's AI plans for 2030 have been revealed, discussing the goals Google has set to lower its carbon footprint and how it will allow partners to play their part in guaranteeing environmental sustainability. Read more submitted by /u/ridamughal110 [link] [comments]
    Top AI Investment Opportunities In 2021
There are multiple reasons companies are finding AI investment opportunities that will be beneficial for them in 2021. Investment in AI will help a wide range of organizations get through the economic crisis as they emerge from the pandemic. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    How Air Traffic Can Be Optimized Using Artificial Intelligence
    With the fast-growing and high-density global air traffic, ensuring efficiency and air transportation safety becomes a critical challenge. AI is already revolutionizing the way air traffic management systems are manufactured and hence is believed to play a key role in optimizing air traffic flow. Read more submitted by /u/ridamughal110 [link] [comments]
    Is the advancement of AI in accordance with the best long-term interests of Humanity?
    AI advancements in digital technology are growing, and today we have far more technical capability than we had in the 90s, with the potential to expand even quicker in the future. Is, however, the continuance of AI growth in the best interests of humanity? Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Is artificial intelligence the most powerful thing that humans have ever created?
    Artificial Intelligence is one of the most powerful things humans have been working on for decades, and its limitless magical spells are altering our lives. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Build ML with zero data, training or setup with Humingbird!
    submitted by /u/holamyeung [link] [comments]  ( 2 min )
    Where can I best get OPT 175B to run?
I know I sound like a douche. I got access to the OPT 175B model for my research, but my university's GPU capabilities aren't sufficient. Usually I train my LLMs on two local 50GB GPUs; that doesn't seem to work now - so what would you recommend? submitted by /u/Trick_Brain [link] [comments]  ( 1 min )
    Reinforcement learning companies
Are there any companies (startups are fine too) doing work in reinforcement learning - more specifically in game NPCs/bots - that you know of? submitted by /u/Nice_Working [link] [comments]
    Please help me out in my research.
I used the glob function, but instead of getting 33 categories I only get one, so the output layer of the model is just 1. How could I fix it? Thank you guys in advance; all help will be appreciated. https://colab.research.google.com/drive/1NdpWmOVbqV2EUJ50vLj7nhWdWWHtbY_8?usp=sharing submitted by /u/Vector-Desperandum24 [link] [comments]  ( 1 min )
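Without seeing the notebook it's hard to be sure, but a common cause of "one category instead of 33" is a glob pattern that matches a single path instead of one directory per class. A hypothetical sketch of the fix, with made-up paths:

```python
import glob
import os

# If the dataset has one folder per category, glob the directories with a
# wildcard -- a pattern without one returns a single match.
class_dirs = sorted(glob.glob("dataset/train/*/"))
class_names = [os.path.basename(d.rstrip("/")) for d in class_dirs]
print(len(class_names), "classes found:", class_names[:5])

# The output layer should then match the number of discovered classes,
# e.g. Dense(len(class_names), activation="softmax") in Keras.
```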
  • Open

    5 Ways to Optimize Database Performance
Database performance tuning allows developers and database administrators to enhance system resources for lasting performance improvements. Databases are like the central nervous system of an application: they are responsible for the organization and function of critical processes, and even minor database-related performance issues can impact the entire operation. Locating issues in the databases… Read More » The post 5 Ways to Optimize Database Performance appeared first on Data Science Central.  ( 6 min )
  • Open

    AI Predicts Volcano Eruptions + Rapidly Detect Earthquakes | AI Robot Autonomously Reduces Jellyfish Overpopulation | AI Detects Pancreatic Cancer
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    Advice
Hello, I need help with finding the weights on a couple of outputs, but I don't fully understand how it works. Could anyone help me? It's for educational purposes; I just want advice and help, not to have the work done for me. submitted by /u/106steve [link] [comments]
  • Open

    What is Extended Reality?
Advances in extended reality have already changed the way we work, live and play, and it's just getting started. Extended reality, or XR, is an umbrella category that covers a spectrum of newer, immersive technologies, including virtual reality, augmented reality and mixed reality. From gaming to virtual production to product design, XR has enabled people… Read article > The post What is Extended Reality? appeared first on NVIDIA Blog.  ( 5 min )
    From Cloud to Car: How NIO Develops Intelligent Vehicles on NVIDIA HGX
    Building next-generation intelligent vehicles requires an AI infrastructure that pushes the cutting edge. Electric vehicle maker NIO is using NVIDIA HGX to build a comprehensive data center infrastructure for developing AI-powered, software-defined vehicles. With high-performance compute, the automaker can continuously iterate on sophisticated deep learning models, creating robust autonomous driving algorithms in a closed-loop environment. Read article > The post From Cloud to Car: How NIO Develops Intelligent Vehicles on NVIDIA HGX appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    ML for Algorithmic Trading, with Stefan Jansen
    Listen to this episode on Anchor FM. In this episode of the DATAcated podcast, host Kate Strachnyi talks with Stefan Jansen about machine… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 3 min )
    Data Science Project Vs Machine Learning Project: Python Data Science Complete Course — Part2
    Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 5 min )
  • Open

    The Berkeley Crossword Solver
    We recently built the Berkeley Crossword Solver (BCS), the first computer program to beat every human competitor in the world’s top crossword tournament. The BCS combines neural question answering and probabilistic inference to achieve near-perfect performance on most American-style crossword puzzles, like the one shown below: Crosswords are challenging for humans and computers alike. Many clues are vague or underspecified and can’t be answered until crossing constraints are taken into account. While some clues are similar to factoid question answering, others require relational reasoning or understanding difficult wordplay. Here are a handful of example clues from our dataset (answers at the bottom of this post): They’re given out at Berkeley’s HAAS School (4) Winter hrs. in Berkeley (3) …  ( 3 min )
  • Open

    Artificial intelligence predicts patients’ race from their medical images
    Study shows AI can identify self-reported race from medical images that contain no indications of race detectable by human experts.  ( 7 min )
  • Open

    Design choice and machine learning model performances. (arXiv:2201.10239v2 [stat.ML] UPDATED)
    An increasing number of publications present the joint application of Design of Experiments (DOE) and machine learning (ML) as a methodology to collect and analyze data on a specific industrial phenomenon. However, the literature shows that the choice of the design for data collection and model for data analysis is often not driven by statistical or algorithmic advantages, thus there is a lack of studies which provide guidelines on what designs and ML models to jointly use for data collection and analysis. This article discusses the choice of design in relation to the ML model performances. A study is conducted that considers 12 experimental designs, 7 families of predictive models, 7 test functions that emulate physical processes, and 8 noise settings, both homoscedastic and heteroscedastic. The results of the research can have an immediate impact on the work of practitioners, providing guidelines for practical applications of DOE and ML.
    Class-Aware Generative Adversarial Transformers for Medical Image Segmentation. (arXiv:2201.10737v3 [cs.CV] UPDATED)
    Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: (1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; (2) the models suffer from information loss because they only consider single-scale feature representations; and (3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CASTformer, a novel type of generative adversarial transformers, for 2D medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CASTformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining 2.54%-5.88% absolute improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CASTformer a strong starting point for downstream medical image analysis tasks.
    Universal Lower Bound for Learning Causal DAGs with Atomic Interventions. (arXiv:2111.05070v4 [cs.LG] UPDATED)
    A well-studied challenge that arises in the structure learning problem of causal directed acyclic graphs (DAG) is that using observational data, one can only learn the graph up to a "Markov equivalence class" (MEC). The remaining undirected edges have to be oriented using interventions, which can be very expensive to perform in applications. Thus, the problem of minimizing the number of interventions needed to fully orient the MEC has received a lot of recent attention, and is also the focus of this work. Our first result is a new universal lower bound on the number of single-node interventions that any algorithm (whether active or passive) would need to perform in order to orient a given MEC. Our second result shows that this bound is, in fact, within a factor of two of the size of the smallest set of single-node interventions that can orient the MEC. Our lower bound is provably better than previously known lower bounds. Further, using simulations on synthetic graphs and by giving examples of special graph families, we show that our bound is often significantly better. To prove our lower bound, we develop the notion of clique-block shared-parents (CBSP) orderings, which are topological orderings of DAGs without v-structures and satisfy certain special properties. We also use the techniques developed here to extend our results to the setting of multi-node interventions.
    Unsupervised Learning of Rydberg Atom Array Phase Diagram with Siamese Neural Networks. (arXiv:2205.04051v2 [physics.comp-ph] UPDATED)
    We introduce an unsupervised machine learning method based on Siamese Neural Networks (SNN) to detect phase boundaries. This method is applied to Monte-Carlo simulations of Ising-type systems and Rydberg atom arrays. In both cases the SNN reveals phase boundaries consistent with prior research. The combination of leveraging the power of feed-forward neural networks, unsupervised learning and the ability to learn about multiple phases without knowing about their existence provides a powerful method to explore new and unknown phases of matter.
    SepTr: Separable Transformer for Audio Spectrogram Processing. (arXiv:2203.09581v2 [cs.CV] UPDATED)
    Following the successful application of vision transformers in multiple computer vision tasks, these models have drawn the attention of the signal processing community. This is because signals are often represented as spectrograms (e.g. through Discrete Fourier Transform) which can be directly provided as input to vision transformers. However, naively applying transformers to spectrograms is suboptimal. Since the axes represent distinct dimensions, i.e. frequency and time, we argue that a better approach is to separate the attention dedicated to each axis. To this end, we propose the Separable Transformer (SepTr), an architecture that employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval. We conduct experiments on three benchmark data sets, showing that our separable architecture outperforms conventional vision transformers and other state-of-the-art methods. Unlike standard transformers, SepTr linearly scales the number of trainable parameters with the input size, thus having a lower memory footprint. Our code is available as open source at https://github.com/ristea/septr.
    Trajectory Inference via Mean-field Langevin in Path Space. (arXiv:2205.07146v2 [math.OC] UPDATED)
    Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. arXiv:2102.09204, and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists in a family of point clouds (one per snapshot) coupled via Schr\"odinger bridges which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence at an exponential rate to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also present how to adapt the method to deal with mass variations, a useful extension when dealing with single cell RNA-sequencing data where cells can branch and die.
    Personalized Interventions for Online Moderation. (arXiv:2205.09462v1 [cs.SI])
    Current online moderation follows a one-size-fits-all approach, where each intervention is applied in the same way to all users. This naive approach is challenged by established socio-behavioral theories and by recent empirical results that showed the limited effectiveness of such interventions. We propose a paradigm-shift in online moderation by moving towards a personalized and user-centered approach. Our multidisciplinary vision combines state-of-the-art theories and practices in diverse fields such as computer science, sociology and psychology, to design personalized moderation interventions (PMIs). In outlining the path leading to the next-generation of moderation interventions, we also discuss the most prominent challenges introduced by such a disruptive change.
    Truncated tensor Schatten p-norm based approach for spatiotemporal traffic data imputation with complicated missing patterns. (arXiv:2205.09390v1 [stat.ML])
Rapid advances in sensing, wireless communication, cloud computing and data science have brought an unprecedented amount of data to assist transportation engineers and researchers in making better decisions. However, traffic data in reality often has corrupted or incomplete values due to detector and communication malfunctions. Data imputation is thus required to ensure the effectiveness of downstream data-driven applications. To this end, numerous tensor-based methods treating the imputation problem as low-rank tensor completion (LRTC) have been attempted in previous works. To tackle rank minimization, which is at the core of LRTC, most of the aforementioned methods utilize the tensor nuclear norm (NN) as a convex surrogate for the minimization. However, the over-relaxation issue in NN prevents it from achieving desirable performance in practice. In this paper, we define an innovative nonconvex truncated Schatten p-norm for tensors (TSpN) to approximate tensor rank and impute missing spatiotemporal traffic data under the LRTC framework. We model traffic data into a third-order tensor structure of (time intervals, locations (sensors), days) and introduce four complicated missing patterns, including random missing and three fiber-like missing cases according to the tensor mode-n fibers. Despite the nonconvexity of the objective function in our model, we derive the global optimal solutions by integrating the alternating direction method of multipliers (ADMM) with generalized soft-thresholding (GST). In addition, we design a truncation rate decay strategy to deal with varying missing rate scenarios. Comprehensive experiments are finally conducted using real-world spatiotemporal datasets, which demonstrate that the proposed LRTC-TSpN method performs well under various missing cases, meanwhile outperforming other SOTA tensor-based imputation models in almost all scenarios.
    Evolutionary latent space search for driving human portrait generation. (arXiv:2204.11887v2 [cs.CV] UPDATED)
    This article presents an evolutionary approach for synthetic human portraits generation based on the latent space exploration of a generative adversarial network. The idea is to produce different human face images very similar to a given target portrait. The approach applies StyleGAN2 for portrait generation and FaceNet for face similarity evaluation. The evolutionary search is based on exploring the real-coded latent space of StyleGAN2. The main results over both synthetic and real images indicate that the proposed approach generates accurate and diverse solutions, which represent realistic human portraits. The proposed research can contribute to improving the security of face recognition systems.
    Approximating Persistent Homology for Large Datasets. (arXiv:2204.09155v2 [stat.ML] UPDATED)
    Persistent homology is an important methodology from topological data analysis which adapts theory from algebraic topology to data settings and has been successfully implemented in many applications. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology is simply impossible to implement when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying smaller multiple subsamples from the large dataset. We show that the mean of the persistence diagrams of subsamples -- taken as a mean persistence measure computed from the subsamples -- is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties in the space of persistence diagrams together with random set theory to achieve our theoretical results for the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application of shape clustering on complex large-scale point cloud data.
    Causal Inference from Small High-dimensional Datasets. (arXiv:2205.09281v1 [cs.LG])
    Many methods have been proposed to estimate treatment effects with observational data. Often, the choice of the method considers the application's characteristics, such as type of treatment and outcome, confounding effect, and the complexity of the data. These methods implicitly assume that the sample size is large enough to train such models, especially the neural network-based estimators. What if this is not the case? In this work, we propose Causal-Batle, a methodology to estimate treatment effects in small high-dimensional datasets in the presence of another high-dimensional dataset in the same feature space. We adopt an approach that brings transfer learning techniques into causal inference. Our experiments show that such an approach helps to bring stability to neural network-based methods and improve the treatment effect estimates in small high-dimensional datasets.
    Attracting and Dispersing: A Simple Approach for Source-free Domain Adaptation. (arXiv:2205.04183v2 [cs.CV] UPDATED)
    We propose a simple but effective source-free domain adaptation (SFDA) method. Treating SFDA as an unsupervised clustering problem and following the intuition that local neighbors in feature space should have more similar predictions than other features, we propose to optimize an objective of prediction consistency. This objective encourages local neighborhood features in feature space to have similar predictions while features farther away in feature space have dissimilar predictions, leading to efficient feature clustering and cluster assignment simultaneously. For efficient training, we seek to optimize an upper-bound of the objective resulting in two simple terms. Furthermore, we relate popular existing methods in domain adaptation, source-free domain adaptation and contrastive learning via the perspective of discriminability and diversity. The experimental results prove the superiority of our method, and our method can be adopted as a simple but strong baseline for future research in SFDA. Our method can be also adapted to source-free open-set and partial-set DA which further shows the generalization ability of our method.
    High-resolution landscape-scale biomass mapping using a spatiotemporal patchwork of LiDAR coverages. (arXiv:2205.08530v1 [stat.AP] CROSS LISTED)
Estimating forest aboveground biomass at fine spatial scales has become increasingly important for greenhouse gas estimation, monitoring, and verification efforts to mitigate climate change. Airborne LiDAR continues to be a valuable source of remote sensing data for estimating aboveground biomass. However, airborne LiDAR collections may take place at local or regional scales covering irregular, non-contiguous footprints, resulting in a 'patchwork' of different landscape segments at different points in time. Here we addressed common obstacles, including selection of training data, investigation of regional or coverage-specific patterns in bias and error, map agreement, and model-based precision assessments at multiple scales. Three machine learning algorithms and an ensemble model were trained using field inventory data (FIA), airborne LiDAR, and topographic, climatic and cadastral geodata. Using strict selection criteria, 801 FIA plots were selected with co-located point clouds drawn from a patchwork of 17 leaf-off LiDAR coverages (2014-2019). Our ensemble model created 30m AGB prediction surfaces within a predictor-defined area of applicability (98% of LiDAR coverage), and the resulting AGB predictions were compared with FIA plot-level and areal estimates at multiple scales of aggregation. Our model was overall accurate (% RMSE 13-33%), had very low bias (MBE $\leq$ $\pm$5 Mg ha$^{-1}$), explained most field-observed variation (R$^2$ 0.74-0.93), and produced estimates that were both largely consistent with FIA's aggregate summaries (86% of estimates within the 95% CI) and precise when aggregated to arbitrary small areas (mean bootstrap standard error 0.37 Mg ha$^{-1}$). We share practical solutions to challenges faced when using spatiotemporal patchworks of LiDAR to meet growing needs for biomass prediction and mapping, with applications in carbon accounting and ecosystem stewardship.
    Multi-DNN Accelerators for Next-Generation AI Systems. (arXiv:2205.09376v1 [cs.AR])
    As the use of AI-powered applications widens across multiple domains, so do increase the computational demands. Primary driver of AI technology are the deep neural networks (DNNs). When focusing either on cloud-based systems that serve multiple AI queries from different users each with their own DNN model, or on mobile robots and smartphones employing pipelines of various models or parallel DNNs for the concurrent processing of multi-modal data, the next generation of AI systems will have multi-DNN workloads at their core. Large-scale deployment of AI services and integration across mobile and embedded systems require additional breakthroughs in the computer architecture front, with processors that can maintain high performance as the number of DNNs increases while meeting the quality-of-service requirements, giving rise to the topic of multi-DNN accelerator design.
    Simplifying Node Classification on Heterophilous Graphs with Compatible Label Propagation. (arXiv:2205.09389v1 [cs.LG])
    Graph Neural Networks (GNNs) have been predominant for graph learning tasks; however, recent studies showed that a well-known graph algorithm, Label Propagation (LP), combined with a shallow neural network can achieve comparable performance to GNNs in semi-supervised node classification on graphs with high homophily. In this paper, we show that this approach falls short on graphs with low homophily, where nodes often connect to the nodes of the opposite classes. To overcome this, we carefully design a combination of a base predictor with LP algorithm that enjoys a closed-form solution as well as convergence guarantees. Our algorithm first learns the class compatibility matrix and then aggregates label predictions using LP algorithm weighted by class compatibilities. On a wide variety of benchmarks, we show that our approach achieves the leading performance on graphs with various levels of homophily. Meanwhile, it has orders of magnitude fewer parameters and requires less execution time. Empirical evaluations demonstrate that simple adaptations of LP can be competitive in semi-supervised node classification in both homophily and heterophily regimes.
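For orientation, the classic label propagation iteration that the paper's compatibility-weighted variant builds on is only a few lines. This sketch is the standard homophilous LP of Zhou et al. (2004), not the paper's algorithm:

```python
import numpy as np

def label_propagation(A, Y, alpha=0.9, n_iter=50):
    """Classic label propagation: F <- alpha * S @ F + (1 - alpha) * Y,
    where S is the symmetrically normalized adjacency matrix."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ A @ D_inv_sqrt
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F  # rows: nodes, columns: per-class scores

# Toy path graph with 4 nodes; the two endpoints are labeled, one per class.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
print(label_propagation(A, Y).argmax(axis=1))
```

The paper's key change for heterophilous graphs is to weight this propagation by a learned class compatibility matrix, so labels can propagate across edges that connect different classes.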
    Interpretable Latent Variables in Deep State Space Models. (arXiv:2203.02057v2 [stat.ML] UPDATED)
We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focuses on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show through two public benchmark datasets that the resulting model improves forecasting performance.
    Towards Applicable Reinforcement Learning: Improving the Generalization and Sample Efficiency with Policy Ensemble. (arXiv:2205.09284v1 [cs.LG])
    It is challenging for reinforcement learning (RL) algorithms to succeed in real-world applications like financial trading and logistic system due to the noisy observation and environment shifting between training and evaluation. Thus, it requires both high sample efficiency and generalization for resolving real-world tasks. However, directly applying typical RL algorithms can lead to poor performance in such scenarios. Considering the great performance of ensemble methods on both accuracy and generalization in supervised learning (SL), we design a robust and applicable method named Ensemble Proximal Policy Optimization (EPPO), which learns ensemble policies in an end-to-end manner. Notably, EPPO combines each policy and the policy ensemble organically and optimizes both simultaneously. In addition, EPPO adopts a diversity enhancement regularization over the policy space which helps to generalize to unseen states and promotes exploration. We theoretically prove EPPO increases exploration efficacy, and through comprehensive experimental evaluations on various tasks, we demonstrate that EPPO achieves higher efficiency and is robust for real-world applications compared with vanilla policy optimization algorithms and other ensemble methods. Code and supplemental materials are available at https://seqml.github.io/eppo.
    FlowPool: Pooling Graph Representations with Wasserstein Gradient Flows. (arXiv:2112.09990v2 [cs.LG] UPDATED)
    In several machine learning tasks for graph structured data, the graphs under consideration may be composed of a varying number of nodes. Therefore, it is necessary to design pooling methods that aggregate the graph representations of varying size to representations of fixed size which can be used in downstream tasks, such as graph classification. Existing graph pooling methods offer no guarantee with regards to the similarity of a graph representation and its pooled version. In this work, we address this limitation by proposing FlowPool, a pooling method that optimally preserves the statistics of a graph representation to its pooled counterpart by minimising their Wasserstein distance. This is achieved by performing a Wasserstein gradient flow with respect to the pooled graph representation. Our method relies on a versatile implementation which can take into account the geometry of the representation space through any ground cost and computes the gradient of the Wasserstein distance with automatic differentiation. We propose the differentiation of the Wasserstein flow layer using an implicit differentiation scheme. Therefore, our pooling method is amenable to automatic differentiation and can be integrated in end-to-end deep learning architectures. Further, FlowPool is invariant to permutations and can therefore be combined with permutation equivariant feature extraction layers in GNNs in order to obtain predictions that are independent of the ordering of the nodes. Experimental results demonstrate that our method leads to an increase in performance compared to existing pooling methods when evaluated on graph classification.
    Time Series Generation with Masked Autoencoder. (arXiv:2201.07006v3 [cs.LG] UPDATED)
    This paper shows that masked autoencoder with extrapolator (ExtraMAE) is a scalable self-supervised model for time series generation. ExtraMAE randomly masks some patches of the original time series and learns temporal dynamics by recovering the masked patches. Our approach has two core designs. First, ExtraMAE is self-supervised. Supervision allows ExtraMAE to effectively and efficiently capture the temporal dynamics of the original time series. Second, ExtraMAE proposes an extrapolator to disentangle two jobs of the decoder: recovering latent representations and mapping them back into the feature space. These unique designs enable ExtraMAE to consistently and significantly outperform state-of-the-art (SoTA) benchmarks in time series generation. The lightweight architecture also makes ExtraMAE fast and scalable. ExtraMAE shows outstanding behavior in various downstream tasks such as time series classification, prediction, and imputation. As a self-supervised generative model, ExtraMAE allows explicit management of the synthetic data. We hope this paper will usher in a new era of time series generation with self-supervised models.
    Speeding Up Entmax. (arXiv:2111.06832v3 [cs.CL] UPDATED)
Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, because it produces a dense probability distribution, each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $\alpha$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this problem, but is considerably slower than softmax. In this paper, we propose an alternative to $\alpha$-entmax, which keeps its virtuous characteristics, but is as fast as optimized softmax and achieves on-par or better performance in the machine translation task.
    Theory of Acceleration of Decision Making by Correlated Time Sequences. (arXiv:2203.16004v3 [cs.LG] UPDATED)
    Photonic accelerators have been intensively studied to provide enhanced information processing capability to benefit from the unique attributes of physical processes. Recently, it has been reported that chaotically oscillating ultrafast time series from a laser, called laser chaos, provides the ability to solve multi-armed bandit (MAB) problems or decision-making problems at GHz order. Furthermore, it has been confirmed that the negatively correlated time-domain structure of laser chaos contributes to the acceleration of decision-making. However, the underlying mechanism of why decision-making is accelerated by correlated time series is unknown. In this paper, we demonstrate a theoretical model to account for the acceleration of decision-making by correlated time sequence. We first confirm the effectiveness of the negative autocorrelation inherent in time series for solving two-armed bandit problems using Fourier transform surrogate methods. We propose a theoretical model that concerns the correlated time series subjected to the decision-making system and the internal status of the system therein in a unified manner, inspired by correlated random walks. We demonstrate that the performance derived analytically by the theory agrees well with the numerical simulations, which confirms the validity of the proposed model and leads to optimal system design. The present study paves the new way for the effectiveness of correlated time series for decision-making, impacting artificial intelligence and other applications.
    Overcoming challenges in leveraging GANs for few-shot data augmentation. (arXiv:2203.16662v2 [stat.ML] UPDATED)
    In this paper, we explore the use of GAN-based few-shot data augmentation as a method to improve few-shot classification performance. We perform an exploration into how a GAN can be fine-tuned for such a task (one of which is in a class-incremental manner), as well as a rigorous empirical investigation into how well these models can perform to improve few-shot classification. We identify issues related to the difficulty of training such generative models under a purely supervised regime with very few examples, as well as issues regarding the evaluation protocols of existing works. We also find that in this regime, classification accuracy is highly sensitive to how the classes of the dataset are randomly split. Therefore, we propose a semi-supervised fine-tuning approach as a more pragmatic way forward to address these problems.
    Deep Dynamic Effective Connectivity Estimation from Multivariate Time Series. (arXiv:2202.02393v3 [cs.LG] UPDATED)
    Recently, methods that represent data as a graph, such as graph neural networks (GNNs) have been successfully used to learn data representations and structures to solve classification and link prediction problems. The applications of such methods are vast and diverse, but most of the current work relies on the assumption of a static graph. This assumption does not hold for many highly dynamic systems, where the underlying connectivity structure is non-stationary and is mostly unobserved. Using a static model in these situations may result in sub-optimal performance. In contrast, modeling changes in graph structure with time can provide information about the system whose applications go beyond classification. Most work of this type does not learn effective connectivity and focuses on cross-correlation between nodes to generate undirected graphs. An undirected graph is unable to capture direction of an interaction which is vital in many fields, including neuroscience. To bridge this gap, we developed dynamic effective connectivity estimation via neural network training (DECENNT), a novel model to learn an interpretable directed and dynamic graph induced by the downstream classification/prediction task. DECENNT outperforms state-of-the-art (SOTA) methods on five different tasks and infers interpretable task-specific dynamic graphs. The dynamic graphs inferred from functional neuroimaging data align well with the existing literature and provide additional information. Additionally, the temporal attention module of DECENNT identifies time-intervals crucial for predictive downstream task from multivariate time series data.
    Backdoor Detection in Reinforcement Learning. (arXiv:2202.03609v2 [cs.LG] UPDATED)
    As real-world applications of reinforcement learning (RL) become more common, the safety and robustness of RL systems require more attention. A recent work reveals that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. We propose the problem of RL Backdoor Detection, aiming to address this safety vulnerability. An interesting observation we drew from extensive empirical studies is a trigger smoothness property: normal actions similar to the backdoor trigger actions can also trigger low performance of the trojan agent. Inspired by this observation, we propose a reinforcement learning solution, TrojanSeeker, to find approximate trigger actions for the trojan agents, and further propose an efficient approach to mitigate the trojan agents based on machine unlearning. Experiments show that our approach can correctly distinguish and mitigate all the trojan agents across various types of agents and environments.
    STOPS: Short-Term-based Volatility-controlled Policy Search and its Global Convergence. (arXiv:2201.09857v4 [cs.LG] UPDATED)
    It remains challenging to deploy existing risk-averse approaches in real-world applications. The reasons are many, including the lack of global optimality guarantees and the necessity of learning from long-term consecutive trajectories. Long-term consecutive trajectories are prone to visiting hazardous states, which is a major concern in the risk-averse setting. This paper proposes Short-Term VOlatility-controlled Policy Search (STOPS), a novel algorithm that solves risk-averse problems by learning from short-term trajectories instead of long-term ones. Short-term trajectories are more flexible to generate and avoid the danger of hazardous state visitations. By using an actor-critic scheme with an overparameterized two-layer neural network, our algorithm finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient, matching the state-of-the-art convergence rate of risk-neutral policy-search methods. The algorithm is evaluated on challenging MuJoCo robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate that STOPS performs at a state-of-the-art level among existing risk-averse policy search methods.
    BR-NPA: A Non-Parametric High-Resolution Attention Model to improve the Interpretability of Attention. (arXiv:2106.02566v4 [cs.CV] UPDATED)
    The prevalence of attention mechanisms has raised concerns about the interpretability of attention distributions. Although attention provides insight into how a model operates, using it as the explanation of model predictions remains highly dubious. The community is still seeking more interpretable strategies for better identifying the local active regions that contribute most to the final decision. To improve the interpretability of existing attention models, we propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures task-relevant, human-interpretable information. The target model is first distilled to have higher-resolution intermediate feature maps. Representative features are then grouped based on local pairwise feature similarity to produce finer-grained, more precise attention maps highlighting task-relevant parts of the input. The obtained attention maps are ranked according to the activity level of the compound features, which indicates the importance of the highlighted regions. The proposed model can easily be adapted to a wide variety of modern deep models where classification is involved. Extensive quantitative and qualitative experiments showcase more comprehensive and accurate visual explanations compared to state-of-the-art attention models and visualization methods across multiple tasks, including fine-grained image classification, few-shot classification, and person re-identification, without compromising classification accuracy. The proposed visualization model sheds light on how neural networks `pay their attention' differently in different tasks.
    k-strip: A novel segmentation algorithm in k-space for the application of skull stripping. (arXiv:2205.09706v1 [eess.IV])
    Objectives: To present a novel deep learning-based skull stripping algorithm for magnetic resonance imaging (MRI) that works directly in the information-rich k-space. Materials and Methods: Using two datasets from different institutions with a total of 36,900 MRI slices, we trained a deep learning-based model to work directly with the complex raw k-space data. Skull stripping performed by HD-BET (Brain Extraction Tool) in the image domain was used as the ground truth. Results: Results on both datasets were very similar to the ground truth (DICE scores of 92\%-98\% and Hausdorff distances of under 5.5 mm). Results on slices above the eye region reach DICE scores of up to 99\%, while accuracy drops in the regions around and below the eyes, with partially blurred output. The output of k-strip often smoothed the edges at the demarcation to the skull. Binary masks are created with an appropriate threshold. Conclusion: With this proof-of-concept study, we were able to show the feasibility of working in the k-space frequency domain, preserving phase information, with consistent results. Future research should be dedicated to discovering additional ways the k-space can be used for innovative image analysis and further workflows.
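    For orientation, the mapping between image space and k-space is just a 2-D Fourier transform, so feeding a network complex raw k-space data amounts to working with the frequency-domain representation below; this is a minimal NumPy sketch with centered conventions, not the paper's pipeline.

    ```python
    import numpy as np

    def to_kspace(image):
        """Map an MRI slice from image space to complex k-space."""
        return np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(image)))

    def to_image(kspace):
        """Inverse mapping; the magnitude gives the reconstructed image."""
        return np.abs(np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(kspace))))
    ```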
    Challenges in Deploying Machine Learning: a Survey of Case Studies. (arXiv:2011.09926v3 [cs.LG] UPDATED)
    In recent years, machine learning has transitioned from a field of academic research interest to a field capable of solving real-world business problems. However, the deployment of machine learning models in production systems can present a number of issues and concerns. This survey reviews published reports of deploying machine learning solutions in a variety of use cases, industries and applications and extracts practical considerations corresponding to stages of the machine learning deployment workflow. By mapping found challenges to the steps of the machine learning deployment workflow we show that practitioners face issues at each stage of the deployment process. The goal of this paper is to lay out a research agenda to explore approaches addressing these challenges.
    Cross-modal Learning of Graph Representations using Radar Point Cloud for Long-Range Gesture Recognition. (arXiv:2203.17066v2 [eess.SP] UPDATED)
    Gesture recognition is one of the most intuitive ways of interaction and has gathered particular attention for human-computer interaction. Radar sensors possess multiple intrinsic properties, such as their ability to work in low illumination and harsh weather conditions while being low-cost and compact, making them highly preferable for a gesture recognition solution. However, most of the literature focuses on solutions with a limited range of under one meter. We propose a novel architecture for a long-range (1 m - 2 m) gesture recognition solution that leverages a point cloud-based cross-learning approach from camera point clouds to 60-GHz FMCW radar point clouds, which allows learning better representations while suppressing noise. We use a variant of Dynamic Graph CNN (DGCNN) for the cross-learning, enabling us to model relationships between the points at a local and global level; to model the temporal dynamics, a Bi-LSTM network is employed. In the experimental results, we demonstrate our model's overall accuracy of 98.4% for five gestures and its generalization capability.
    Heterogeneous Multi-task Learning with Expert Diversity. (arXiv:2106.10595v2 [cs.LG] UPDATED)
    Predicting multiple heterogeneous biological and medical targets is a challenge for traditional deep learning models. In contrast to single-task learning, in which a separate model is trained for each target, multi-task learning (MTL) optimizes a single model to predict multiple related targets simultaneously. To address this challenge, we propose the Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx). Our work aims to tackle the heterogeneous MTL setting, in which the same model optimizes multiple tasks with different characteristics. Such a scenario can overwhelm current MTL approaches due to the challenges in balancing shared and task-specific representations and the need to optimize tasks with competing optimization paths. Our method makes two key contributions: first, we introduce an approach to induce more diversity among experts, thus creating representations more suitable for highly imbalanced and heterogeneous MTL learning; second, we adopt a two-step optimization [6, 11] approach to balance the tasks at the gradient level. We validate our method on three MTL benchmark datasets, including Medical Information Mart for Intensive Care (MIMIC-III) and PubChem BioAssay (PCBA).
    On a class of data-driven mixed-integer programming problems under uncertainty: a distributionally robust approach. (arXiv:2105.14139v3 [math.OC] UPDATED)
    In this study we analyze linear mixed-integer programming problems in which the distribution of the cost vector is only observable through a finite training data set. In contrast to related studies, we assume that the number of random observations for each component of the cost vector may vary. The goal is then to find a prediction rule that converts the data set into an estimate of the expected value of the objective function, and a prescription rule that provides an associated estimate of the optimal decision. We aim at finding the least conservative prediction and prescription rules that satisfy specified asymptotic guarantees as the sample size tends to infinity. We demonstrate that under some mild assumptions the resulting vector optimization problems admit a Pareto optimal solution with attractive theoretical properties. In particular, this solution can be obtained by solving a distributionally robust optimization (DRO) problem with respect to all probability distributions with given component-wise relative entropy distances from the empirical marginal distributions. It turns out that the outlined DRO problem can be solved rather efficiently whenever there exists an efficient algorithm for the respective deterministic problem. In addition, we perform numerical experiments in which the out-of-sample performance of the proposed approach is analyzed.
    Scalable Multi-view Clustering with Graph Filtering. (arXiv:2205.09228v1 [cs.LG])
    With the explosive growth of multi-source data, multi-view clustering has attracted great attention in recent years. Most existing multi-view methods operate in the raw feature space and heavily depend on the quality of the original feature representation. Moreover, they are often designed for feature data and ignore rich topological structure information. Accordingly, in this paper, we propose a generic framework to cluster both attribute and graph data with heterogeneous features, capable of exploring the interplay between feature and structure. Specifically, we first adopt a graph filtering technique to eliminate high-frequency noise and achieve a clustering-friendly smooth representation. To handle the scalability challenge, we develop a novel sampling strategy to improve the quality of anchors. Extensive experiments on attribute and graph benchmarks demonstrate the superiority of our approach with respect to state-of-the-art approaches.
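    A minimal sketch of the kind of low-pass graph filter the first step refers to, here the common smoother (I - L/2)^k applied to a symmetrically normalized adjacency matrix; the paper's exact filter and its anchor-based sampling strategy are not reproduced.

    ```python
    import numpy as np

    def smooth_features(adj, X, k=2):
        """Low-pass graph filtering: repeatedly mix each node's features
        with its neighbors', suppressing high-frequency noise."""
        deg = adj.sum(axis=1).astype(float)
        d_inv_sqrt = np.zeros_like(deg)
        nz = deg > 0
        d_inv_sqrt[nz] = deg[nz] ** -0.5
        A_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
        L = np.eye(adj.shape[0]) - A_norm      # normalized Laplacian
        H = np.eye(adj.shape[0]) - 0.5 * L     # the filter I - L/2
        for _ in range(k):
            X = H @ X
        return X
    ```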
    Improving Robustness against Real-World and Worst-Case Distribution Shifts through Decision Region Quantification. (arXiv:2205.09619v1 [cs.LG])
    The reliability of neural networks is essential for their use in safety-critical applications. Existing approaches generally aim at improving the robustness of neural networks to either real-world distribution shifts (e.g., common corruptions and perturbations, spatial transformations, and natural adversarial examples) or worst-case distribution shifts (e.g., optimized adversarial examples). In this work, we propose the Decision Region Quantification (DRQ) algorithm to improve the robustness of any differentiable pre-trained model against both real-world and worst-case distribution shifts in the data. DRQ analyzes the robustness of local decision regions in the vicinity of a given data point to make more reliable predictions. We theoretically motivate the DRQ algorithm by showing that it effectively smooths spurious local extrema in the decision surface. Furthermore, we propose an implementation using targeted and untargeted adversarial attacks. An extensive empirical evaluation shows that DRQ increases the robustness of adversarially and non-adversarially trained models against real-world and worst-case distribution shifts on several computer vision benchmark datasets.
    Robust and Efficient Medical Imaging with Self-Supervision. (arXiv:2205.09723v1 [cs.CV])
    Recent progress in Medical Artificial Intelligence (AI) has delivered systems that can reach clinical expert-level performance. However, such systems tend to demonstrate sub-optimal "out-of-distribution" performance when evaluated in clinical settings different from the training environment. A common mitigation strategy is to develop separate systems for each clinical setting using site-specific data [1]. However, this quickly becomes impractical, as medical data is time-consuming to acquire and expensive to annotate [2]. Thus, the problem of "data-efficient generalization" presents an ongoing difficulty for Medical AI development. Although advances in representation learning show promise, their benefits have not been rigorously studied, specifically in out-of-distribution settings. To meet these challenges, we present REMEDIS, a unified representation learning strategy to improve the robustness and data-efficiency of medical imaging AI. REMEDIS uses a generic combination of large-scale supervised transfer learning with self-supervised learning and requires little task-specific customization. We study a diverse range of medical imaging tasks and simulate three realistic application scenarios using retrospective data. REMEDIS exhibits significantly improved in-distribution performance, with up to 11.5% relative improvement in diagnostic accuracy over a strong supervised baseline. More importantly, our strategy leads to strong data-efficient generalization of medical imaging AI, matching strong supervised baselines using between 1% and 33% of the retraining data across tasks. These results suggest that REMEDIS can significantly accelerate the life-cycle of medical imaging AI development, thereby presenting an important step forward for medical imaging AI to deliver broad impact.
    Overcoming Language Disparity in Online Content Classification with Multimodal Learning. (arXiv:2205.09744v1 [cs.LG])
    Advances in Natural Language Processing (NLP) have revolutionized the way researchers and practitioners address crucial societal problems. Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks. However, the development of advanced computational techniques and resources is disproportionately focused on the English language, sidelining a majority of the languages spoken globally. While existing research has developed better multilingual and monolingual language models to bridge this language disparity between English and non-English languages, we explore the promise of incorporating the information contained in images via multimodal machine learning. Our comparative analyses on three detection tasks focusing on crisis information, fake news, and emotion recognition, as well as five high-resource non-English languages, demonstrate that: (a) detection frameworks based on pre-trained large language models like BERT and multilingual-BERT systematically perform better on English than on non-English languages, and (b) including images via multimodal learning bridges this performance gap. We situate our findings with respect to existing work on the pitfalls of large language models, and discuss their theoretical and practical implications. Resources for this paper are available at https://multimodality-language-disparity.github.io/.
    A Causal Bandit Approach to Learning Good Atomic Interventions in Presence of Unobserved Confounders. (arXiv:2107.02772v2 [cs.LG] UPDATED)
    We study the problem of determining the best intervention in a Causal Bayesian Network (CBN) specified only by its causal graph. We model this as a stochastic multi-armed bandit (MAB) problem with side-information, where the interventions correspond to the arms of the bandit instance. First, we propose a simple regret minimization algorithm that takes as input a semi-Markovian causal graph with atomic interventions and possibly unobservable variables, and achieves $\tilde{O}(\sqrt{M/T})$ expected simple regret, where $M$ is dependent on the input CBN and could be very small compared to the number of arms. We also show that this is almost optimal for CBNs described by causal graphs having an $n$-ary tree structure. Our simple regret minimization results, both upper and lower bound, subsume previous results in the literature, which assumed additional structural restrictions on the input causal graph. In particular, our results indicate that the simple regret guarantee of our proposed algorithm can only be improved by considering more nuanced structural restrictions on the causal graph. Next, we propose a cumulative regret minimization algorithm that takes as input a general causal graph with all observable nodes and atomic interventions and performs better than the optimal MAB algorithm that does not take causal side-information into account. We also experimentally compare both our algorithms with the best known algorithms in the literature. To the best of our knowledge, this work gives the first simple and cumulative regret minimization algorithms for CBNs with general causal graphs under atomic interventions and having unobserved confounders.
    Diverse Weight Averaging for Out-of-Distribution Generalization. (arXiv:2205.09739v1 [cs.CV])
    Standard neural networks struggle to generalize under distribution shifts. For out-of-distribution generalization in computer vision, the best current approach averages the weights along a training run. In this paper, we propose Diverse Weight Averaging (DiWA) that makes a simple change to this strategy: DiWA averages the weights obtained from several independent training runs rather than from a single run. Perhaps surprisingly, averaging these weights performs well under soft constraints despite the network's nonlinearities. The main motivation behind DiWA is to increase the functional diversity across averaged models. Indeed, models obtained from different runs are more diverse than those collected along a single run thanks to differences in hyperparameters and training procedures. We motivate the need for diversity by a new bias-variance-covariance-locality decomposition of the expected error, exploiting similarities between DiWA and standard functional ensembling. Moreover, this decomposition highlights that DiWA succeeds when the variance term dominates, which we show happens when the marginal distribution changes at test time. Experimentally, DiWA consistently improves the state of the art on the competitive DomainBed benchmark without inference overhead.
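    The core operation of DiWA is simple enough to state directly: average, parameter by parameter, the weights of networks trained independently from a shared pre-trained initialization. Below is a minimal PyTorch sketch, assuming models with identical architectures (illustrative, not the authors' code).

    ```python
    import copy
    import torch

    def average_weights(models):
        """Uniformly average the parameters of several independently
        trained models sharing one architecture (DiWA-style)."""
        avg = copy.deepcopy(models[0])
        state = avg.state_dict()
        for key in state:
            stacked = torch.stack(
                [m.state_dict()[key].float() for m in models])
            state[key] = stacked.mean(dim=0)  # integer buffers are cast back on load
        avg.load_state_dict(state)
        return avg
    ```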
    Bayesian Network Structure Learning using Digital Annealer. (arXiv:2006.06926v3 [cs.LG] UPDATED)
    Annealing processors, which solve quadratic unconstrained binary optimization (QUBO) problems, are a potential breakthrough in improving the accuracy of score-based Bayesian network structure learning. However, the bit capacity of current annealing processors is very limited. To utilize the power of annealing processors, score-based learning problems must be encoded into QUBO within this bit budget. In this paper, we propose a novel approach based on the decomposition of candidate parent sets. Experimental results on benchmark networks with $37$ to $223$ variables show that our approach requires fewer bits than the bit capacity of the fourth-generation Fujitsu Digital Annealer, a fully coupled annealing processor developed with semiconductor technology. Moreover, we demonstrate that the Digital Annealer with our conversion method outperforms existing algorithms on some benchmark networks. We expect our approach to promote the utility of annealing processors in learning Bayesian networks.
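    For reference, a QUBO instance is the minimization of a quadratic form over binary variables, and score-based structure learning has to be encoded into exactly this shape within the annealer's bit budget:

    ```latex
    % A QUBO instance; linear terms sit on the diagonal of Q, since x_i^2 = x_i.
    \min_{x \in \{0,1\}^n} \; x^{\top} Q x, \qquad Q \in \mathbb{R}^{n \times n}
    ```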
    Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. (arXiv:2205.09702v1 [cs.LG])
    Graph neural networks (GNNs) are among the most powerful tools in deep learning. They routinely solve complex problems on unstructured networks, such as node classification, graph classification, or link prediction, with high accuracy. However, both inference and training of GNNs are complex, and they uniquely combine the features of irregular graph processing with dense and regular computations. This complexity makes it very challenging to execute GNNs efficiently on modern massively parallel architectures. To alleviate this, we first design a taxonomy of parallelism in GNNs, considering data and model parallelism, and different forms of pipelining. Then, we use this taxonomy to investigate the amount of parallelism in numerous GNN models, GNN-driven machine learning tasks, software frameworks, or hardware accelerators. We use the work-depth model, and we also assess communication volume and synchronization. We specifically focus on the sparsity/density of the associated tensors, in order to understand how to effectively apply techniques such as vectorization. We also formally analyze GNN pipelining, and we generalize the established Message-Passing class of GNN models to cover arbitrary pipeline depths, facilitating future optimizations. Finally, we investigate different forms of asynchronicity, navigating the path for future asynchronous parallel GNN pipelines. The outcomes of our analysis are synthesized in a set of insights that help to maximize GNN performance, and a comprehensive list of challenges and opportunities for further research into efficient GNN computations. Our work will help to advance the design of future GNNs.
    Flexible Modeling and Multitask Learning using Differentiable Tree Ensembles. (arXiv:2205.09717v1 [cs.LG])
    Decision tree ensembles are widely used and competitive learning models. Despite their success, popular toolkits for learning tree ensembles have limited modeling capabilities. For instance, these toolkits support a limited number of loss functions and are restricted to single-task learning. We propose a flexible framework for learning tree ensembles, which goes beyond existing toolkits to support arbitrary loss functions, missing responses, and multi-task learning. Our framework builds on differentiable (a.k.a. soft) tree ensembles, which can be trained using first-order methods. However, unlike classical trees, differentiable trees are difficult to scale. We therefore propose a novel tensor-based formulation of differentiable trees that allows for efficient vectorization on GPUs. We perform experiments on a collection of 28 real open-source and proprietary datasets, which demonstrate that our framework can lead to 100x more compact and 23% more expressive tree ensembles than those produced by popular toolkits.
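    To make "differentiable (a.k.a. soft) tree" concrete: each internal node routes a sample to both children with probabilities given by a sigmoid of a learned hyperplane, so the whole ensemble is trainable with first-order methods. Below is a minimal PyTorch sketch of a single soft tree, for illustration only; the paper's tensor-based formulation is what makes this scale on GPUs.

    ```python
    import torch
    import torch.nn as nn

    class SoftTree(nn.Module):
        """A depth-d soft decision tree: each sample reaches every leaf
        with a probability given by the product of sigmoid gates."""
        def __init__(self, in_dim, out_dim, depth=3):
            super().__init__()
            self.depth = depth
            self.splits = nn.Linear(in_dim, 2 ** depth - 1)  # one hyperplane per internal node
            self.leaves = nn.Parameter(torch.randn(2 ** depth, out_dim) * 0.01)

        def forward(self, x):
            gate = torch.sigmoid(self.splits(x))             # (batch, 2**depth - 1)
            prob = x.new_ones(x.size(0), 1)
            for d in range(self.depth):
                g = gate[:, 2 ** d - 1 : 2 ** (d + 1) - 1]   # gates at depth d
                prob = torch.stack([prob * g, prob * (1 - g)], dim=-1).flatten(1)
            return prob @ self.leaves                        # expected leaf value
    ```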
    Spherical Perspective on Learning with Normalization Layers. (arXiv:2006.13382v3 [cs.LG] UPDATED)
    Normalization Layers (NLs) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. This paper introduces a spherical framework to study the optimization of neural networks with NLs from a geometric perspective. Concretely, the radial invariance of groups of parameters, such as filters for convolutional neural networks, makes it possible to translate the optimization steps onto the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. First, the first expression for the effective learning rate of Adam is derived. The framework then shows that, in the presence of NLs, performing Stochastic Gradient Descent (SGD) alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, the analysis outlines phenomena that previous variants of Adam act on, and their importance in the optimization process is experimentally validated.
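    The geometric fact behind the framework is radial invariance: if a parameter group $w$ (e.g., a filter followed by a normalization layer) satisfies $f(rw) = f(w)$ for every $r > 0$, then differentiating with respect to $r$ at $r = 1$ gives

    ```latex
    \frac{d}{dr} f(r w) \Big|_{r=1} \;=\; \langle \nabla f(w),\, w \rangle \;=\; 0,
    ```

    so gradients are tangent to spheres and the optimization trajectory can be read on the unit hypersphere. This is a standard consequence of scale invariance, restated here for context.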
    Provably Precise, Succinct and Efficient Explanations for Decision Trees. (arXiv:2205.09569v1 [cs.AI])
    Decision trees (DTs) embody interpretable classifiers. DTs have been advocated for deployment in high-risk applications, but also for explaining other complex classifiers. Nevertheless, recent work has demonstrated that predictions in DTs ought to be explained with rigorous approaches. Although rigorous explanations can be computed in polynomial time for DTs, their size may be beyond the cognitive limits of human decision makers. This paper investigates the computation of {\delta}-relevant sets for DTs. {\delta}-relevant sets denote explanations that are succinct and provably precise. These sets represent generalizations of rigorous explanations, which are precise with probability one, and so they enable trading off explanation size for precision. The paper proposes two logic encodings for computing smallest {\delta}-relevant sets for DTs. The paper further devises a polynomial-time algorithm for computing {\delta}-relevant sets which are not guaranteed to be subset-minimal, but for which the experiments show to be most often subset-minimal in practice. The experimental results also demonstrate the practical efficiency of computing smallest {\delta}-relevant sets.
    Metrics of calibration for probabilistic predictions. (arXiv:2205.09680v1 [math.ST])
    Predictions are often probabilities; e.g., a prediction could be for precipitation tomorrow, but with only a 30% chance. Given such probabilistic predictions together with the actual outcomes, "reliability diagrams" help detect and diagnose statistically significant discrepancies -- so-called "miscalibration" -- between the predictions and the outcomes. The canonical reliability diagrams histogram the observed and expected values of the predictions; replacing the hard histogram binning with soft kernel density estimation is another common practice. But, which widths of bins or kernels are best? Plots of the cumulative differences between the observed and expected values largely avoid this question, by displaying miscalibration directly as the slopes of secant lines for the graphs. Slope is easy to perceive with quantitative precision, even when the constant offsets of the secant lines are irrelevant; there is no need to bin or perform kernel density estimation. The existing standard metrics of miscalibration each summarize a reliability diagram as a single scalar statistic. The cumulative plots naturally lead to scalar metrics for the deviation of the graph of cumulative differences away from zero; good calibration corresponds to a horizontal, flat graph which deviates little from zero. The cumulative approach is currently unconventional, yet offers many favorable statistical properties, guaranteed via mathematical theory backed by rigorous proofs and illustrative numerical examples. In particular, metrics based on binning or kernel density estimation unavoidably must trade-off statistical confidence for the ability to resolve variations as a function of the predicted probability or vice versa. Widening the bins or kernels averages away random noise while giving up some resolving power. Narrowing the bins or kernels enhances resolving power while not averaging away as much noise.
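    A minimal sketch of the cumulative-differences idea: sort by predicted probability, accumulate outcome-minus-prediction, and summarize the deviation of the resulting graph from zero. The normalization and the choice of scalar summary below are illustrative; the paper defines its metrics precisely.

    ```python
    import numpy as np

    def cumulative_miscalibration(p, y):
        """Cumulative differences between binary outcomes y and predicted
        probabilities p, ordered by p; returns the curve and the maximum
        absolute deviation from zero as a scalar summary."""
        order = np.argsort(p)
        curve = np.cumsum((y[order] - p[order]) / len(p))
        return curve, np.abs(curve).max()
    ```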
    Dexterous Robotic Manipulation using Deep Reinforcement Learning and Knowledge Transfer for Complex Sparse Reward-based Tasks. (arXiv:2205.09683v1 [cs.RO])
    This paper describes a deep reinforcement learning (DRL) approach that won Phase 1 of the Real Robot Challenge (RRC) 2021, and then extends this method to a more difficult manipulation task. The RRC consisted of using a TriFinger robot to manipulate a cube along a specified positional trajectory, but with no requirement for the cube to have any specific orientation. We used a relatively simple reward function, a combination of goal-based sparse reward and distance reward, in conjunction with Hindsight Experience Replay (HER) to guide the learning of the DRL agent (Deep Deterministic Policy Gradient (DDPG)). Our approach allowed our agents to acquire dexterous robotic manipulation strategies in simulation. These strategies were then applied to the real robot and outperformed all other competition submissions, including those using more traditional robotic control techniques, in the final evaluation stage of the RRC. Here we extend this method, by modifying the task of Phase 1 of the RRC to require the robot to maintain the cube in a particular orientation, while the cube is moved along the required positional trajectory. The requirement to also orient the cube makes the agent unable to learn the task through blind exploration due to increased problem complexity. To circumvent this issue, we make novel use of a Knowledge Transfer (KT) technique that allows the strategies learned by the agent in the original task (which was agnostic to cube orientation) to be transferred to this task (where orientation matters). KT allowed the agent to learn and perform the extended task in the simulator, which improved the average positional deviation from 0.134 m to 0.02 m, and average orientation deviation from 142° to 76° during evaluation. This KT concept shows good generalisation properties and could be applied to any actor-critic learning algorithm.
    Neural network topological snake models for locating general phase diagrams. (arXiv:2205.09699v1 [cond-mat.stat-mech])
    Machine learning for locating phase diagrams has received intensive research interest in recent years. However, its application to automatically locating phase diagrams has been limited to a single closed phase boundary. In this paper, in order to locate phase diagrams with multiple phases and complex boundaries, we introduce (i) a network-shaped snake model and (ii) a topologically transformable snake with discriminative cooperative networks, respectively. The phase diagrams of both quantum and classical spin-1 models are obtained. Our method is flexible enough to determine a phase diagram from just snapshots of configurations from cold-atom or other experiments.
    Semi-WTC: A Practical Semi-supervised Framework for Attack Categorization through Weight-Task Consistency. (arXiv:2205.09669v1 [cs.CR])
    Supervised learning has been widely used for attack detection, which requires large amounts of high-quality data and labels. However, the data is often imbalanced and sufficient annotations are difficult to obtain. Moreover, these supervised models are subject to real-world deployment issues, such as defending against unseen artificial attacks. To tackle these practical challenges, we propose a semi-supervised fine-grained attack categorization framework consisting of an encoder and a two-branch structure to integrate information from labeled and unlabeled data. This framework can be generalized to different supervised models. A multilayer perceptron with residual connections and batch normalization is used as the encoder to extract features and reduce complexity. The Recurrent Prototype Module (RPM) is proposed to train the encoder effectively in a semi-supervised manner. To alleviate the problem of data imbalance, we introduce the Weight-Task Consistency (WTC) into the iterative process of RPM by assigning larger weights in the loss function to classes with fewer samples. In addition, to cope with new attacks in real-world deployment, we further propose an Active Adaption Resampling (AAR) method, which can better discover the distribution of unseen sample data and adapt the parameters of the encoder. Experimental results show that our model outperforms state-of-the-art semi-supervised attack detection methods, with a general 5% improvement in classification accuracy and a 90% reduction in training time.
    Discovering Dynamic Functional Brain Networks via Spatial and Channel-wise Attention. (arXiv:2205.09576v1 [cs.CV])
    Using deep learning models to recognize functional brain networks (FBNs) in functional magnetic resonance imaging (fMRI) has been attracting increasing interest recently. However, most existing work focuses on detecting static FBNs from entire fMRI signals, such as correlation-based functional connectivity. The sliding window is a widely used strategy to capture the dynamics of FBNs, but it is still limited in representing intrinsic functional interactive dynamics at each time step. Moreover, the number of FBNs usually needs to be set manually. Furthermore, due to the complexity of dynamic interactions in the brain, traditional linear and shallow models are insufficient for identifying complex and spatially overlapping FBNs across time steps. In this paper, we propose a novel Spatial and Channel-wise Attention Autoencoder (SCAAE) for discovering FBNs dynamically. The core idea of SCAAE is to apply an attention mechanism to FBN construction. Specifically, we designed two attention modules: 1) a spatial-wise attention (SA) module to discover FBNs in the spatial domain and 2) a channel-wise attention (CA) module to weigh the channels and select the FBNs automatically. We evaluated our approach on the ADHD200 dataset, and our results indicate that the proposed SCAAE method can effectively recover the dynamic changes of the FBNs at each fMRI time step without using sliding windows. More importantly, our proposed hybrid attention modules (SA and CA) do not enforce assumptions of linearity and independence as previous methods do, and thus provide a novel approach to better understanding dynamic functional brain networks.
    Jacobian Granger Causal Neural Networks for Analysis of Stationary and Nonstationary Data. (arXiv:2205.09573v1 [cs.LG])
    Granger causality is a commonly used method for uncovering information flow and dependencies in a time series. Here we introduce JGC (Jacobian Granger Causality), a neural network-based approach to Granger causality using the Jacobian as a measure of variable importance, and propose a thresholding procedure for inferring Granger causal variables using this measure. The resulting approach performs consistently well compared to other approaches in identifying Granger causal variables, the associated time lags, as well as interaction signs. Lastly, through the inclusion of a time variable, we show that this approach is able to learn the temporal dependencies for nonstationary systems whose Granger causal structures change in time.
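    Schematically, using the Jacobian as a measure of variable importance can look like the sketch below, assuming a network that maps (lagged) input variables to a scalar prediction; JGC's thresholding procedure and lag handling are not shown.

    ```python
    import torch

    def jacobian_importance(model, x):
        """Mean absolute gradient of the prediction w.r.t. each input
        dimension, used as a Granger-causal importance score."""
        x = x.clone().requires_grad_(True)     # e.g. shape (batch, lags * n_vars)
        out = model(x).sum()                   # scalar: one backward pass suffices
        (jac,) = torch.autograd.grad(out, x)
        return jac.abs().mean(dim=0)           # importance per input dimension
    ```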
    Detect Professional Malicious User with Metric Learning in Recommender Systems. (arXiv:2205.09673v1 [cs.IR])
    In e-commerce, online retailers often suffer from professional malicious users (PMUs), who deliberately leave negative reviews and low ratings on their purchased products to threaten retailers for illegal profit. Specifically, there are three challenges for PMU detection: 1) professional malicious users do not conduct any obviously abnormal or illegal interactions (they never leave too many negative reviews and low ratings at the same time), and they adopt masking strategies to disguise themselves; conventional outlier detection methods are therefore confused by these masking strategies. 2) PMU detection models should take both ratings and reviews into consideration, which makes PMU detection a multi-modal problem. 3) there are no public datasets with labels for professional malicious users, which makes PMU detection an unsupervised learning problem. To this end, we propose an unsupervised multi-modal learning model: MMD, which employs Metric learning for professional Malicious user Detection with both ratings and reviews. MMD first utilizes a modified RNN to project each informational review into a sentiment score, which jointly considers the ratings and reviews. Then professional malicious user profiling (MUP) is proposed to catch the sentiment gap between sentiment scores and ratings. MUP filters the users and builds a candidate PMU set. We then apply metric learning-based clustering to learn a proper metric matrix for PMU detection. Finally, we utilize this metric and the labeled users to detect PMUs. Specifically, we apply an attention mechanism in metric learning to improve the model's performance. Extensive experiments on four datasets demonstrate that our proposed method can solve this unsupervised detection problem. Moreover, the performance of state-of-the-art recommender models is enhanced by taking MMD as a preprocessing stage.
    Are Graph Representation Learning Methods Robust to Graph Sparsity and Asymmetric Node Information?. (arXiv:2205.09648v1 [cs.LG])
    The growing popularity of Graph Representation Learning (GRL) methods has resulted in the development of a large number of models applied to a miscellany of domains. Behind this diversity of domains lies a strong heterogeneity of graphs, making it difficult to estimate the expected performance of a model on a new graph, especially when the graph has distinctive characteristics not yet encountered in benchmarks. To address this, we have developed an experimental pipeline to assess the impact of a given property on model performance. In this paper, we use this pipeline to study the effect of two specificities encountered in banks' transactional graphs, resulting from the partial view a bank has of all the individuals and transactions carried out on the market. These specific features are graph sparsity and asymmetric node information. This study demonstrates the robustness of GRL methods to these distinctive characteristics. We believe that this work can ease the evaluation of GRL methods with respect to specific characteristics and foster the development of such methods on transactional graphs.
    HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering. (arXiv:2205.09721v1 [cs.LG])
    The problem of fitting distances by tree-metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree-metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces which transforms the original data into data that is "more" tree-like, when evaluated in terms of Gromov's $\delta$ hyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, $\ell_p$ norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge-weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics while the real-world datasets include Zoo, Iris, Glass, Segmentation and SpamBase; on these datasets, the average improvement with respect to NJ is $125.94\%$.
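    For context, Gromov's four-point condition: a metric $d$ is $\delta$-hyperbolic if for all points $w, x, y, z$,

    ```latex
    d(w,x) + d(y,z) \;\le\; \max\bigl\{\, d(w,y) + d(x,z),\; d(w,z) + d(x,y) \,\bigr\} + 2\delta,
    ```

    and tree metrics are exactly the $0$-hyperbolic ones, which is why a smaller $\delta$ after denoising means the data is "more" tree-like.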
    Extract Dynamic Information To Improve Time Series Modeling: a Case Study with Scientific Workflow. (arXiv:2205.09703v1 [cs.LG])
    In modeling time series data, we often need to augment the existing data records to increase modeling accuracy. In this work, we describe a number of techniques to extract dynamic information about the current state of a large scientific workflow, which could be generalized to other types of applications. The specific task to be modeled is the time needed for transferring a file from an experimental facility to a data center. The key idea of our approach is to find recent past data transfer events that match the current event in some ways. Tests showed that we could identify recent events matching some recorded properties and reduce the prediction error by about 12% compared to similar models with only static features. We additionally explored an application-specific technique to extract information about the data production process, and were able to reduce the average prediction error by 44%.
    What killed the Convex Booster ?. (arXiv:2205.09628v1 [cs.LG])
    A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high-precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call upon the half-century-plus founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results, and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general standpoint on the problem, as we argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML: parameterisation.
    EXACT: How to Train Your Accuracy. (arXiv:2205.09615v1 [cs.LG])
    Classification tasks are usually evaluated in terms of accuracy. However, accuracy is discontinuous and cannot be directly optimized using gradient ascent. Popular methods minimize cross-entropy, Hinge loss, or other surrogate losses, which can lead to suboptimal results. In this paper, we propose a new optimization framework by introducing stochasticity to a model's output and optimizing expected accuracy, i.e. accuracy of the stochastic model. Extensive experiments on image classification show that the proposed optimization method is a powerful alternative to widely used classification losses.
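    EXACT's precise estimator is the paper's contribution; the sketch below only illustrates the general recipe of injecting noise into the model's outputs and optimizing a smoothed expected accuracy. The Gaussian noise, the soft-argmax relaxation, and all names are our illustrative assumptions.

    ```python
    import torch

    def expected_accuracy_loss(logits, labels, noise_std=1.0, n_samples=32, tau=0.1):
        """Monte-Carlo estimate of 1 - expected accuracy of a stochastic
        classifier whose logits receive Gaussian noise; a sharp softmax
        stands in for the non-differentiable argmax."""
        noisy = logits.unsqueeze(0) + noise_std * torch.randn(
            n_samples, *logits.shape, device=logits.device)
        soft_argmax = torch.softmax(noisy / tau, dim=-1)       # ~ one-hot argmax
        idx = labels.view(1, -1, 1).expand(n_samples, -1, 1)   # labels: LongTensor (batch,)
        p_correct = soft_argmax.gather(-1, idx).squeeze(-1)
        return 1.0 - p_correct.mean()                          # minimize 1 - E[accuracy]
    ```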
    Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT. (arXiv:2205.09651v1 [cs.CL])
    This paper presents Wojood, a corpus for Arabic nested Named Entity Recognition (NER). Nested entities occur when one entity mention is embedded inside another entity mention. Wojood consists of about 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types including person, organization, location, event and date. More importantly, the corpus is annotated with nested entities instead of the more common flat annotations. The data contains about 75K entities, 22.5% of which are nested. The inter-annotator evaluation of the corpus demonstrated a strong agreement, with Cohen's Kappa of 0.979 and an F1-score of 0.976. To validate our data, we used the corpus to train a nested NER model based on multi-task learning and AraBERT (Arabic BERT). The model achieved an overall micro F1-score of 0.884. Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
    The AI Mechanic: Acoustic Vehicle Characterization Neural Networks. (arXiv:2205.09667v1 [cs.SD])
    In a world increasingly dependent on road-based transportation, it is essential to understand vehicles. We introduce the AI mechanic, an acoustic vehicle characterization deep learning system, as an integrated approach using sound captured from mobile devices to enhance transparency and understanding of vehicles and their condition for non-expert users. We develop and implement novel cascading architectures for vehicle understanding, which we define as sequential, conditional, multi-level networks that process raw audio to extract highly-granular insights. To showcase the viability of cascading architectures, we build a multi-task convolutional neural network that predicts and cascades vehicle attributes to enhance fault detection. We train and test these models on a synthesized dataset reflecting more than 40 hours of augmented audio and achieve >92% validation set accuracy on attributes (fuel type, engine configuration, cylinder count and aspiration type). Our cascading architecture additionally achieved 93.6% validation and 86.8% test set accuracy on misfire fault prediction, demonstrating margins of 16.4% / 7.8% and 4.2% / 1.5% improvement over naïve and parallel baselines. We explore experimental studies focused on acoustic features, data augmentation, feature fusion, and data reliability. Finally, we conclude with a discussion of broader implications, future directions, and application areas for this work.
    ArabGlossBERT: Fine-Tuning BERT on Context-Gloss Pairs for WSD. (arXiv:2205.09685v1 [cs.CL])
    Using pre-trained transformer models such as BERT has proven to be effective in many NLP tasks. This paper presents our work to fine-tune BERT models for Arabic Word Sense Disambiguation (WSD). We treated the WSD task as a sentence-pair binary classification task. First, we constructed a dataset of labeled Arabic context-gloss pairs (~167k pairs) extracted from the Arabic Ontology and the large lexicographic database available at Birzeit University. Each pair was labeled as True or False, and target words in each context were identified and annotated. Second, we used this dataset to fine-tune three pre-trained Arabic BERT models. Third, we experimented with different supervised signals to emphasize target words in context. Our experiments achieved promising results (accuracy of 84%) even though we used a large set of senses in the experiment.
    A Topological Approach for Semi-Supervised Learning. (arXiv:2205.09617v1 [cs.CV])
    Nowadays, Machine Learning and Deep Learning methods have become the state-of-the-art approach to solve data classification tasks. In order to use these methods, it is necessary to acquire and label a considerable amount of data; however, this is not straightforward in some fields, since data annotation is time-consuming and might require expert knowledge. This challenge can be tackled by means of semi-supervised learning methods that take advantage of both labelled and unlabelled data. In this work, we present new semi-supervised learning methods based on techniques from Topological Data Analysis (TDA), a field that is gaining importance for analysing large amounts of data with high variety and dimensionality. In particular, we have created two semi-supervised learning methods following two different topological approaches. In the former, we use a homological approach that consists in studying the persistence diagrams associated with the data, using the Bottleneck and Wasserstein distances. In the latter, we take into account the connectivity of the data. In addition, we have carried out a thorough analysis of the developed methods using 3 synthetic datasets, 5 structured datasets, and 2 datasets of images. The results show that the semi-supervised methods developed in this work outperform both models trained with only manually labelled data and classical semi-supervised learning methods, reaching improvements of up to 16%.
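    To make the homological approach concrete, here is a hedged sketch of comparing two point clouds through their persistence diagrams and the bottleneck distance, using the gudhi library; all parameters are illustrative, and the paper additionally uses the Wasserstein distance and a connectivity-based variant.

    ```python
    import gudhi
    import numpy as np

    def persistence_diagram(points, dim=1, max_edge=2.0):
        """H_dim persistence diagram of a point cloud via a Rips filtration."""
        rips = gudhi.RipsComplex(points=points, max_edge_length=max_edge)
        st = rips.create_simplex_tree(max_dimension=dim + 1)
        st.persistence()                             # compute all persistence pairs
        diag = st.persistence_intervals_in_dimension(dim)
        return diag[np.isfinite(diag).all(axis=1)]   # drop essential (infinite) bars

    d1 = persistence_diagram(np.random.rand(50, 2))
    d2 = persistence_diagram(np.random.rand(50, 2))
    print(gudhi.bottleneck_distance(d1, d2))
    ```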
    Data Valuation for Offline Reinforcement Learning. (arXiv:2205.09550v1 [cs.LG])
    The success of deep reinforcement learning (DRL) hinges on the availability of training data, which is typically obtained via a large number of environment interactions. In many real-world scenarios, costs and risks are associated with gathering these data. The field of offline reinforcement learning addresses these issues by outsourcing the collection of data to a domain expert or a carefully monitored program and subsequently searching for a batch-constrained optimal policy. With the emergence of data markets, an alternative to constructing a dataset in-house is to purchase external data. However, while state-of-the-art offline reinforcement learning approaches have shown a lot of promise, they currently rely on carefully constructed datasets that are well aligned with the intended target domains. This raises questions regarding the transferability and robustness of an offline reinforcement learning agent trained on externally acquired data. In this paper, we empirically evaluate the ability of current state-of-the-art offline reinforcement learning approaches to cope with source-target domain mismatch within two MuJoCo environments, finding that these algorithms underperform in the target domain. To address this, we propose data valuation for offline reinforcement learning (DVORL), which allows us to identify relevant and high-quality transitions, improving the performance and transferability of policies learned by offline reinforcement learning algorithms. The results show that our method outperforms offline reinforcement learning baselines on two MuJoCo environments.
    Focused Adversarial Attacks. (arXiv:2205.09624v1 [cs.LG])
    Recent advances in machine learning show that neural models are vulnerable to minimally perturbed inputs, or adversarial examples. Adversarial algorithms are optimization problems that minimize the accuracy of ML models by perturbing inputs, often using a model's loss function to craft such perturbations. State-of-the-art object detection models are characterized by very large output manifolds due to the number of possible locations and sizes of objects in an image. This leads to their outputs being sparse, and optimization problems that use them incur a lot of unnecessary computation. We propose to use a very limited subset of a model's learned manifold to compute adversarial examples. Our Focused Adversarial Attacks (FA) algorithm identifies a small subset of sensitive regions to perform gradient-based adversarial attacks. FA is significantly faster than other gradient-based attacks when a model's manifold is sparsely activated. Also, its perturbations are more efficient than other methods under the same perturbation constraints. We evaluate FA on the COCO 2017 and Pascal VOC 2007 detection datasets.
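    FA's actual contribution is how the sensitive regions are identified; the generic idea of confining a gradient attack to a region can be sketched as a masked FGSM step (the names and the single-step attack are our assumptions, not the paper's algorithm).

    ```python
    import torch

    def masked_fgsm(model, loss_fn, x, y, mask, eps=8 / 255):
        """One FGSM step restricted to a binary spatial mask, i.e. a
        gradient attack focused on a chosen region of the input."""
        x = x.clone().requires_grad_(True)
        loss_fn(model(x), y).backward()
        x_adv = x + eps * x.grad.sign() * mask   # perturb only where mask == 1
        return x_adv.detach().clamp(0, 1)        # assumes inputs scaled to [0, 1]
    ```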
    How catastrophic can catastrophic forgetting be in linear regression?. (arXiv:2205.09588v1 [cs.LG])
    To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when $T$ tasks in $d$ dimensions are presented cyclically for $k$ iterations, we prove an upper bound of $T^2 \cdot \min\{1/\sqrt{k},\, d/k\}$ on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the $T^2$ factor can be lifted when tasks are presented in a random ordering.
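    The connection to the Kaczmarz method can be stated in one line: in the overparameterized regime, training task $t$ with data $(X_t, y_t)$ to zero loss from current weights $w$ (with the minimum-norm update) lands on the orthogonal projection of $w$ onto that task's solution set,

    ```latex
    w^{+} \;=\; w + X_t^{\dagger} \,(y_t - X_t w),
    ```

    which is exactly a (block-)Kaczmarz, i.e. alternating-projection, step between the affine sets $\{w : X_t w = y_t\}$.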
    Disentangling Active and Passive Cosponsorship in the U.S. Congress. (arXiv:2205.09674v1 [cs.LG])
    In the U.S. Congress, legislators can use active and passive cosponsorship to support bills. We show that these two types of cosponsorship are driven by two different motivations: the backing of political colleagues and the backing of the bill's content. To this end, we develop an Encoder+RGCN based model that learns legislator representations from bill texts and speech transcripts. These representations predict active and passive cosponsorship with an F1-score of 0.88. Applying our representations to predict voting decisions, we show that they are interpretable and generalize to unseen tasks.
    Smooth densities and generative modeling with unsupervised random forests. (arXiv:2205.09435v1 [stat.ML])
    Density estimation is a fundamental problem in statistics, and any attempt to do so in high dimensions typically requires strong assumptions or complex deep learning architectures. An important application for density estimators is synthetic data generation, an area currently dominated by neural networks that often demand enormous training datasets and extensive tuning. We propose a new method based on unsupervised random forests for estimating smooth densities in arbitrary dimensions without parametric constraints, as well as generating realistic synthetic data. We prove the consistency of our approach and demonstrate its advantages over existing tree-based density estimators, which generally rely on ill-chosen split criteria and do not scale well with data dimensionality. Experiments illustrate that our algorithm compares favorably to state-of-the-art deep learning generative models, achieving superior performance in a range of benchmark trials while executing about two orders of magnitude faster on average. Our method is implemented in easy-to-use $\texttt{R}$ and Python packages.
    IFTT-PIN: A PIN-Entry Method Leveraging the Self-Calibration Paradigm. (arXiv:2205.09534v1 [cs.HC])
    IFTT-PIN is a self-calibrating version of the PIN-entry method introduced in Roth et al. (2004) [1]. In [1], digits are split into two sets and assigned a color respectively. To communicate their digit, users press the button with the same color that is assigned to their digit, which can thus be identified by elimination after a few iterations. IFTT-PIN uses the same principle but does not pre-assign colors to each button. Instead, users are free to choose which button to use for each color. The button-to-color mapping only exists in the user's mind and is never directly communicated to the interface. In other words, IFTT-PIN infers both the user's PIN and their preferred button-to-color mapping at the same time, a process called self-calibration. In this paper, we present online interactive demonstrations of IFTT-PIN (available at https://github.com/jgrizou/IFTT-PIN), with and without self-calibration, and introduce the key concepts and assumptions making self-calibration possible. We review related work in the field of brain-computer interface and further propose self-calibration as a novel approach to protect users against shoulder surfing attacks. Finally, we introduce a vault cracking challenge as a test of usability and security that was informally tested at our institute. With IFTT-PIN, we wish to demonstrate a new interactive experience where users can decide actively and on-the-fly how to use an interface. The self-calibration paradigm might lead to novel opportunities for interaction in other applications or domains. We hope this work will inspire the community to invent them.
    Data-driven prediction of Air Traffic Controllers reactions to resolving conflicts. (arXiv:2205.09539v1 [cs.AI])
    With the aim of enhancing automation in conflict detection and resolution (CD&R) tasks in the Air Traffic Management domain, in this paper we propose deep learning (DL) techniques that can learn models of Air Traffic Controllers' (ATCO) reactions in resolving conflicts that can violate separation minimum constraints among aircraft trajectories; this implies learning when the ATCO will react to resolve a conflict, and how he/she will react. This paper focuses on timely reactions, i.e., on when reactions happen: as a trajectory evolves, we aim to predict the trajectory points at which the ATCO issues a conflict resolution action, while also predicting the type of resolution action (if any). Towards this goal, the paper formulates the ATCO reaction prediction problem for CD&R and presents DL methods that can model ATCO timely reactions, evaluating these methods on real-world data sets and showing their efficacy in prediction with very high accuracy.
    Parallel bandit architecture based on laser chaos for reinforcement learning. (arXiv:2205.09543v1 [cs.ET])
Accelerating artificial intelligence with photonics is an active field of study aiming to exploit the unique properties of photons. Reinforcement learning is an important branch of machine learning, and photonic decision-making principles have been demonstrated on multi-armed bandit problems. However, reinforcement learning can involve a massive number of states, unlike previously demonstrated bandit problems, where the number of states is only one. Q-learning is a well-known approach in reinforcement learning that can deal with many states; its architecture, however, does not fit photonic implementations well because it separates the update rule from action selection. In this study, we propose a new architecture for multi-state reinforcement learning, organized as a parallel array of bandit problems in order to benefit from photonic decision-makers, which we call parallel bandit architecture for reinforcement learning, or PBRL for short. Taking the cart-pole balancing problem as an example, we demonstrate that PBRL adapts to the environment in fewer time steps than Q-learning. Furthermore, PBRL yields faster adaptation when operated with a chaotic laser time series than with uniformly distributed pseudorandom numbers, as the autocorrelation inherent in the laser chaos provides a positive effect. We also find that the variety of states the system undergoes during the learning phase exhibits completely different properties between PBRL and Q-learning. The insights obtained in this study are also beneficial for existing computing platforms, not just photonic realizations, in accelerating performance through PBRL-style algorithms and correlated random sequences.
    Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud Clusters. (arXiv:2205.09470v1 [cs.LG])
The ever-growing model size and scale of compute have attracted increasing interest in training deep learning models over multiple nodes. However, training on cloud clusters, especially across remote clusters, poses huge challenges. In this work, we introduce a general framework, Nebula-I, for collaboratively training deep learning models over remote heterogeneous clusters connected by low-bandwidth wide area networks (WANs). We take natural language processing (NLP) as an example to show how Nebula-I works in different training phases, including: a) pre-training a multilingual language model using two remote clusters; and b) fine-tuning a machine translation model using knowledge distilled from pre-trained models, which together follow the most popular paradigm of recent deep learning. To balance accuracy and communication efficiency, Nebula-I jointly applies parameter-efficient training strategies, hybrid parallel computing methods and adaptive communication acceleration techniques. Meanwhile, security strategies are employed to guarantee the safety, reliability and privacy of intra-cluster computation and inter-cluster communication. Nebula-I is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware, e.g. GPUs and NPUs. Experiments demonstrate that the proposed framework substantially improves training efficiency while preserving satisfactory NLP performance. By using Nebula-I, users can run large-scale training tasks over cloud clusters with minimal development effort, and the utility of existing large pre-trained models can be further exploited. We also report new state-of-the-art results on cross-lingual natural language inference tasks, obtained with a novel learning framework and Nebula-I.
The Impact of COVID-19 Pandemic on LGBTQ Online Communities. (arXiv:2205.09511v1 [cs.SI])
The COVID-19 pandemic has disproportionately impacted the lives of minorities, such as members of the LGBTQ community (lesbian, gay, bisexual, transgender, and queer), due to pre-existing social disadvantages and health disparities. Although extensive research has been carried out on the impact of the COVID-19 pandemic on different aspects of the general population's lives, few studies have focused on the LGBTQ population. In this paper, we identify a group of Twitter users who self-disclose as belonging to the LGBTQ community. We develop and evaluate two sets of machine learning classifiers, using a pre-pandemic and a during-pandemic dataset, to identify Twitter posts exhibiting minority stress, a unique pressure faced by members of the LGBTQ population due to their sexual and gender identities. For this task, we collect a set of 20,593,823 posts by 7,241 self-disclosed LGBTQ users and annotate a randomly selected subset of 2,800 posts. We demonstrate that our best pre-pandemic and during-pandemic models show strong and stable performance in detecting posts that contain minority stress. We investigate the linguistic differences in minority stress posts across the pre- and during-pandemic periods and find that anger words are strongly associated with minority stress during the COVID-19 pandemic. We explore the impact of the pandemic on the emotional states of the LGBTQ population by conducting controlled comparisons with the general population, adopting propensity score-based matching to perform a causal analysis. The results show that the LGBTQ population exhibits a greater increase in the usage of cognitive words and a greater deterioration in the usage of positive emotion words than a general-population group with similar pre-pandemic behavioral attributes.
    Transformers as Neural Augmentors: Class Conditional Sentence Generation via Variational Bayes. (arXiv:2205.09391v1 [cs.CL])
Data augmentation methods for Natural Language Processing tasks have been explored in recent years; however, they remain limited, and it is hard to capture diversity at the sentence level. Besides, it is not always possible to perform data augmentation on supervised tasks. To address these problems, we propose a neural data augmentation method that combines a Conditional Variational Autoencoder with an encoder-decoder Transformer model. While encoding and decoding the input sentence, our model captures the syntactic and semantic representation of the input language together with its class condition. Following developments in pre-trained language models over the past years, we train and evaluate our models on several benchmarks to strengthen downstream tasks. We compare our method with 3 different augmentation techniques. The presented results show that our model increases the performance of current models compared to other data augmentation techniques, with a small amount of computational power.
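    The class-conditional ELBO at the core of such a model can be sketched compactly. The following uses toy MLPs over a bag-of-ids input purely to show the conditional VAE objective; the vocabulary, hidden and class sizes are assumptions, and the authors' actual model uses a Transformer encoder-decoder.

```python
# Minimal class-conditional VAE objective (sketch, not the paper's model).
import torch, torch.nn as nn

V, H, Z, C = 1000, 128, 32, 4   # vocab, hidden, latent, classes (assumed)

class CVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(V + C, H)
        self.mu, self.logvar = nn.Linear(H, Z), nn.Linear(H, Z)
        self.dec = nn.Sequential(nn.Linear(Z + C, H), nn.ReLU(), nn.Linear(H, V))

    def forward(self, x_bow, y_onehot):
        h = torch.relu(self.enc(torch.cat([x_bow, y_onehot], -1)))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        logits = self.dec(torch.cat([z, y_onehot], -1))
        recon = -(x_bow * torch.log_softmax(logits, -1)).sum(-1).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon + kl   # negative conditional ELBO

model = CVAE()
x = torch.rand(8, V); y = torch.eye(C)[torch.randint(0, C, (8,))]
loss = model(x, y); loss.backward()
print(float(loss))
```

    Conditioning both encoder and decoder on the class label is what lets the trained decoder generate label-consistent augmentations by sampling z from the prior.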
    Action Conditioned Tactile Prediction: a case study on slip prediction. (arXiv:2205.09430v1 [cs.RO])
Tactile predictive models can be useful across several robotic manipulation tasks, e.g. robotic pushing, robotic grasping, slip avoidance, and in-hand manipulation. However, available tactile prediction models have mostly been studied for image-based tactile sensors, and there is no comparison study indicating the best-performing models. In this paper, we present two novel data-driven action-conditioned models for predicting tactile signals during real-world physical robot interaction tasks: (1) an action-conditioned tactile prediction model and (2) an action-conditioned tactile-video prediction model. We use a magnetic-based tactile sensor, whose signals are challenging to analyse, to test state-of-the-art predictive models and the only existing bespoke tactile prediction model, and we compare the performance of these models with that of our proposed models. We perform the comparison study using our novel tactile-enabled dataset containing 51,000 tactile frames of a real-world robotic manipulation task with 11 flat-surfaced household objects. Our experimental results demonstrate the superiority of our proposed tactile prediction models in terms of qualitative, quantitative and slip prediction scores.
    A Boosting Algorithm for Positive-Unlabeled Learning. (arXiv:2205.09485v1 [cs.LG])
Positive-unlabeled (PU) learning deals with binary classification problems when only positive (P) and unlabeled (U) data are available. Many PU methods based on linear models and neural networks have been proposed; however, little work has studied how theoretically sound boosting-style algorithms could work with P and U data. Motivated by scenarios in which neural networks cannot perform as well as boosting algorithms even with fully supervised data, we propose a novel boosting algorithm for PU learning, Ada-PU, and compare it against neural networks. Ada-PU follows the general procedure of AdaBoost while two different distributions of P data are maintained and updated. After a weak classifier is learned on the newly updated distribution, the corresponding combining weight for the final ensemble is estimated using only PU data. We show that, with a smaller set of base classifiers, the proposed method is guaranteed to keep the theoretical properties of boosting. In experiments, we show that Ada-PU outperforms neural networks on benchmark PU datasets. We also study UNSW-NB15, a real-world dataset in cyber security, and demonstrate that Ada-PU has superior performance for malicious activity detection.
    Predictive Maintenance using Machine Learning. (arXiv:2205.09402v1 [cs.LG])
Predictive maintenance (PdM) is a concept used to effectively manage maintenance plans for assets by predicting their failures with data-driven techniques. In these scenarios, data is collected over a certain period of time to monitor the state of equipment. The objective is to find correlations and patterns that can help predict and ultimately prevent failures. Equipment in the manufacturing industry is often utilized without a planned maintenance approach. Such practice frequently results in unexpected downtime owing to unexpected failures. In scheduled maintenance, the condition of the manufacturing equipment is checked after a fixed time interval, and if any fault occurs, the component is replaced to avoid unexpected equipment stoppages. On the flip side, this increases the time for which the machine is out of service and the cost of carrying out the maintenance. The emergence of Industry 4.0 and smart systems has led to increasing emphasis on predictive maintenance (PdM) strategies that can reduce the cost of downtime and increase the availability (utilization rate) of manufacturing equipment. PdM also has the potential to bring about new sustainable practices in manufacturing by fully utilizing the useful lives of components.
    Simple Regularisation for Uncertainty-Aware Knowledge Distillation. (arXiv:2205.09526v1 [cs.LG])
Considering uncertainty estimation of modern neural networks (NNs) is one of the most important steps towards deploying machine learning systems in meaningful real-world applications such as medicine, finance or autonomous systems. At the moment, ensembles of different NNs constitute the state of the art in both accuracy and uncertainty estimation across different tasks. However, ensembles of NNs are impractical under real-world constraints, since their computation and memory consumption scale linearly with the size of the ensemble, increasing their latency and deployment cost. In this work, we examine a simple regularisation approach for distribution-free knowledge distillation of an ensemble of machine learning models into a single NN. The aim of the regularisation is to preserve the diversity, accuracy and uncertainty estimation characteristics of the original ensemble without any intricacies, such as fine-tuning. We demonstrate the generality of the approach on combinations of toy data, SVHN/CIFAR-10, simple to complex NN architectures and different tasks.
    Threshold Designer Adaptation: Improved Adaptation for Designers in Co-creative Systems. (arXiv:2205.09269v1 [cs.LG])
    To best assist human designers with different styles, Machine Learning (ML) systems need to be able to adapt to them. However, there has been relatively little prior work on how and when to best adapt an ML system to a co-designer. In this paper we present threshold designer adaptation: a novel method for adapting a creative ML model to an individual designer. We evaluate our approach with a human subject study using a co-creative rhythm game design tool. We find that designers prefer our proposed method and produce higher quality content in comparison to an existing baseline.
    CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies. (arXiv:2205.09433v1 [cs.LG])
Reinforcement Learning has drawn huge interest as a tool for solving optimal control problems. Solving a given problem (task or environment) involves converging towards an optimal policy. However, there might exist multiple optimal policies that differ dramatically in their behaviour; for example, some may be faster than others but at the expense of greater risk. We consider and study a distribution of optimal policies. We design a curiosity-augmented Metropolis algorithm (CAMEO) that samples optimal policies whose behaviours are effectively diverse, which implies greater coverage of the different possible optimal policies. In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems, even in the challenging case of environments that provide sparse rewards. We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability, and represent a first step towards learning the distribution of optimal policies itself.
    An Approach to Investigate Public Opinion, Views, and Perspectives Towards Exoskeleton Technology. (arXiv:2205.09151v1 [cs.HC])
Over the last decade, exoskeletons have had an extensive impact on different disciplines and application domains such as assisted living, military, healthcare, firefighting, and industries, on account of their diverse and dynamic functionalities to augment human abilities, stamina, potential, and performance in a multitude of ways. In view of this wide-scale applicability and the use-cases of exoskeletons, it is crucial to investigate and analyze the public opinion, views, and perspectives towards exoskeletons, which would help to interpret the effectiveness of the underlying human-robot, human-machine, and human-technology interactions. The Internet of Everything era of today's living, characterized by people spending more time on the internet than ever before, holds the potential for such an investigation by mining and analyzing relevant web behavior, specifically from social media, that can be interpreted to understand public opinion, views, and perspectives towards a topic or set of topics. Therefore, this paper aims to address this research challenge related to exoskeletons by utilizing the potential of web behavior-based Big Data mining in the modern-day Internet of Everything era. As Twitter is one of the most popular social media platforms on a global scale - characterized by both the number of users and the amount of time spent by its users on the platform - this work focused on investigating web behavior on Twitter to interpret the public opinion, views, and perspectives towards exoskeleton technology. A total of approximately 20,000 tweets related to exoskeletons were used to evaluate the effectiveness of the proposed approach. The results presented and discussed uphold the efficacy of the proposed approach to interpret and analyze the public opinion, views, and perspectives towards exoskeletons from the associated tweets.
    COVID-19 Monitoring System using Social Distancing and Face Mask Detection on Surveillance video datasets. (arXiv:2110.03905v2 [cs.CV] UPDATED)
In the current times, the fear and danger of the COVID-19 virus still loom large. Manual monitoring of social distancing norms is impractical with a large population moving about and with an insufficient workforce and resources to enforce them. There is a need for a lightweight, robust and 24X7 video-monitoring system that automates this process. This paper proposes a comprehensive and effective solution to perform person detection, social distancing violation detection, face detection and face mask classification using object detection, clustering and a Convolutional Neural Network (CNN) based binary classifier. For this, YOLOv3, Density-based spatial clustering of applications with noise (DBSCAN), the Dual Shot Face Detector (DSFD) and a MobileNetV2-based binary classifier have been employed on surveillance video datasets. This paper also provides a comparative study of different face detection and face mask classification models. Finally, a video dataset labelling method is proposed, along with the labelled video dataset, to compensate for the lack of such datasets in the community, and it is used for evaluation of the system. The system performance is evaluated in terms of accuracy, F1 score as well as the prediction time, which has to be low for practical applicability. The system performs with an accuracy of 91.2% and an F1 score of 90.79% on the labelled video dataset and has an average prediction time of 7.12 seconds for 78 frames of a video.
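    The distancing-violation step can be sketched directly: cluster detected person centroids with DBSCAN so that any cluster of two or more people closer than a minimum distance is flagged. The detector output is mocked below, and the distance threshold is an illustrative assumption, not the paper's calibrated value.

```python
# Sketch of social-distancing violation detection via DBSCAN on centroids.
import numpy as np
from sklearn.cluster import DBSCAN

# (x, y) ground-plane centroids that a YOLOv3-style detector might return
centroids = np.array([[1.0, 1.0], [1.5, 1.2], [8.0, 3.0], [4.0, 7.0], [4.3, 7.1]])

MIN_DIST = 2.0  # metres (assumed calibration)
labels = DBSCAN(eps=MIN_DIST, min_samples=2).fit_predict(centroids)

for cluster in set(labels) - {-1}:   # -1 marks isolated (compliant) people
    members = np.where(labels == cluster)[0]
    print(f"violation: persons {members.tolist()} are within {MIN_DIST} m")
```

    Using a clustering pass rather than checking all pairs naturally groups crowds of mutually close people into a single violation event.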
    Turbulent field fluctuations in gyrokinetic and fluid plasmas. (arXiv:2107.09744v2 [physics.plasm-ph] CROSS LISTED)
A key uncertainty in the design and development of magnetic confinement fusion energy reactors is predicting edge plasma turbulence. An essential step in overcoming this uncertainty is validating the accuracy of reduced turbulent transport models. Drift-reduced Braginskii two-fluid theory is one such set of reduced equations that has for decades simulated boundary plasmas in experiment, but significant questions exist regarding its predictive ability. To this end, using a novel physics-informed deep learning framework, we demonstrate the first direct quantitative comparisons of turbulent field fluctuations between electrostatic two-fluid theory and electromagnetic gyrokinetic modelling, with good overall agreement found in magnetized helical plasmas at low normalized pressure. This framework is readily adaptable to experimental and astrophysical environments, and presents a new technique for the numerical validation and discovery of reduced global plasma turbulence models.
    Federated Learning: Applications, Challenges and Future Scopes. (arXiv:2205.09513v1 [cs.LG])
Federated learning (FL) is a system in which a central aggregator coordinates the efforts of multiple clients to solve machine learning problems. This setting allows training data to remain dispersed in order to protect privacy. The purpose of this paper is to provide an overview of FL systems with a focus on healthcare. FL is evaluated here based on its frameworks, architectures, and applications. We show how FL addresses these privacy concerns via a shared global deep learning (DL) model coordinated by a central aggregator server. Inspired by the rapid growth of FL research, this paper examines recent developments and provides a comprehensive list of unresolved issues. In the context of FL, several privacy methods are described, including secure multiparty computation, homomorphic encryption, differential privacy, and stochastic gradient descent. Furthermore, a review of various FL classes, such as horizontal and vertical FL and federated transfer learning, is provided. FL has applications in wireless communication, service recommendation, intelligent medical diagnosis systems, and healthcare, all of which are discussed in this paper. We also present a thorough review of existing FL challenges, such as privacy protection, communication cost, system heterogeneity, and unreliable model upload, followed by future research directions.
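    A minimal sketch of the aggregation pattern the survey describes, using FedAvg (the canonical FL algorithm): each client updates a copy of the global model on its own data, and the server averages the updates weighted by client data size. Linear regression stands in for the deep model here.

```python
# One FedAvg training loop: local updates + data-weighted server averaging.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
clients = []
for n in (50, 80, 30):                      # three clients, unequal data sizes
    X = rng.normal(size=(n, 3))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=n)))

w_global = np.zeros(3)
for round_ in range(20):
    updates, sizes = [], []
    for X, y in clients:                    # local gradient steps per client
        w = w_global.copy()
        for _ in range(5):
            grad = 2 * X.T @ (X @ w - y) / len(X)
            w -= 0.05 * grad
        updates.append(w); sizes.append(len(X))
    w_global = np.average(updates, axis=0, weights=sizes)  # server aggregation

print(np.round(w_global, 3))   # approaches w_true; raw data never leaves clients
```

    The privacy methods the paper surveys (secure aggregation, DP noise, etc.) all slot into this loop, either on the client updates or at the averaging step.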
    Certified Error Control of Candidate Set Pruning for Two-Stage Relevance Ranking. (arXiv:2205.09638v1 [cs.IR])
In information retrieval (IR), candidate set pruning has been commonly used to speed up two-stage relevance ranking. However, such approaches lack accurate error control and often trade accuracy off against computational efficiency in an empirical fashion, without theoretical guarantees. In this paper, we propose the concept of certified error control of candidate set pruning for relevance ranking, which means that the test error after pruning is guaranteed to be controlled under a user-specified threshold with high probability. Both in-domain and out-of-domain experiments show that our method successfully prunes the first-stage retrieved candidate sets to improve the second-stage reranking speed while satisfying the pre-specified accuracy constraints in both settings. For example, on MS MARCO Passage v1, our method yields an average candidate set size of 27 out of 1,000, which increases the reranking speed by about 37 times, while the MRR@10 is greater than a pre-specified value of 0.38 with about 90% empirical coverage, and the empirical baselines fail to provide such a guarantee. Code and data are available at: https://github.com/alexlimh/CEC-Ranking.
    On the Convergence of Policy in Unregularized Policy Mirror Descent. (arXiv:2205.08176v2 [math.OC] UPDATED)
In this short note, we give a convergence analysis of the policy in the recent well-known policy mirror descent (PMD) method. We mainly consider the unregularized setting following [11] with generalized Bregman divergence. The difference is that we directly give convergence rates for the policy under generalized Bregman divergences. Our results are inspired by the convergence of the value function in previous works and constitute an extension of the study of policy mirror descent. Although some results have already appeared in previous work, we further show that a large class of Bregman divergences, such as the classical Euclidean distance, yields finite-step convergence to an optimal policy.
    Bi-LSTM Scoring Based Similarity Measurement with Agglomerative Hierarchical Clustering (AHC) for Speaker Diarization. (arXiv:2205.09709v1 [eess.AS])
The majority of speech signals across different scenarios are never available as well-defined audio segments containing only a single speaker. A typical conversation between two speakers consists of segments where their voices overlap, where they interrupt each other, or where they halt their speech between multiple sentences. Recent advancements in diarization technology leverage neural network-based approaches to improve multiple subsystems of the speaker diarization pipeline, including extracting segment-wise embedding features and detecting speaker changes during conversation. However, to identify speakers through clustering, models depend on methodologies such as PLDA to generate a similarity measure between two segments extracted from a given conversational audio. Since these algorithms ignore the temporal structure of conversations, they tend to achieve a higher Diarization Error Rate (DER), leading to misdetections in terms of both speaker and change identification. Therefore, to compare the similarity of two speech segments both independently and sequentially, we propose a Bi-directional Long Short-term Memory network for estimating the elements of the similarity matrix. Once the similarity matrix is generated, Agglomerative Hierarchical Clustering (AHC) is applied to identify speaker segments based on thresholding. To evaluate the performance, the Diarization Error Rate (DER) metric is used. The proposed model achieves a low DER of 34.80% on a test set of audio samples derived from the ICSI Meeting Corpus, compared to the traditional PLDA-based similarity measurement mechanism, which achieved a DER of 39.90%.
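    The final clustering stage is easy to sketch: given a segment-by-segment similarity matrix (mocked below; in the paper it comes from the Bi-LSTM scorer), convert it to distances and apply agglomerative clustering with a threshold to obtain speaker labels. This assumes a recent scikit-learn where the precomputed option is named `metric`.

```python
# Sketch of thresholded AHC over a precomputed similarity matrix.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Mock similarity matrix for 6 segments from 2 speakers (values in [0, 1]).
S = np.array([
    [1.0, 0.9, 0.8, 0.2, 0.1, 0.2],
    [0.9, 1.0, 0.85, 0.15, 0.2, 0.1],
    [0.8, 0.85, 1.0, 0.2, 0.15, 0.2],
    [0.2, 0.15, 0.2, 1.0, 0.9, 0.88],
    [0.1, 0.2, 0.15, 0.9, 1.0, 0.92],
    [0.2, 0.1, 0.2, 0.88, 0.92, 1.0],
])
D = 1.0 - S                                   # similarity -> distance

ahc = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                              metric="precomputed", linkage="average")
print(ahc.fit_predict(D))   # e.g. [0 0 0 1 1 1]: one speaker label per segment
```

    Leaving `n_clusters` unset and thresholding instead means the number of speakers need not be known in advance, which matches the diarization setting.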
    Can language models learn from explanations in context?. (arXiv:2204.02329v2 [cs.CL] UPDATED)
Large language models can perform new tasks by adapting to a few in-context examples. For humans, rapid learning from examples can benefit from explanations that connect examples to task principles. We therefore investigate whether explanations of few-shot examples can allow language models to adapt more effectively. We annotate a set of 40 challenging tasks from BIG-Bench with explanations of answers to a small subset of questions, as well as a variety of matched control explanations. We evaluate the effects of various zero-shot and few-shot prompts that include different types of explanations, instructions, and controls on the performance of a range of large language models. We analyze these results using statistical multilevel modeling techniques that account for the nested dependencies among conditions, tasks, prompts, and models. We find that explanations of examples can improve performance. Adding untuned explanations to a few-shot prompt offers a modest improvement in performance: about 1/3 the effect size of adding few-shot examples, but twice the effect size of task instructions. We then show that explanations tuned for performance on a small validation set offer substantially larger benefits; building a prompt by selecting examples and explanations together substantially improves performance over selecting examples alone. Hand-tuning explanations can substantially improve performance on challenging tasks. Furthermore, even untuned explanations outperform carefully matched controls, suggesting that the benefits are due to the link between an example and its explanation, rather than lower-level features of the language used. However, only large models benefit from explanations. In summary, explanations can support the in-context learning abilities of large language models on challenging tasks.
    Diagonal State Spaces are as Effective as Structured State Spaces. (arXiv:2203.14343v3 [cs.LG] UPDATED)
Modeling long-range dependencies in sequential data is a fundamental step towards attaining human-level performance in many modalities such as text, vision, audio and video. While attention-based models are a popular and effective choice for modeling short-range interactions, their performance on tasks requiring long-range reasoning has been largely inadequate. In an exciting result, Gu et al. (ICLR 2022) proposed the $\textit{Structured State Space}$ (S4) architecture, delivering large gains over state-of-the-art models on several long-range tasks across various modalities. The core proposition of S4 is the parameterization of state matrices via a diagonal plus low-rank structure, allowing efficient computation. In this work, we show that one can match the performance of S4 even without the low-rank correction, thus assuming the state matrices to be diagonal. Our $\textit{Diagonal State Space}$ (DSS) model matches the performance of S4 on Long Range Arena tasks and on speech classification on the Speech Commands dataset, while being conceptually simpler and more straightforward to implement.
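    Why a diagonal state matrix makes the computation cheap can be shown in a few lines: with $A = \mathrm{diag}(\lambda)$, the SSM convolution kernel is a sum of complex exponentials and can be materialized with one Vandermonde-style outer product. The parameters below are random placeholders standing in for trained DSS weights, and the discretization is a simplified sketch rather than the paper's exact parameterization.

```python
# Schematic diagonal-SSM convolution kernel: kernel[l] = Re(sum_i C_i B_i e^{lambda_i dt l}).
import numpy as np

rng = np.random.default_rng(0)
N, L, dt = 16, 256, 1e-2                      # state size, seq length, step
lam = -np.abs(rng.normal(size=N)) + 1j * rng.normal(size=N)   # stable poles
B, C = rng.normal(size=N), rng.normal(size=N)

K = np.exp(lam[:, None] * dt * np.arange(L)[None, :])   # (N, L) Vandermonde
kernel = np.real((C * B) @ K)                            # (L,) kernel

u = rng.normal(size=L)                 # input sequence
y = np.convolve(u, kernel)[:L]         # causal convolution output
print(kernel.shape, y.shape)
```

    The full kernel costs O(NL) to build, with no matrix powers or low-rank corrections, which is the computational point the DSS result makes.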
    TourBERT: A pretrained language model for the tourism industry. (arXiv:2201.07449v3 [cs.CL] UPDATED)
The Bidirectional Encoder Representations from Transformers (BERT) model is currently one of the most important state-of-the-art models for natural language processing. However, it has also been shown that for domain-specific tasks it is helpful to pretrain BERT on a domain-specific corpus. In this paper, we present TourBERT, a pretrained language model for tourism. We describe how TourBERT was developed and evaluated. The evaluations show that TourBERT outperforms BERT on all tourism-specific tasks.
    Generalization Analysis of Message Passing Neural Networks on Large Random Graphs. (arXiv:2202.00645v4 [cs.LG] UPDATED)
Message passing neural networks (MPNN) have seen a steep rise in popularity since their introduction as generalizations of convolutional neural networks to graph-structured data, and are now considered state-of-the-art tools for solving a large variety of graph-focused problems. We study the generalization error of MPNNs in graph classification and regression. We assume that graphs of different classes are sampled from different random graph models. We show that, when training an MPNN on a dataset sampled from such a distribution, the generalization gap increases with the complexity of the MPNN and decreases not only with the number of training samples but also with the average number of nodes in the graphs. This shows how an MPNN with high complexity can generalize from a small dataset of graphs, as long as the graphs are large. The generalization bound is derived from a uniform convergence result that shows that any MPNN, applied on a graph, approximates the MPNN applied on the geometric model that the graph discretizes.
    Learning Graph Structure from Convolutional Mixtures. (arXiv:2205.09575v1 [cs.LG])
    Machine learning frameworks such as graph neural networks typically rely on a given, fixed graph to exploit relational inductive biases and thus effectively learn from network data. However, when said graphs are (partially) unobserved, noisy, or dynamic, the problem of inferring graph structure from data becomes relevant. In this paper, we postulate a graph convolutional relationship between the observed and latent graphs, and formulate the graph learning task as a network inverse (deconvolution) problem. In lieu of eigendecomposition-based spectral methods or iterative optimization solutions, we unroll and truncate proximal gradient iterations to arrive at a parameterized neural network architecture that we call a Graph Deconvolution Network (GDN). GDNs can learn a distribution of graphs in a supervised fashion, perform link prediction or edge-weight regression tasks by adapting the loss function, and they are inherently inductive. We corroborate GDN's superior graph recovery performance and its generalization to larger graphs using synthetic data in supervised settings. Furthermore, we demonstrate the robustness and representation power of GDNs on real world neuroimaging and social network datasets.
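    The unrolling construction behind GDNs can be illustrated generically: truncate proximal-gradient iterations for a sparse inverse problem and make the step size and threshold learnable per layer. The sketch below is unrolled ISTA for a generic problem $y = Ax + \text{noise}$, chosen only to show the construction; the actual GDN unrolls iterations for graph deconvolution, not this problem.

```python
# Generic unrolled proximal-gradient (ISTA) network: each "layer" is one
# truncated iteration with its own learnable step size and soft threshold.
import torch, torch.nn as nn

class UnrolledISTA(nn.Module):
    def __init__(self, A, n_layers=10):
        super().__init__()
        self.A = A
        self.alpha = nn.Parameter(torch.full((n_layers,), 0.1))   # step sizes
        self.theta = nn.Parameter(torch.full((n_layers,), 0.05))  # thresholds

    def forward(self, y):
        x = torch.zeros(self.A.shape[1])
        for a, t in zip(self.alpha, self.theta):
            v = x - a * (self.A.T @ (self.A @ x - y))   # gradient step on data fit
            x = torch.relu(v - t) - torch.relu(-v - t)  # soft-thresholding prox
        return x

torch.manual_seed(0)
A = torch.randn(30, 50) / 30**0.5
x_true = torch.zeros(50); x_true[[3, 17, 41]] = torch.tensor([1.0, -2.0, 1.5])
y = A @ x_true
print(UnrolledISTA(A)(y)[[3, 17, 41]])   # recovers the support approximately
```

    Training the per-layer parameters end to end on example (observation, graph) pairs is what turns the truncated solver into the supervised, inductive network the abstract describes.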
    Cracking White-box DNN Watermarks via Invariant Neuron Transforms. (arXiv:2205.00199v2 [cs.CR] UPDATED)
Recently, how to protect the Intellectual Property (IP) of deep neural networks (DNN) has become a major concern for the AI industry. To combat potential model piracy, recent works explore various watermarking strategies to embed secret identity messages into the prediction behaviors or the internals (e.g., weights and neuron activations) of the target model. Because it sacrifices less functionality and exploits more knowledge of the target model, the latter branch of watermarking schemes (i.e., white-box model watermarking) is claimed to be accurate, credible and secure against most known watermark removal attacks, with growing research efforts and industry applications. In this paper, we present the first effective removal attack, which cracks almost all the existing white-box watermarking schemes with provably no performance overhead and no required prior knowledge. By analyzing these IP protection mechanisms at the granularity of neurons, we for the first time discover their common dependence on a set of fragile features of a local neuron group, all of which can be arbitrarily tampered with by our proposed chain of invariant neuron transforms. On $9$ state-of-the-art white-box watermarking schemes and a broad set of industry-level DNN architectures, our attack for the first time reduces the embedded identity message in the protected models to near random. Meanwhile, unlike known removal attacks, our attack requires no prior knowledge of the training data distribution or the adopted watermark algorithms, and leaves model functionality intact.
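    The fragility such an attack exploits can be demonstrated in miniature: a ReLU network's function is invariant to positively scaling and permuting hidden neurons, yet these transforms scramble the neuron-level statistics that white-box watermarks read. A two-layer numpy demo of one such invariant transform (not the paper's full transform chain):

```python
# Function-preserving scale-and-permute transform on a ReLU hidden layer.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 10)), rng.normal(size=64)
W2, b2 = rng.normal(size=(1, 64)), rng.normal(size=1)

def net(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

P = np.eye(64)[rng.permutation(64)]   # random permutation of neurons
d = rng.uniform(0.1, 10.0, size=64)   # positive per-neuron scales
D = np.diag(d)

W1t, b1t = P @ D @ W1, P @ D @ b1     # transform layer 1
W2t = W2 @ np.linalg.inv(D) @ P.T     # compensate exactly in layer 2

x = rng.normal(size=10)
print(np.allclose(net(x, W1, b1, W2, b2), net(x, W1t, b1t, W2t, b2)))  # True
```

    Since ReLU commutes with positive scaling, the compensation in the second layer cancels the transform exactly, so the model's outputs are untouched while individual weights and activations change arbitrarily.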
    Understanding Gradient Descent on Edge of Stability in Deep Learning. (arXiv:2205.09745v1 [cs.LG])
Deep learning experiments in Cohen et al. (2021) using deterministic Gradient Descent (GD) revealed an {\em Edge of Stability (EoS)} phase when the learning rate (LR) and sharpness (\emph{i.e.}, the largest eigenvalue of the Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/\text{LR}$ and the loss goes up and down across iterations, yet still with an overall downward trend. The current paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to the non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss. This is in contrast to many previous results about implicit bias that rely on infinitesimal updates or noise in the gradient. Formally, for any smooth function $L$ with a certain regularity condition, this effect is demonstrated for (1) {\em Normalized GD}, i.e., GD with a varying LR $\eta_t = \eta / \|\nabla L(x(t))\|$ and loss $L$; (2) GD with constant LR and loss $\sqrt{L}$. Both provably enter the Edge of Stability, with the associated flow on the manifold minimizing $\lambda_{\max}(\nabla^2 L)$. The above theoretical results have been corroborated by an experimental study.
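    The Normalized GD update analyzed in case (1) is simple enough to state in code: the step length is fixed at $\eta$ regardless of the gradient's magnitude. The toy quadratic below is purely illustrative of the update rule, not of the paper's theorems.

```python
# Normalized GD: eta_t = eta / ||grad L(x_t)||, so every step has length eta.
import numpy as np

H = np.diag([10.0, 1.0])          # sharpness = largest Hessian eigenvalue = 10

def grad(x):                       # gradient of L(x) = 0.5 * x^T H x
    return H @ x

x, eta = np.array([1.0, 1.0]), 0.1
for t in range(100):
    g = grad(x)
    x = x - eta * g / np.linalg.norm(g)   # fixed-length step
print(x, 0.5 * x @ H @ x)   # hovers near the minimum instead of converging
```

    Because the step length never shrinks, the iterates cannot settle into a sharp minimum and instead bounce around it, the non-monotone behaviour characteristic of the EoS regime.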
    Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers. (arXiv:2205.09546v1 [stat.ML])
In this work, we provide an exact likelihood alternative to the variational training of generative autoencoders. We show that VAE-style autoencoders can be constructed using invertible layers, which offer a tractable exact likelihood without the need for any regularization terms. This is achieved while leaving complete freedom in the choice of encoder, decoder and prior architectures, making our approach a drop-in replacement for the training of existing VAEs and VAE-style models. We refer to the resulting models as Autoencoders within Flows (AEF), since the encoder, decoder and prior are defined as individual layers of an overall invertible architecture. We show that the approach results in strikingly higher performance than architecturally equivalent VAEs in terms of log-likelihood, sample quality and denoising performance. In a broad sense, the main ambition of this work is to close the gap between the normalizing flow and autoencoder literature under the common framework of invertibility and exact maximum likelihood.
    A Simple Yet Effective SVD-GCN for Directed Graphs. (arXiv:2205.09335v1 [cs.LG])
In this paper, we propose a simple yet effective graph neural network for directed graphs (digraphs) based on the classic Singular Value Decomposition (SVD), named SVD-GCN. The new graph neural network is built upon the graph SVD-framelet to better decompose graph signals on the SVD ``frequency'' bands. Further, the new framelet SVD-GCN is scaled up to larger graphs via Chebyshev polynomial approximation. Through empirical experiments conducted on several node classification datasets, we have found that SVD-GCN achieves remarkable improvements in a variety of graph node learning tasks, outperforming GCN and many other state-of-the-art graph neural networks for digraphs. Moreover, we empirically demonstrate that SVD-GCN has great denoising capability and robustness to high-level graph data attacks. The theoretical and experimental results show that SVD-GCN is effective on a variety of graph datasets while maintaining stable, and even better, performance compared to the state of the art.
    Bayesian Negative Sampling for Recommendation. (arXiv:2204.06520v2 [cs.IR] UPDATED)
How to sample high-quality negative instances from unlabeled data, i.e., negative sampling, is important for training implicit collaborative filtering and contrastive learning models. Although previous studies have proposed approaches to sample informative instances, little has been done to discriminate false negatives from true negatives for unbiased negative sampling. On the basis of our order-relation analysis of negatives' scores, we first derive the class-conditional density of true negatives and that of false negatives. We next design a Bayesian classifier for negative classification, from which we define a model-agnostic posterior probability estimate of an instance being a true negative as a quantitative negative signal measure. We also propose a Bayesian optimal sampling rule to sample high-quality negatives. The proposed Bayesian Negative Sampling (BNS) algorithm has a linear time complexity. Experimental studies validate the superiority of BNS over its peers in terms of better sampling quality and better recommendation performance.
    PSI Draft Specification. (arXiv:2205.09488v1 [cs.SE])
    This document presents the draft specification for delivering machine learning services over HTTP, developed as part of the Protocols and Structures for Inference project, which concluded in 2013. It presents the motivation for providing machine learning as a service, followed by a description of the essential and optional components of such a service.
    Improving VAE based molecular representations for compound property prediction. (arXiv:2201.04929v3 [cs.LG] UPDATED)
Collecting labeled data for many important tasks in chemoinformatics is time consuming and requires expensive experiments. In recent years, machine learning has been used to learn rich representations of molecules using large-scale unlabeled molecular datasets and transfer the knowledge to solve more challenging tasks with limited datasets. Variational autoencoders are one of the tools that have been proposed to perform the transfer for both chemical property prediction and molecular generation tasks. In this work we propose a simple method to improve the chemical property prediction performance of machine learning models by incorporating additional information on correlated molecular descriptors into the representations learned by variational autoencoders. We verify the method on three property prediction tasks. We explore the impact of the number of incorporated descriptors, the correlation between the descriptors and the target properties, the sizes of the datasets, etc. Finally, we show the relation between the performance of property prediction models and the distance between the property prediction dataset and the larger unlabeled dataset in the representation space.
    Morse-STF: Improved Protocols for Privacy-Preserving Machine Learning. (arXiv:2109.11726v2 [cs.CR] UPDATED)
Secure multi-party computation enables multiple mutually distrusting parties to perform computations on data without revealing the data itself, and has become one of the core technologies behind privacy-preserving machine learning. In this work, we present several improved privacy-preserving protocols for both linear and non-linear layers in machine learning. For linear layers, we present an extended Beaver triple protocol for bilinear maps that significantly reduces the communication of convolution layers. For non-linear layers, we introduce novel protocols for computing the sigmoid and softmax functions. Both functions are essential building blocks for machine learning training of classification tasks. Our protocols are both more scalable and more robust than prior constructions, and improve runtime performance by 3-17x. Finally, we introduce Morse-STF, an end-to-end privacy-preserving system for machine learning training that leverages all these improved protocols. Our system achieves a 1.8x speedup on logistic regression and a 3.9-4.9x speedup on convolutional neural networks compared to prior state-of-the-art systems.
    Differential Privacy: What is all the noise about?. (arXiv:2205.09453v1 [cs.CR])
Differential Privacy (DP) is a formal definition of privacy that provides rigorous guarantees against risks of privacy breaches during data processing. It makes no assumptions about the knowledge or computational power of adversaries, and provides an interpretable, quantifiable and composable formalism. DP has been actively researched during the last 15 years, but it is still hard to master for many Machine Learning (ML) practitioners. This paper aims to provide an overview of the most important ideas, concepts and uses of DP in ML, with special focus on its intersection with Federated Learning (FL).
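    The canonical building block such overviews start from is the Laplace mechanism: adding noise with scale equal to the query's sensitivity divided by epsilon makes the query epsilon-differentially private. For a counting query, the sensitivity is 1, so the sketch is a one-liner.

```python
# Laplace mechanism for an epsilon-DP counting query (sensitivity = 1).
import numpy as np

rng = np.random.default_rng(0)

def dp_count(data, predicate, epsilon):
    true_count = sum(predicate(x) for x in data)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)   # sensitivity / epsilon
    return true_count + noise

ages = rng.integers(18, 90, size=1000)
print(dp_count(ages, lambda a: a > 65, epsilon=0.5))   # noisier, stronger privacy
print(dp_count(ages, lambda a: a > 65, epsilon=5.0))   # less noise, weaker privacy
```

    The epsilon parameter makes the privacy-utility trade-off explicit and, via composition, budgets how many such queries (or SGD steps, in DP training) can be answered.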
    An Invariant Matching Property for Distribution Generalization under Intervened Response. (arXiv:2205.09162v1 [stat.ME])
The task of distribution generalization concerns making reliable predictions of a response in unseen environments. Structural causal models have been shown to be useful for modeling distribution changes through interventions. Motivated by the fundamental invariance principle, it is often assumed that the conditional distribution of the response given its predictors remains the same across environments. However, this assumption can be violated in practical settings when the response is intervened on. In this work, we investigate a class of models with an intervened response. We identify a novel form of invariance by incorporating the estimates of certain features as additional predictors. Effectively, we show this invariance is equivalent to having a deterministic linear matching that makes generalization possible. We provide an explicit characterization of the linear matching and present our simulation results under various intervention settings.
    Cross-lingual Transfer of Monolingual Models. (arXiv:2109.07348v2 [cs.CL] UPDATED)
    Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the effects of such transfer from four different languages to English. Our experimental results on GLUE show that the transferred models outperform the native English model independently of the source language. After probing the English linguistic knowledge encoded in the representations before and after transfer, we find that semantic information is retained from the source language, while syntactic information is learned during transfer. Additionally, the results of evaluating the transferred models in source language tasks reveal that their performance in the source domain deteriorates after transfer.
    Neural ODE Control for Trajectory Approximation of Continuity Equation. (arXiv:2205.09241v1 [math.OC])
We consider the controllability problem for the continuity equation, corresponding to neural ordinary differential equations (ODEs), which describes how a probability measure is pushed forward by the flow. We show that the controlled continuity equation has very strong controllability properties. Particularly, a given solution of the continuity equation corresponding to a bounded Lipschitz vector field defines a trajectory on the set of probability measures. For this trajectory, we show that there exist piecewise constant training weights for a neural ODE such that the solution of the continuity equation corresponding to the neural ODE is arbitrarily close to it. As a corollary to this result, we establish that the continuity equation of the neural ODE is approximately controllable on the set of compactly supported probability measures that are absolutely continuous with respect to the Lebesgue measure.
    Neighborhood Mixup Experience Replay: Local Convex Interpolation for Improved Sample Efficiency in Continuous Control Tasks. (arXiv:2205.09117v1 [cs.LG])
Experience replay plays a crucial role in improving the sample efficiency of deep reinforcement learning agents. Recent advances in experience replay propose using Mixup (Zhang et al., 2018) to further improve sample efficiency via synthetic sample generation. We build upon this technique with Neighborhood Mixup Experience Replay (NMER), a geometrically-grounded replay buffer that interpolates transitions with their closest neighbors in state-action space. NMER preserves a locally linear approximation of the transition manifold by only applying Mixup between transitions with vicinal state-action features. Under NMER, a given transition's set of state-action neighbors is dynamic and episode-agnostic, in turn encouraging greater policy generalizability via inter-episode interpolation. We combine our approach with recent off-policy deep reinforcement learning algorithms and evaluate on continuous control environments. We observe that NMER improves sample efficiency by an average of 94% (TD3) and 29% (SAC) over baseline replay buffers, enabling agents to effectively recombine previous experiences and learn from limited data.
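    The core sampling step can be sketched in a few lines: interpolate a sampled transition with its nearest neighbor in state-action space, using a Mixup coefficient drawn from a Beta distribution. The buffer contents, dimensions and alpha below are toy assumptions, not the paper's configuration.

```python
# Sketch of neighborhood Mixup over a replay buffer of transitions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
S, A = 4, 2                                        # state / action dims
buf = {k: rng.normal(size=(500, d)) for k, d in
       [("s", S), ("a", A), ("r", 1), ("s2", S)]}  # mocked replay buffer

nn_index = NearestNeighbors(n_neighbors=2).fit(np.hstack([buf["s"], buf["a"]]))

def nmer_sample():
    i = rng.integers(500)
    query = np.hstack([buf["s"][i], buf["a"][i]])[None]
    j = nn_index.kneighbors(query, return_distance=False)[0, 1]  # skip self
    lam = rng.beta(0.75, 0.75)                     # Mixup coefficient (assumed)
    return {k: lam * buf[k][i] + (1 - lam) * buf[k][j] for k in buf}

print({k: v.round(2) for k, v in nmer_sample().items()})
```

    Restricting interpolation to nearest neighbors is what keeps the synthetic transitions close to the true transition manifold, unlike Mixup between arbitrary buffer entries.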
    Relational representation learning with spike trains. (arXiv:2205.09140v1 [cs.NE])
Relational representation learning has lately received an increase in interest due to its flexibility in modeling a variety of systems like interacting particles, materials and industrial projects for, e.g., the design of spacecraft. A prominent class of methods for dealing with relational data is knowledge graph embedding algorithms, where entities and relations of a knowledge graph are mapped to a low-dimensional vector space while preserving its semantic structure. Recently, a graph embedding method has been proposed that maps graph elements to the temporal domain of spiking neural networks. However, it relies on encoding graph elements through populations of neurons that only spike once. Here, we present a model that allows us to learn spike train-based embeddings of knowledge graphs, requiring only one neuron per graph element by fully utilizing the temporal domain of spike patterns. This coding scheme can be implemented with arbitrary spiking neuron models as long as gradients with respect to spike times can be calculated, which we demonstrate for the integrate-and-fire neuron model. In general, the presented results show how relational knowledge can be integrated into spike-based systems, opening up the possibility of merging event-based computing and relational data to build powerful and energy-efficient artificial intelligence applications and reasoning systems.
    Accelerated Training of Physics Informed Neural Networks (PINNs) using Meshless Discretizations. (arXiv:2205.09332v1 [cs.LG])
    We present a new technique for the accelerated training of physics-informed neural networks (PINNs): discretely-trained PINNs (DT-PINNs). The repeated computation of partial derivative terms in the PINN loss functions via automatic differentiation during training is known to be computationally expensive, especially for higher-order derivatives. DT-PINNs are trained by replacing these exact spatial derivatives with high-order accurate numerical discretizations computed using meshless radial basis function-finite differences (RBF-FD) and applied via sparse-matrix vector multiplication. The use of RBF-FD allows for DT-PINNs to be trained even on point cloud samples placed on irregular domain geometries. Additionally, though traditional PINNs (vanilla-PINNs) are typically stored and trained in 32-bit floating-point (fp32) on the GPU, we show that for DT-PINNs, using fp64 on the GPU leads to significantly faster training times than fp32 vanilla-PINNs with comparable accuracy. We demonstrate the efficiency and accuracy of DT-PINNs via a series of experiments. First, we explore the effect of network depth on both numerical and automatic differentiation of a neural network with random weights and show that RBF-FD approximations of third-order accuracy and above are more efficient while being sufficiently accurate. We then compare the DT-PINNs to vanilla-PINNs on both linear and nonlinear Poisson equations and show that DT-PINNs achieve similar losses with 2-4x faster training times on a consumer GPU. Finally, we also demonstrate that similar results can be obtained for the PINN solution to the heat equation (a space-time problem) by discretizing the spatial derivatives using RBF-FD and using automatic differentiation for the temporal derivative. Our results show that fp64 DT-PINNs offer a superior cost-accuracy profile to fp32 vanilla-PINNs.
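    The substitution at the heart of DT-PINNs is easy to show schematically: instead of autodiffing the network twice for a Laplacian, apply a precomputed sparse differentiation matrix to the network's outputs at the collocation points. In the sketch below, a 1D finite-difference Laplacian on a regular grid stands in for the RBF-FD operator on scattered points, and the matrix is densified for brevity where a real implementation would use sparse matvecs.

```python
# DT-PINN-style training: spatial derivatives via a precomputed matrix.
import torch
import numpy as np
from scipy.sparse import diags

n, h = 100, 1.0 / 101
x = torch.linspace(h, 1 - h, n).unsqueeze(1)          # interior points
L_np = diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
L = torch.tensor(L_np.toarray(), dtype=torch.float32) # dense for brevity

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
f = -np.pi**2 * torch.sin(np.pi * x)                  # Poisson RHS: u'' = f

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    u = net(x)
    residual = L @ u - f                              # no autodiff in space
    boundary = net(torch.tensor([[0.0], [1.0]]))      # Dirichlet u(0)=u(1)=0
    loss = residual.pow(2).mean() + boundary.pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))   # the network approaches u(x) = sin(pi * x)
```

    Backpropagation still flows through the matrix product to the network weights, so training proceeds as usual; only the repeated, expensive spatial autodiff is replaced.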
    Dataset Pruning: Reducing Training Data by Examining Generalization Influence. (arXiv:2205.09329v1 [cs.LG])
The great success of deep learning heavily relies on increasingly larger training data, which comes at the price of huge computational and infrastructural costs. This poses crucial questions: do all training data contribute to the model's performance? How much does each individual training sample or sub-training-set affect the model's generalization, and how can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3% test accuracy decrease, which is superior to previous score-based sample selection methods.
    GitRanking: A Ranking of GitHub Topics for Software Classification using Active Sampling. (arXiv:2205.09379v1 [cs.SE])
GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are not labeled or are inadequately labeled, making it harder for users to find relevant projects. There have been various proposals for software application domain classification over the past years. However, these approaches lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification ranked into discrete levels based on how general or specific each topic's meaning is. We collected 121K topics from GitHub and considered the $60\%$ most frequent ones for the ranking. GitRanking 1) uses active sampling to ensure a minimal number of required annotations; and 2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity. This makes finding and discovering their projects more challenging for other users. Furthermore, we show that GitRanking can effectively rank terms according to their general or specific meaning. This ranking would be an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked with a minimum number of annotations ($\sim$ 15). This paper is the first collective attempt to build a ground-up taxonomy of software domains.
    PredictionNet: Real-Time Joint Probabilistic Traffic Prediction for Planning, Control, and Simulation. (arXiv:2109.11094v2 [cs.RO] UPDATED)
    Predicting the future motion of traffic agents is crucial for safe and efficient autonomous driving. To this end, we present PredictionNet, a deep neural network (DNN) that predicts the motion of all surrounding traffic agents together with the ego-vehicle's motion. All predictions are probabilistic and are represented in a simple top-down rasterization that allows an arbitrary number of agents. Conditioned on a multi-layer map with lane information, the network outputs future positions, velocities, and backtrace vectors jointly for all agents including the ego-vehicle in a single pass. Trajectories are then extracted from the output. The network can be used to simulate realistic traffic, and it produces competitive results on popular benchmarks. More importantly, it has been used to successfully control a real-world vehicle for hundreds of kilometers, by combining it with a motion planning/control subsystem. The network runs faster than real-time on an embedded GPU, and the system shows good generalization (across sensory modalities and locations) due to the choice of input representation. Furthermore, we demonstrate that by extending the DNN with reinforcement learning (RL), it can better handle rare or unsafe events like aggressive maneuvers and crashes.
    Efficient and Modular Implicit Differentiation. (arXiv:2105.15183v4 [cs.LG] UPDATED)
Automatic differentiation (autodiff) has revolutionized machine learning. It allows one to express complex computations by composing elementary ones in creative ways, and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention, with applications such as optimization layers and bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation has remained difficult to use for practitioners, as it often required case-by-case tedious mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and the implicit function theorem to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient as it can be added on top of any state-of-the-art solver, and modular as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow us to recover many existing implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.
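    The implicit function theorem trick the paper automates can be done by hand on ridge regression, which makes the mechanism concrete: $x^*(\theta)$ solves $F(x, \theta) = (A^\top A + \theta I)x - A^\top b = 0$, so $dx^*/d\theta = -(\partial F/\partial x)^{-1}\, \partial F/\partial \theta$, with no need to differentiate through the solver itself.

```python
# Implicit differentiation of a ridge solution w.r.t. its regularizer theta.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
theta = 0.3

def solve(theta):
    return np.linalg.solve(A.T @ A + theta * np.eye(5), A.T @ b)

x_star = solve(theta)
dF_dx = A.T @ A + theta * np.eye(5)        # Jacobian of F in x
dF_dtheta = x_star                          # Jacobian of F in theta
implicit_grad = -np.linalg.solve(dF_dx, dF_dtheta)

eps = 1e-6                                  # finite-difference check
fd_grad = (solve(theta + eps) - solve(theta - eps)) / (2 * eps)
print(np.allclose(implicit_grad, fd_grad, atol=1e-6))   # True
```

    The paper's contribution is to generate exactly this linear-solve step automatically from a user-written optimality condition $F$, for any solver producing $x^*$.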
    SEMI: Self-supervised Exploration via Multisensory Incongruity. (arXiv:2009.12494v2 [cs.LG] UPDATED)
Efficient exploration is a long-standing problem in reinforcement learning, since extrinsic rewards are usually sparse or missing. A popular solution to this issue is to feed an agent novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy that incentivizes the agent to maximize a new novelty signal: multisensory incongruity, which can be measured in two aspects, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of an agent's policies under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, the error of which is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration. The variance of actions is further used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and improves the sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments including object manipulation and audio-visual games.
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v1 [cs.LG])
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with respect to a group $G$, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of "$G$-invariant neural architecture design": What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or "shallow" neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$. The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons. The classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. Based on a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. We draw the network morphisms between the enumerated architectures that can be leveraged during neural architecture search (NAS). Finally, we prove that architectures corresponding to inequivalent cohomology classes in a given cohomology ring coincide in function space only when their weight matrices are zero, and we discuss the implications of this in the context of NAS.
    Spurious Local Minima of Deep ReLU Neural Networks in the Neural Tangent Kernel Regime. (arXiv:1806.04884v3 [stat.ML] UPDATED)
    In this paper, we theoretically prove that deep ReLU neural networks do not get trapped in spurious local minima of the loss landscape under the Neural Tangent Kernel (NTK) regime, that is, in the gradient descent training dynamics of deep ReLU neural networks whose parameters are initialized by a normal distribution, in the limit as the widths of the hidden layers tend to infinity.
    Hybrid Intelligent Testing in Simulation-Based Verification. (arXiv:2205.09552v1 [cs.AR])
    Efficient and effective testing for simulation-based hardware verification is challenging. Using constrained random test generation, several million tests may be required to achieve coverage goals. The vast majority of tests do not contribute to coverage progress, yet they consume verification resources. In this paper, we propose a hybrid intelligent testing approach combining two methods that have previously been treated separately, namely Coverage-Directed Test Selection and Novelty-Driven Verification. Coverage-Directed Test Selection learns from coverage feedback to bias testing towards the most effective tests. Novelty-Driven Verification learns to identify and simulate stimuli that differ from previous stimuli, thereby reducing the number of simulations and increasing testing efficiency. We discuss the strengths and limitations of each method, and we show how our approach addresses each method's limitations, leading to hardware testing that is both efficient and effective.
    Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks. (arXiv:2205.09653v1 [stat.ML])
    We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel, and consequently output predictions. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to self-consistently solve for the kernel order parameters. We provide comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading order perturbation theory, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description. Lastly, we provide experiments in more realistic settings which demonstrate that the loss and kernel dynamics of CNNs at fixed feature learning strength are preserved across different widths on a CIFAR classification task.
    What Is Fairness? Implications For FairML. (arXiv:2205.09622v1 [cs.LG])
    A growing body of literature in fairness-aware ML (fairML) aspires to mitigate machine learning (ML)-related unfairness in automated decision making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods that ensure that trained ML models achieve low values in those measures. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a considerable gap between centuries of philosophical discussion and recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the evaluation of ML models in ADM systems. We derive that fairness problems can already arise without the presence of protected attributes, pointing out that fairness and predictive performance are not irreconcilable counterparts, but rather that the latter is necessary to achieve the former. Moreover, we argue why and how causal considerations are necessary when assessing fairness in the presence of protected attributes. Eventually, we achieve greater linguistic clarity for the discussion of fairML by clearly assigning responsibilities to stakeholders inside and outside ML.
    Learning Energy Networks with Generalized Fenchel-Young Losses. (arXiv:2205.09589v1 [cs.LG])
    Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.
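    For intuition, here is the classical Fenchel-Young loss with the usual bilinear pairing (the special case the paper generalizes), where $\Omega$ is the Shannon negentropy so that the loss reduces to cross-entropy; this sketch is our own illustration:

        import numpy as np
        from scipy.special import logsumexp

        # L(theta, y) = Omega*(theta) + Omega(y) - <theta, y>, with
        # Omega*(theta) = logsumexp(theta) for Shannon negentropy on the simplex.
        def fy_loss(theta, y):
            omega_y = np.sum(y * np.log(np.clip(y, 1e-12, 1.0)))
            return logsumexp(theta) + omega_y - theta @ y

        # The gradient w.r.t. theta is softmax(theta) - y: the loss can be
        # trained without differentiating through an argmin/argmax.
        def fy_grad(theta, y):
            p = np.exp(theta - logsumexp(theta))
            return p - y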
    Towards a Theory of Faithfulness: Faithful Explanations of Differentiable Classifiers over Continuous Data. (arXiv:2205.09620v1 [cs.LG])
    There is broad agreement in the literature that explanation methods should be faithful to the model that they explain, but faithfulness remains a rather vague term. We revisit faithfulness in the context of continuous data and propose two formal definitions of faithfulness for feature attribution methods. Qualitative faithfulness demands that scores reflect the true qualitative effect (positive vs. negative) of the feature on the model, and quantitative faithfulness demands that the magnitude of scores reflects the true quantitative effect. We discuss under which conditions, and to what extent (locally vs. globally), these requirements can be satisfied. As an application of the conceptual idea, we look at differentiable classifiers over continuous data and characterize Gradient-scores as follows: every qualitatively faithful feature attribution method is qualitatively equivalent to Gradient-scores. Furthermore, if an attribution method is quantitatively faithful in the sense that changes of the output of the classifier are proportional to the scores of features, then it is either equivalent to gradient-scoring or it is based on an inferior approximation of the classifier. To illustrate the practical relevance of the theory, we experimentally demonstrate that popular attribution methods can fail to give faithful explanations in the setting where the data is continuous and the classifier differentiable.
    Posterior Matching for Arbitrary Conditioning. (arXiv:2201.12414v3 [cs.LG] UPDATED)
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underlie some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation only focuses on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables Variational Autoencoders (VAEs) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching is comparable or superior to current state-of-the-art methods for a variety of tasks with an assortment of VAEs (e.g.~discrete, hierarchical, VaDE).
    Automatic Spoken Language Identification using a Time-Delay Neural Network. (arXiv:2205.09564v1 [cs.CL])
    Closed-set spoken language identification is the task of recognizing the language being spoken in a recorded audio clip from a set of known languages. In this study, a language identification system was built and trained to distinguish between Arabic, Spanish, French, and Turkish based on nothing more than recorded speech. A pre-existing multilingual dataset was used to train a series of acoustic models based on the Tedlium TDNN model to perform automatic speech recognition. The system was provided with a custom multilingual language model and a specialized pronunciation lexicon with language names prepended to phones. The trained model was used to generate phone alignments to test data from all four languages, and languages were predicted based on a voting scheme choosing the most common language prepend in an utterance. Accuracy was measured by comparing predicted languages to known languages, and was determined to be very high in identifying Spanish and Arabic, and somewhat lower in identifying Turkish and French.
    Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-based Beam Search. (arXiv:2205.09676v1 [cs.CV])
    Existing trackers usually select a location or proposal with the maximum score as the tracking result for each frame. However, such a greedy search scheme may not be the optimal choice, especially in challenging tracking scenarios like heavy occlusion and fast motion, since accumulated errors can make response scores unreliable. In this paper, we propose a novel multi-agent reinforcement learning based beam search strategy (termed BeamTracking) to address this issue. Specifically, we formulate tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which aims at picking out one sample as its tracking result in each frame. We take the target feature, proposal feature, and its response score as the state, and also consider actions predicted by nearby agents, to train multiple agents to select their actions. When all the frames are processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmark datasets validated the effectiveness of the proposed algorithm.
    scICML: Information-theoretic Co-clustering-based Multi-view Learning for the Integrative Analysis of Single-cell Multi-omics data. (arXiv:2205.09523v1 [stat.ML])
    Modern high-throughput sequencing technologies have enabled us to profile multiple molecular modalities from the same single cell, providing unprecedented opportunities to assay cellular heterogeneity from multiple biological layers. However, the datasets generated by these technologies tend to have a high level of noise and are highly sparse, posing challenges for data analysis. In this paper, we develop a novel information-theoretic co-clustering-based multi-view learning (scICML) method for multi-omics single-cell data integration. scICML utilizes co-clusterings to aggregate similar features for each view of data and uncover the common clustering pattern for cells. In addition, scICML automatically matches the clusters of the linked features across different data types to account for the biological dependency structure across different types of genomic features. Our experiments on four real-world datasets demonstrate that scICML improves the overall clustering performance and provides biological insights into the data analysis of peripheral blood mononuclear cells.
    The First Optimal Acceleration of High-Order Methods in Smooth Convex Optimization. (arXiv:2205.09647v1 [math.OC])
    In this paper, we study the fundamental open question of finding the optimal high-order algorithm for solving smooth convex minimization problems. Arjevani et al. (2019) established the lower bound $\Omega\left(\epsilon^{-2/(3p+1)}\right)$ on the number of the $p$-th order oracle calls required by an algorithm to find an $\epsilon$-accurate solution to the problem, where the $p$-th order oracle stands for the computation of the objective function value and the derivatives up to the order $p$. However, the existing state-of-the-art high-order methods of Gasnikov et al. (2019b); Bubeck et al. (2019); Jiang et al. (2019) achieve the oracle complexity $\mathcal{O}\left(\epsilon^{-2/(3p+1)} \log (1/\epsilon)\right)$, which does not match the lower bound. The reason for this is that these algorithms require performing a complex binary search procedure, which makes them neither optimal nor practical. We fix this fundamental issue by providing the first algorithm with $\mathcal{O}\left(\epsilon^{-2/(3p+1)}\right)$ $p$-th order oracle complexity.
    Machine learning applications for noisy intermediate-scale quantum computers. (arXiv:2205.09414v1 [quant-ph])
    Quantum machine learning has proven to be a fruitful area in which to search for potential applications of quantum computers. This is particularly true for those available in the near term, so-called noisy intermediate-scale quantum (NISQ) devices. In this Thesis, we develop and study three quantum machine learning applications suitable for NISQ computers, ordered in terms of increasing complexity of data presented to them. These algorithms are variational in nature and use parameterised quantum circuits (PQCs) as the underlying quantum machine learning model. The first application area is quantum classification using PQCs, where the data is classical feature vectors and their corresponding labels. Here, we study the robustness of certain data encoding strategies in such models against noise present in a quantum computer. The second area is generative modelling using quantum computers, where we use quantum circuit Born machines to learn and sample from complex probability distributions. We discuss and present a framework for quantum advantage for such models, propose gradient-based training methods and demonstrate these both numerically and on the Rigetti quantum computer up to 28 qubits. For our final application, we propose a variational algorithm in the area of approximate quantum cloning, where the data becomes quantum in nature. For the algorithm, we derive differentiable cost functions, prove theoretical guarantees such as faithfulness, and incorporate state-of-the-art methods such as quantum architecture search. Furthermore, we demonstrate how this algorithm is useful in discovering novel implementable attacks on quantum cryptographic protocols, focusing on quantum coin flipping and key distribution as examples.
    CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network. (arXiv:2205.09612v1 [cs.LG])
    In this paper, we propose a Classification Confidence Network (CLCNet) that can determine whether a classification model classifies input samples correctly. It takes a classification result in the form of a vector of any dimension and returns a confidence score as output, which represents the probability of an instance being classified correctly. We can utilize CLCNet in a simple cascade structure system consisting of several SOTA (state-of-the-art) classification models, and our experiments show that the system can achieve the following advantages: 1. The system can customize the average computation requirement (FLOPs) per image during inference. 2. Under the same computation requirement, the performance of the system can exceed that of any single model with the same architecture as the models in the system but a different size. In fact, this is a new type of ensemble modeling. Like general ensemble modeling, it can achieve higher performance than a single classification model, yet our system requires much less computation than general ensemble modeling. We have uploaded our code to a github repository: https://github.com/yaoching0/CLCNet-Rethinking-of-Ensemble-Modeling.
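    The cascade idea can be sketched as follows (our own illustration for a single input, not the authors' code): run a small model first, and fall back to a large model only when the confidence network scores the cheap prediction too low.

        import torch

        def cascade_predict(small_model, large_model, clcnet, x, threshold=0.9):
            logits = small_model(x)
            # clcnet maps the softmax vector to a probability of correctness.
            confidence = clcnet(torch.softmax(logits, dim=-1))
            if confidence.item() >= threshold:
                return logits.argmax(dim=-1)      # cheap path, saves FLOPs
            return large_model(x).argmax(dim=-1)  # expensive fallback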
    Practical Skills Demand Forecasting via Representation Learning of Temporal Dynamics. (arXiv:2205.09508v1 [econ.GN])
    Rapid technological innovation threatens to leave much of the global workforce behind. Today's economy juxtaposes white-hot demand for skilled labor against stagnant employment prospects for workers unprepared to participate in a digital economy. It is a moment of peril and opportunity for every country, with outcomes measured in long-term capital allocation and the life satisfaction of billions of workers. To meet the moment, governments and markets must find ways to quicken the rate at which the supply of skills reacts to changes in demand. More fully and quickly understanding labor market intelligence is one route. In this work, we explore the utility of time series forecasts to enhance the value of skill demand data gathered from online job advertisements. This paper presents a pipeline which makes one-shot multi-step forecasts into the future using a decade of monthly skill demand observations based on a set of recurrent neural network methods. We compare the performance of a multivariate model versus a univariate one, analyze how correlation between skills can influence multivariate model results, and present predictions of demand for a selection of skills practiced by workers in the information technology industry.
    Gold-standard solutions to the Schr\"odinger equation using deep learning: How much physics do we need?. (arXiv:2205.09438v1 [cs.LG])
    Finding accurate solutions to the Schr\"odinger equation is the key unsolved challenge of computational chemistry. Given its importance for the development of new chemical compounds, decades of research have been dedicated to this problem, but due to the large dimensionality, even the best available methods do not yet reach the desired accuracy. Recently the combination of deep learning with Monte Carlo methods has emerged as a promising way to obtain highly accurate energies and moderate scaling of computational cost. In this paper we significantly contribute towards this goal by introducing a novel deep-learning architecture that achieves 40-70% lower energy error at 8x lower computational cost compared to previous approaches. Using our method we establish a new benchmark by calculating the most accurate variational ground state energies ever published for a number of different atoms and molecules. We systematically break down and measure our improvements, focusing in particular on the effect of increasing physical prior knowledge. Surprisingly, we find that increasing the prior knowledge given to the architecture can actually decrease accuracy.
    Spatial Autoregressive Coding for Graph Neural Recommendation. (arXiv:2205.09489v1 [cs.IR])
    Graph embedding methods including traditional shallow models and deep Graph Neural Networks (GNNs) have led to promising applications in recommendation. Nevertheless, shallow models especially random-walk-based algorithms fail to adequately exploit neighbor proximity in sampled subgraphs or sequences due to their optimization paradigm. GNN-based algorithms suffer from the insufficient utilization of high-order information and easily cause over-smoothing problems when stacking too many layers, which may deteriorate the recommendations of low-degree (long-tail) items, limiting the expressiveness and scalability. In this paper, we propose a novel framework SAC, namely Spatial Autoregressive Coding, to solve the above problems in a unified way. To adequately leverage neighbor proximity and high-order information, we design a novel spatial autoregressive paradigm. Specifically, we first randomly mask multi-hop neighbors and embed the target node by integrating all other surrounding neighbors with an explicit multi-hop attention. Then we reinforce the model to learn a neighbor-predictive coding for the target node by contrasting the coding and the masked neighbors' embedding, equipped with a new hard negative sampling strategy. To learn the minimal sufficient representation for the target-to-neighbor prediction task and remove the redundancy of neighbors, we devise Neighbor Information Bottleneck by maximizing the mutual information between target predictive coding and the masked neighbors' embedding, and simultaneously constraining those between the coding and surrounding neighbors' embedding. Experimental results on both public recommendation datasets and a real scenario web-scale dataset Douyin-Friend-Recommendation demonstrate the superiority of SAC compared with state-of-the-art methods.
    Learning-based AC-OPF Solvers on Realistic Network and Realistic Loads. (arXiv:2205.09452v1 [cs.LG])
    Deep learning approaches for the Alternating Current-Optimal Power Flow (AC-OPF) problem have been under active research in recent years. A common shortcoming in this area of research is the lack of a dataset that includes both a realistic power network topology and the corresponding realistic loads. To address this issue, we construct an AC-OPF formulation-ready dataset called TAS-97 that contains realistic network information and realistic bus loads from Tasmania's electricity network. We found that the realistic loads in Tasmania are correlated between buses and show signs of an underlying multivariate normal distribution. Feasibility-optimized end-to-end deep neural network models are trained and tested on the constructed dataset. Trained on samples with bus loads generated from a fitted multivariate normal distribution, our learning-based AC-OPF solver achieves a 0.13% cost optimality gap, a 99.73% feasibility rate, and a 38.62x speedup on realistic testing samples when compared to PYPOWER.
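    As a sketch of the load-generation step described above (the file name and array layout are hypothetical), correlated training loads can be sampled from a multivariate normal fitted to the historical bus loads:

        import numpy as np

        loads = np.load("bus_loads.npy")   # hypothetical (T, n_bus) history
        mu = loads.mean(axis=0)
        cov = np.cov(loads, rowvar=False)  # captures inter-bus correlation
        samples = np.random.multivariate_normal(mu, cov, size=10_000)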
    Why only Micro-F1? Class Weighting of Measures for Relation Classification. (arXiv:2205.09460v1 [cs.CL])
    Relation classification models are conventionally evaluated using only a single measure, e.g., micro-F1, macro-F1 or AUC. In this work, we analyze weighting schemes, such as micro and macro, for imbalanced datasets. We introduce a framework for weighting schemes in which existing schemes are extreme cases, and propose two new intermediate schemes. We show that reporting results under different weighting schemes better highlights the strengths and weaknesses of a model.
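    The two extreme schemes are easy to contrast in code; in scikit-learn the 'average' argument selects the weighting (the toy labels below are our own):

        from sklearn.metrics import f1_score

        y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
        y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0]
        # Micro-F1 pools all decisions, so the majority class dominates.
        print(f1_score(y_true, y_pred, average="micro", zero_division=0))
        # Macro-F1 weights every class equally, exposing minority-class errors.
        print(f1_score(y_true, y_pred, average="macro", zero_division=0))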
    Constraint-Based Causal Structure Learning from Undersampled Graphs. (arXiv:2205.09235v1 [stat.ML])
    Graphical structures estimated by causal learning algorithms from time series data can provide highly misleading causal information if the causal timescale of the generating process fails to match the measurement timescale of the data. Although this problem has been recently recognized, practitioners have limited resources to respond to it, and so must continue using models that they know are likely misleading. Existing methods either (a) require that the difference between causal and measurement timescales is known; or (b) can handle only a very small number of random variables when the timescale difference is unknown; or (c) apply only to pairs of variables, though with fewer assumptions about prior knowledge; or (d) return impractically many solutions. This paper addresses all four challenges. We combine constraint programming with both theoretical insights into the problem structure and prior information about admissible causal interactions. The resulting system provides a practical approach that scales to significantly larger sets (>100) of random variables, does not require precise knowledge of the timescale difference, supports edge misidentification and parametric connection strengths, and can provide the optimum choice among many possible solutions. The cumulative impact of these improvements is a gain of multiple orders of magnitude in speed and informativeness.
    Mobility, Communication and Computation Aware Federated Learning for Internet of Vehicles. (arXiv:2205.09529v1 [cs.LG])
    While privacy concerns entice connected and automated vehicles to incorporate on-board federated learning (FL) solutions, a learning platform that integrates vehicle-to-everything communication and is aware of heterogeneous computation power is urgently needed to make this a reality. Motivated by this, we propose a novel mobility, communication and computation aware online FL platform that uses on-road vehicles as learning agents. Thanks to the advanced features of modern vehicles, the on-board sensors can collect data as vehicles travel along their trajectories, while the on-board processors can train machine learning models using the collected data. To take the high mobility of vehicles into account, we consider the delay as a learning parameter and restrict it to be less than a tolerable threshold. To satisfy this threshold, the central server accepts partially trained models, the distributed roadside units (a) perform downlink multicast beamforming to minimize global model distribution delay and (b) allocate optimal uplink radio resources to minimize local model offloading delay, and the vehicle agents conduct heterogeneous local model training. Using real-world vehicle trace datasets, we validate our FL solutions. Simulations show that the proposed integrated FL platform is robust and outperforms baseline models. With reasonable local training episodes, it can effectively satisfy all constraints and deliver near-ground-truth multi-horizon velocity and vehicle-specific power predictions.
    Learning Multiscale Convolutional Dictionaries for Image Reconstruction. (arXiv:2011.12815v3 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) have been tremendously successful in solving imaging inverse problems. To understand their success, an effective strategy is to construct simpler and mathematically more tractable convolutional sparse coding (CSC) models that share essential ingredients with CNNs. Existing CSC methods, however, underperform leading CNNs in challenging inverse problems. We hypothesize that the performance gap may be attributed in part to how they process images at different spatial scales: While many CNNs use multiscale feature representations, existing CSC models mostly rely on single-scale dictionaries. To close the performance gap, we thus propose a multiscale convolutional dictionary structure. The proposed dictionary structure is derived from the U-Net, arguably the most versatile and widely used CNN for image-to-image learning problems. We show that incorporating the proposed multiscale dictionary in an otherwise standard CSC framework yields performance competitive with state-of-the-art CNNs across a range of challenging inverse problems including CT and MRI reconstruction. Our work thus demonstrates the effectiveness and scalability of the multiscale CSC approach in solving challenging inverse problems.
    ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution. (arXiv:2205.09548v1 [q-bio.BM])
    Directed evolution is a versatile technique in protein engineering that mimics the process of natural selection by iteratively alternating between mutagenesis and screening in order to search for sequences that optimize a given property of interest, such as catalytic activity and binding affinity to a specified target. However, the space of possible proteins is too large to search exhaustively in the laboratory, and functional proteins are scarce in the vast sequence space. Machine learning (ML) approaches can accelerate directed evolution by learning to map protein sequences to functions without building a detailed model of the underlying physics, chemistry and biological pathways. Despite the great potential of these ML methods, they encounter severe challenges in identifying the most suitable sequences for a targeted function. These failures can be attributed to the common practice of adopting a high-dimensional feature representation for protein sequences and inefficient search methods. To address these issues, we propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution, termed ODBO, which employs a combination of a novel low-dimensional protein encoding strategy and Bayesian optimization enhanced with search space prescreening via outlier detection. We further design an initial sample selection strategy to minimize the number of experimental samples needed for training ML models. We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest. We expect the ODBO framework to greatly reduce the experimental and time costs of directed evolution, and it can be further generalized as a powerful tool for adaptive experimental design in a broader context.
    An Introduction to Quantum Machine Learning for Engineers. (arXiv:2205.09510v1 [quant-ph])
    In the current noisy intermediate-scale quantum (NISQ) era, quantum machine learning is emerging as a dominant paradigm to program gate-based quantum computers. In quantum machine learning, the gates of a quantum circuit are parametrized, and the parameters are tuned via classical optimization based on data and on measurements of the outputs of the circuit. Parametrized quantum circuits (PQCs) can efficiently address combinatorial optimization problems, implement probabilistic generative models, and carry out inference (classification and regression). This monograph provides a self-contained introduction to quantum machine learning for an audience of engineers with a background in probability and linear algebra. It first describes the background, concepts, and tools necessary to describe quantum operations and measurements. Then, it covers parametrized quantum circuits, the variational quantum eigensolver, as well as unsupervised and supervised quantum machine learning formulations.
    Neural ODEs with Irregular and Noisy Data. (arXiv:2205.09479v1 [cs.LG])
    Measurement noise is an integral part of collecting data from a physical process. Thus, noise removal is necessary to draw conclusions from these data, and it often becomes essential for constructing dynamical models from these data. We discuss a methodology to learn differential equation(s) from noisy and irregularly sampled measurements. In our methodology, the main innovation lies in the integration of deep neural networks with the neural ordinary differential equations (ODEs) approach. Precisely, we aim at learning a neural network that provides (approximately) an implicit representation of the data and an additional neural network that models the vector fields of the dependent variables. We combine these two networks by constraining them using neural ODEs. The proposed framework to learn a model describing the vector field is highly effective under noisy measurements. The approach can handle scenarios where dependent variables are not available at the same temporal grid. Moreover, a particular structure, e.g., second-order with respect to time, can easily be incorporated. We demonstrate the effectiveness of the proposed method for learning models using data obtained from various differential equations and present a comparison with the neural ODE method that does not make any special treatment of noise.
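    For readers unfamiliar with the neural ODE building block, a generic usage sketch with the torchdiffeq package (not the authors' implementation) looks like this:

        import torch
        import torch.nn as nn
        from torchdiffeq import odeint

        class VectorField(nn.Module):
            def __init__(self, dim=2, hidden=32):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

            def forward(self, t, x):  # dx/dt = f(t, x)
                return self.net(x)

        f = VectorField()
        x0 = torch.tensor([[1.0, 0.0]])
        t = torch.linspace(0.0, 5.0, 50)
        traj = odeint(f, x0, t)  # differentiable trajectory, shape (50, 1, 2)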
    Variational Inference for Bayesian Bridge Regression. (arXiv:2205.09515v1 [stat.ML])
    We study the implementation of Automatic Differentiation Variational Inference (ADVI) for Bayesian inference on regression models with bridge penalization. The bridge approach uses the $\ell_{\alpha}$ norm, with $\alpha \in (0, +\infty)$, to define a penalization on large values of the regression coefficients, which includes the Lasso ($\alpha = 1$) and ridge ($\alpha = 2$) penalizations as special cases. Full Bayesian inference seamlessly provides joint uncertainty estimates for all model parameters. Although MCMC approaches are available for bridge regression, they can be slow for large datasets, especially in high dimensions. The ADVI implementation allows the use of small batches of data at each iteration (due to stochastic gradient based algorithms), therefore speeding up computational time in comparison with MCMC. We illustrate the approach on non-parametric regression models with B-splines, although the method works seamlessly for other choices of basis functions. A simulation study shows the main properties of the proposed method.
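    The bridge penalty itself is a one-liner; the following sketch (our own) shows the penalized least-squares objective, with $\alpha = 1$ recovering the Lasso and $\alpha = 2$ the ridge penalty:

        import numpy as np

        def bridge_objective(beta, X, y, lam, alpha):
            residual = y - X @ beta
            return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(beta) ** alpha)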
    Riemannian Metric Learning via Optimal Transport. (arXiv:2205.09244v1 [cs.LG])
    We introduce an optimal transport-based model for learning a metric tensor from cross-sectional samples of evolving probability measures on a common Riemannian manifold. We neurally parametrize the metric as a spatially-varying matrix field and efficiently optimize our model's objective using backpropagation. Using this learned metric, we can nonlinearly interpolate between probability measures and compute geodesics on the manifold. We show that metrics learned using our method improve the quality of trajectory inference on scRNA and bird migration data at the cost of little additional cross-sectional data.
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v1 [cs.LG])
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyperparameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyperparameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, a numerical example is provided to explore the advantages of the super approximation power of ReLU NestNets.
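    A toy rendering of the recursion (our own illustration, not the construction used in the paper's proofs) makes the height dimension concrete: each hidden neuron of a height-$s$ network is activated by a height-$(s-1)$ network, bottoming out at ReLU.

        import torch
        import torch.nn as nn

        class NestNet(nn.Module):
            def __init__(self, height, width=8):
                super().__init__()
                self.lin_in = nn.Linear(1, width)
                self.lin_out = nn.Linear(width, 1)
                # Height 1 degenerates to a standard ReLU network.
                self.act = NestNet(height - 1, width) if height > 1 else None

            def forward(self, x):
                h = self.lin_in(x)
                if self.act is None:
                    h = torch.relu(h)
                else:
                    # Apply the sub-network elementwise as a scalar activation.
                    h = self.act(h.reshape(-1, 1)).reshape(h.shape)
                return self.lin_out(h)

        net = NestNet(height=3)
        y = net(torch.randn(4, 1))  # scalar-input, scalar-output toy network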
    Torchhd: An Open-Source Python Library to Support Hyperdimensional Computing Research. (arXiv:2205.09208v1 [cs.LG])
    Hyperdimensional Computing (HDC) is a neuro-inspired computing framework that exploits high-dimensional random vector spaces. HDC uses extremely parallelizable arithmetic to provide computational solutions that balance accuracy, efficiency and robustness. This has proven especially useful in resource-limited scenarios such as embedded systems. The commitment of the scientific community to aggregate and disseminate research in this particularly multidisciplinary field has been fundamental for its advancement. Adding to this effort, we propose Torchhd, a high-performance open-source Python library for HDC. Torchhd seeks to make HDC more accessible and serves as an efficient foundation for research and application development. The easy-to-use library builds on top of PyTorch and features state-of-the-art HDC functionality, clear documentation and implementation examples from notable publications. Comparing publicly available code with their Torchhd implementation shows that experiments can run up to 104$\times$ faster. Torchhd is available at: https://github.com/hyperdimensional-computing/torchhd
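    The core HDC operations are simple enough to sketch in plain PyTorch (our own illustration of the general framework, not the Torchhd API):

        import torch

        d = 10_000  # hypervector dimensionality

        def random_hv():
            return torch.randint(0, 2, (d,)).float() * 2 - 1  # entries in {-1, +1}

        key, value, other = random_hv(), random_hv(), random_hv()
        bound = key * value                               # binding associates key and value
        memory = torch.sign(bound + other + random_hv())  # bundling superposes vectors

        # Unbinding from the bundle recovers a noisy copy of value: its cosine
        # similarity to the true value is far above that of a random vector.
        recovered = key * memory
        print(torch.cosine_similarity(recovered, value, dim=0))
        print(torch.cosine_similarity(random_hv(), value, dim=0))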
    TransTab: Learning Transferable Tabular Transformers Across Tables. (arXiv:2205.09328v1 [cs.LG])
    Tabular data (or tables) are the most widely used data format in machine learning (ML). However, ML models often assume the table structure remains fixed in training and testing. Before ML modeling, heavy data cleaning is required to merge disparate tables with different columns. This preprocessing often incurs significant data waste (e.g., removing unmatched columns and samples). How can we learn ML models from multiple tables with partially overlapping columns? How can we incrementally update ML models as more columns become available over time? Can we leverage model pretraining on multiple distinct tables? How can we train an ML model which can predict on an unseen table? To answer all those questions, we propose to relax fixed table structures by introducing a Transferable Tabular Transformer (TransTab) for tables. The goal of TransTab is to convert each sample (a row in the table) to a generalizable embedding vector, and then apply stacked transformers for feature encoding. One methodology insight is combining column description and table cells as the raw input to a gated transformer model. The other insight is to introduce supervised and self-supervised pretraining to improve model performance. We compare TransTab with multiple baseline methods on diverse benchmark datasets and five oncology clinical trial datasets. Overall, TransTab ranks 1.00, 1.00, 1.78 out of 12 methods in supervised learning, feature incremental learning, and transfer learning scenarios, respectively; and the proposed pretraining leads to a 2.3\% AUC lift on average over supervised learning.
    ESCADA: Efficient Safety and Context Aware Dose Allocation for Precision Medicine. (arXiv:2111.13415v2 [cs.LG] UPDATED)
    Finding an optimal individualized treatment regimen is considered one of the most challenging precision medicine problems. Various patient characteristics influence the response to the treatment, and hence, there is no one-size-fits-all regimen. Moreover, the administration of an unsafe dose during the treatment can have adverse effects on health. Therefore, a treatment model must ensure patient \emph{safety} while \emph{efficiently} optimizing the course of therapy. We study a prevalent medical problem where the treatment aims to keep a physiological variable in a safe range and preferably close to a target level, which we refer to as \emph{leveling}. Such a task may be relevant in numerous other domains as well. We propose ESCADA, a novel and generic multi-armed bandit (MAB) algorithm tailored for the leveling task, to make safe, personalized, and context-aware dose recommendations. We derive high probability upper bounds on its cumulative regret and safety guarantees. Following ESCADA's design, we also describe its Thompson sampling-based counterpart. We discuss why the straightforward adaptations of the classical MAB algorithms such as GP-UCB may not be a good fit for the leveling task. Finally, we make \emph{in silico} experiments on the bolus-insulin dose allocation problem in type-1 diabetes mellitus disease and compare our algorithms against the famous GP-UCB algorithm, the rule-based dose calculators, and a clinician.  ( 2 min )
    Deep Fusion Prior for Multi-Focus Image Super Resolution Fusion. (arXiv:2110.05706v3 [cs.CV] UPDATED)
    This paper unifies the multi-focus image fusion (MFIF) and blind super resolution (SR) problems as the multi-focus image super resolution fusion (MFISRF) task, and proposes a novel unified dataset-free unsupervised framework named deep fusion prior (DFP) to address this MFISRF task. DFP consists of the SKIPnet network, the DoubleReblur focus measurement tactic, a decision embedding module and loss functions. In particular, DFP can obtain MFISRF from only two low-resolution inputs without any external dataset; SKIPnet, implementing unsupervised learning via deep image prior, is an end-to-end generative network acting as the engine of DFP; DoubleReblur is used to determine the primary decision map without learning, based on estimated PSF and Gaussian kernel convolution; the decision embedding module optimizes the decision map via learning; and the DFP losses, composed of content loss, joint gradient loss and gradient limit loss, obtain high-quality MFISRF results robustly. Experiments show that our proposed DFP approaches, and even outperforms, state-of-the-art MFIF and SR method combinations. Additionally, DFP is a general framework, thus its networks and focus measurement tactics can be continuously updated to further improve the MFISRF performance. DFP codes are open source and will be available soon at this http URL  ( 2 min )
    CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm. (arXiv:2103.00858v4 [cs.DB] UPDATED)
    Learned indexes, which use machine learning models to replace traditional index structures, have shown promising results in recent studies. However, existing learned indexes exhibit a performance gap between synthetic and real-world datasets, making them far from practical indexes. In this paper, we identify that ignoring the importance of data partitioning during model training is the main reason for this problem. Thus, we explicitly apply data partitioning to index construction and propose a new efficient and updatable cache-aware RMI framework, called CARMI. Specifically, we introduce entropy as a metric to quantify and characterize the effectiveness of data partitioning of tree nodes in learned indexes and propose a novel cost model, laying a new theoretical foundation for future research. Then, based on our novel cost model, CARMI can automatically determine tree structures and model types under various datasets and workloads by a hybrid construction algorithm without any manual tuning. Furthermore, since memory accesses limit the performance of RMIs, a new cache-aware design is also applied in CARMI, which makes full use of the characteristics of the CPU cache to effectively reduce the number of memory accesses. Our experimental study shows that CARMI performs better than baselines, achieving an average of 2.2x/1.9x speedup compared to B+ Tree/ALEX, while using only about 0.77x memory space of B+ Tree. On the SOSD platform, CARMI outperforms all baselines, with an average speedup of 1.2x over the nearest competitor RMI, which has been carefully tuned for each dataset in advance.  ( 2 min )
    Denoising Noisy Neural Networks: A Bayesian Approach with Compensation. (arXiv:2105.10699v3 [cs.LG] UPDATED)
    Deep neural networks (DNNs) with noisy weights, which we refer to as noisy neural networks (NoisyNNs), arise from the training and inference of DNNs in the presence of noise. NoisyNNs emerge in many new applications, including the wireless transmission of DNNs, the efficient deployment or storage of DNNs in analog devices, and the truncation or quantization of DNN weights. This paper studies a fundamental problem of NoisyNNs: how to reconstruct the DNN weights from their noisy manifestations. While all prior works relied on the maximum likelihood (ML) estimation, this paper puts forth a denoising approach to reconstruct DNNs with the aim of maximizing the inference accuracy of the reconstructed models. The superiority of our denoiser is rigorously proven in two small-scale problems, wherein we consider a quadratic neural network function and a shallow feedforward neural network, respectively. When applied to advanced learning tasks with modern DNN architectures, our denoiser exhibits significantly better performance than the ML estimator. We measure performance by the average test accuracy of the denoised DNN model versus the weight variance to noise power ratio (WNR). When denoising a noisy ResNet34 model arising from noisy inference, our denoiser outperforms ML estimation by up to 4.1 dB to achieve a test accuracy of 60%. When denoising a noisy ResNet18 model arising from noisy training, our denoiser outperforms ML estimation by 13.4 dB and 8.3 dB to achieve test accuracies of 60% and 80%, respectively.  ( 2 min )
    Privacy preserving n-party scalar product protocol. (arXiv:2112.09436v2 [cs.CR] UPDATED)
    Privacy-preserving machine learning enables the training of models on decentralized datasets without the need to reveal the data, on both horizontally and vertically partitioned data. However, it relies on specialized techniques and algorithms to perform the necessary computations. The privacy-preserving scalar product protocol, which enables the dot product of vectors without revealing them, is one popular example of this versatility. Unfortunately, the solutions currently proposed in the literature focus mainly on two-party scenarios, even though scenarios with a higher number of data parties are becoming more relevant, for example when performing analyses that require counting the number of samples which fulfill certain criteria defined across various sites, such as calculating the information gain at a node in a decision tree. In this paper we propose a generalization of the protocol for an arbitrary number of parties, based on an existing two-party method. Our proposed solution relies on a recursive resolution of smaller scalar products. After describing our proposed method, we discuss potential scalability issues. Finally, we describe the privacy guarantees, identify any concerns, and compare the proposed method to the original solution in this respect.  ( 2 min )
    Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients. (arXiv:2109.11788v3 [cs.LG] UPDATED)
    Approximation of the value functions in value-based deep reinforcement learning induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have a high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We first address the detrimental issues in the existing approaches that aim to overcome such underestimation error. Then, through extensive statistical analysis, we introduce a novel, parameter-free Deep Q-learning variant to reduce this underestimation bias in deterministic policy gradients. By sampling the weights of a linear combination of two approximate critics from a highly shrunk estimation bias interval, our Q-value update rule is not affected by the variance of the rewards received by the agents throughout learning. We test the performance of the introduced improvement on a set of MuJoCo and Box2D continuous control tasks and demonstrate that it considerably outperforms the existing approaches and improves the state-of-the-art by a significant margin.  ( 2 min )
    CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning. (arXiv:2112.04564v3 [cs.CV] UPDATED)
    In this paper, we propose a novel co-learning framework (CoSSL) with decoupled representation learning and classifier learning for imbalanced SSL. To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE) for classifier learning. Furthermore, the current evaluation protocol for imbalanced SSL focuses only on balanced test sets, which has limited practicality in real-world scenarios. Therefore, we further conduct a comprehensive evaluation under various shifted test distributions. In experiments, we show that our approach outperforms other methods over a large range of shifted distributions, achieving state-of-the-art performance on benchmark datasets ranging from CIFAR-10, CIFAR-100, ImageNet, to Food-101. Our code will be made publicly available.  ( 2 min )
    M3E2: Multi-gate Mixture-of-experts for Multi-treatment Effect Estimation. (arXiv:2112.07574v2 [cs.LG] UPDATED)
    This work proposes the M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines in three synthetic benchmark datasets: two with multiple treatments and one with one treatment. Our analysis showed that our method has superior performance, making more assertive estimations of the multiple treatment effects.  ( 2 min )
    DBSegment: Fast and robust segmentation of deep brain structures -- Evaluation of transportability across acquisition domains. (arXiv:2110.09473v3 [eess.IV] UPDATED)
    Segmenting deep brain structures from magnetic resonance images is important for patient diagnosis, surgical planning, and research. Most current state-of-the-art solutions follow a segmentation-by-registration approach, where subject MRIs are mapped to a template with well-defined segmentations. However, registration-based pipelines are time-consuming, thus, limiting their clinical use. This paper uses deep learning to provide a robust and efficient deep brain segmentation solution. The method consists of a pre-processing step to conform all MRI images to the same orientation, followed by a convolutional neural network using the nnU-Net framework. We use a total of 14 datasets from both research and clinical collections. Of these, seven were used for training and validation and seven were retained for independent testing. We trained the network to segment 30 deep brain structures, as well as a brain mask, using labels generated from a registration-based approach. We evaluated the generalizability of the network by performing a leave-one-dataset-out cross-validation, and extensive testing on external datasets. Furthermore, we assessed cross-domain transportability by evaluating the results separately on different domains. We achieved an average DSC of 0.89 $\pm$ 0.04 on the independent testing datasets when compared to the registration-based gold standard. On our test system, the computation time decreased from 42 minutes for a reference registration-based pipeline to 1 minute. Our proposed method is fast, robust, and generalizes with high reliability. It can be extended to the segmentation of other brain structures. The method is publicly available on GitHub, as well as a pip package for convenient usage.  ( 3 min )
    Foundation Posteriors for Approximate Probabilistic Inference. (arXiv:2205.09735v1 [cs.LG])
    Probabilistic programs provide an expressive representation language for generative models. Given a probabilistic program, we are interested in the task of posterior inference: estimating a latent variable given a set of observed variables. Existing techniques for inference in probabilistic programs often require choosing many hyper-parameters, are computationally expensive, and/or only work for restricted classes of programs. Here we formulate inference as masked language modeling: given a program, we generate a supervised dataset of variables and assignments, and randomly mask a subset of the assignments. We then train a neural network to unmask the random values, defining an approximate posterior distribution. By optimizing a single neural network across a range of programs we amortize the cost of training, yielding a ``foundation'' posterior able to do zero-shot inference for new programs. The foundation posterior can also be fine-tuned for a particular program and dataset by optimizing a variational inference objective. We show the efficacy of the approach, zero-shot and fine-tuned, on a benchmark of STAN programs.  ( 2 min )
    A Mutually Exciting Latent Space Hawkes Process Model for Continuous-time Networks. (arXiv:2205.09263v1 [cs.LG])
    Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We propose an alternating minimization algorithm to jointly estimate the latent positions of the nodes and other model parameters. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretability compared to existing models.
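    For reference, the conditional intensity of a mutually exciting Hawkes process with an exponential kernel takes the textbook form below (a generic sketch, not the paper's exact LSH parametrization):

        import numpy as np

        # lambda(t) = mu + sum_{t_k < t} a * beta * exp(-beta * (t - t_k))
        def hawkes_intensity(t, event_times, mu, a, beta):
            past = event_times[event_times < t]
            return mu + a * beta * np.sum(np.exp(-beta * (t - past)))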
    Deep Learning in Business Analytics: A Clash of Expectations and Reality. (arXiv:2205.09337v1 [cs.LG])
    Our fast-paced digital economy shaped by global competition requires increased data-driven decision-making based on artificial intelligence (AI) and machine learning (ML). The benefits of deep learning (DL) are manifold, but it comes with limitations that have - so far - interfered with widespread industry adoption. This paper explains why DL - despite its popularity - has had difficulty speeding up its adoption within business analytics. It is shown - by a mixture of content analysis and empirical study - that the adoption of deep learning is not only affected by computational complexity, lacking big data architecture, lack of transparency (black-box), and skill shortage, but also by the fact that DL does not outperform traditional ML models in the case of structured datasets with fixed-length feature vectors. Deep learning should be regarded as a powerful addition to the existing body of ML models instead of a one-size-fits-all solution.
    Darts: User-Friendly Modern Machine Learning for Time Series. (arXiv:2110.03224v3 [cs.LG] UPDATED)
    We present Darts, a Python machine learning library for time series, with a focus on forecasting. Darts offers a variety of models, from classics such as ARIMA to state-of-the-art deep neural networks. The emphasis of the library is on offering modern machine learning functionalities, such as supporting multidimensional series, meta-learning on multiple series, training on large datasets, incorporating external data, ensembling models, and providing a rich support for probabilistic forecasting. At the same time, great care goes into the API design to make it user-friendly and easy to use. For instance, all models can be used using fit()/predict(), similar to scikit-learn.  ( 2 min )
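    A hedged sketch of the advertised fit()/predict() workflow (the CSV file and its column names are hypothetical):

        import pandas as pd
        from darts import TimeSeries
        from darts.models import ExponentialSmoothing

        df = pd.read_csv("monthly_demand.csv")  # hypothetical 'month'/'demand' columns
        series = TimeSeries.from_dataframe(df, "month", "demand")

        model = ExponentialSmoothing()
        model.fit(series)
        forecast = model.predict(36)  # 36 steps ahead, scikit-learn-like API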
    Bypassing Logits Bias in Online Class-Incremental Learning with a Generative Framework. (arXiv:2205.09347v1 [cs.LG])
    Continual learning requires the model to maintain the learned knowledge while learning from a non-i.i.d data stream continually. Due to the single-pass training setting, online continual learning is very challenging, but it is closer to the real-world scenarios where quick adaptation to new data is appealing. In this paper, we focus on the online class-incremental learning setting in which new classes emerge over time. Almost all existing methods are replay-based with a softmax classifier. However, the inherent logits bias problem in the softmax classifier is a major cause of catastrophic forgetting, while existing solutions are not applicable to online settings. To bypass this problem, we abandon the softmax classifier and propose a novel generative framework based on the feature space. In our framework, a generative classifier which utilizes replay memory is used for inference, and the training objective is a pair-based metric learning loss which is proven theoretically to optimize the feature space in a generative way. In order to improve the ability to learn new data, we further propose a hybrid of generative and discriminative loss to train the model. Extensive experiments on several benchmarks, including newly introduced task-free datasets, show that our method beats a series of state-of-the-art replay-based methods with discriminative classifiers, and reduces catastrophic forgetting consistently by a remarkable margin.
    IL-flOw: Imitation Learning from Observation using Normalizing Flows. (arXiv:2205.09251v1 [cs.LG])
    We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only. Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods which require updating the reward model during policy search and are known to be unstable and difficult to optimize. Our method, IL-flOw, recovers the expert policy by modelling state-state transitions, by generating rewards using deep density estimators trained on the demonstration trajectories, avoiding the instability issues of adversarial methods. We demonstrate that using the state transition log-probability density as a reward signal for forward reinforcement learning translates to matching the trajectory distribution of the expert demonstrations, and experimentally show good recovery of the true reward signal as well as state-of-the-art results for imitation from observation on locomotion and robotic continuous control tasks.  ( 2 min )
    Continual Pre-Training Mitigates Forgetting in Language and Vision. (arXiv:2205.09357v1 [cs.LG])
    Pre-trained models are nowadays a fundamental component of machine learning research. In continual learning, they are commonly used to initialize the model before training on the stream of non-stationary data. However, pre-training is rarely applied during continual learning. We formalize and investigate the characteristics of the continual pre-training scenario in both language and vision environments, where a model is continually pre-trained on a stream of incoming data and only later fine-tuned to different downstream tasks. We show that continually pre-trained models are robust against catastrophic forgetting and we provide strong empirical evidence supporting the fact that self-supervised pre-training is more effective in retaining previous knowledge than supervised protocols. Code is provided at https://github.com/AndreaCossu/continual-pretraining-nlp-vision .  ( 2 min )
    Consistent Interpolating Ensembles via the Manifold-Hilbert Kernel. (arXiv:2205.09342v1 [stat.ML])
    Recent research in the theory of overparametrized learning has sought to establish generalization guarantees in the interpolating regime. Such results have been established for a few common classes of methods, but so far not for ensemble methods. We devise an ensemble classification method that simultaneously interpolates the training data and is consistent for a broad class of data distributions. To this end, we define the manifold-Hilbert kernel for data distributed on a Riemannian manifold. We prove that kernel smoothing regression using the manifold-Hilbert kernel is weakly consistent in the setting of Devroye et al. (1998). For the sphere, we show that the manifold-Hilbert kernel can be realized as a weighted random partition kernel, which arises as an infinite ensemble of partition-based classifiers.  ( 2 min )
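    For context, the Euclidean Hilbert kernel regression estimate of Devroye et al. (1998) takes the form below (recalled from that reference, not quoted from this paper); the manifold-Hilbert construction replaces the Euclidean distance with the geodesic one:

        % Hilbert kernel regression estimate on R^d (Devroye et al., 1998)
        \[
        \hat{f}_n(x) \;=\;
        \frac{\sum_{i=1}^{n} y_i \,\lVert x - x_i \rVert^{-d}}
             {\sum_{j=1}^{n} \lVert x - x_j \rVert^{-d}},
        \qquad K(u) = \lVert u \rVert^{-d}.
        \]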
    SiReN: Sign-Aware Recommendation Using Graph Neural Networks. (arXiv:2108.08735v2 [cs.IR] UPDATED)
    In recent years, many recommender systems using network embedding (NE), such as graph neural networks (GNNs), have been extensively studied with the aim of improving recommendation accuracy. However, such attempts have focused mostly on utilizing only the information of positive user-item interactions with high ratings. Thus, a challenge remains in how to make use of low rating scores for representing users' preferences, since low ratings can still be informative when designing NE-based recommender systems. In this study, we present SiReN, a new sign-aware recommender system based on GNN models. Specifically, SiReN has three key components: 1) constructing a signed bipartite graph for more precisely representing users' preferences, which is split into two edge-disjoint graphs, one with positive and one with negative edges; 2) generating two embeddings for the partitioned graphs via a GNN model and a multi-layer perceptron (MLP), respectively, and then using an attention model to obtain the final embeddings; and 3) establishing a sign-aware Bayesian personalized ranking (BPR) loss function in the optimization process. Through comprehensive experiments, we empirically demonstrate that SiReN consistently outperforms state-of-the-art NE-aided recommendation methods.
    Semi-Supervised Learning for Image Classification using Compact Networks in the BioMedical Context. (arXiv:2205.09678v1 [cs.CV])
    The development of mobile and edge applications that embed deep convolutional neural models has the potential to revolutionise biomedicine. However, most deep learning models require computational resources that are not available on smartphones or edge devices; this issue can be addressed by means of compact models, which are usually less accurate than bigger models. In this work, we study how this limitation can be addressed by applying semi-supervised learning techniques. We conduct several statistical analyses to compare the performance of deep compact architectures when trained using semi-supervised learning methods for image classification tasks in the biomedical context. In particular, we explore three families of compact networks and two families of semi-supervised learning techniques across 10 biomedical tasks. By combining semi-supervised learning methods with compact networks, it is possible to obtain performance similar to that of standard-size networks. In general, the best results are obtained when combining data distillation with MixNet, and plain distillation with ResNet-18; also, NAS networks generally obtain better results than manually designed networks and quantized networks. This work shows the benefits of applying semi-supervised methods to compact networks: it allows us to create compact models that are not only as accurate as standard-size models, but also faster and lighter. Finally, we have developed a library that simplifies the construction of compact models using semi-supervised learning methods.
    A Graph Data Augmentation Strategy with Entropy Preservation. (arXiv:2107.06048v2 [cs.LG] UPDATED)
    The Graph Convolutional Network (GCN) proposed by Kipf and Welling is an effective model for semi-supervised learning, but it faces the obstacle of over-smoothing, which weakens its representation ability. Recently, some works have been proposed to tackle this limitation by randomly perturbing the graph topology or feature matrix to generate data augmentations as input for training. However, these operations inevitably damage the integrity of information structures and sacrifice the smoothness of the feature manifold. In this paper, we first introduce a novel graph entropy definition as a measure to quantitatively evaluate the smoothness of a data manifold, and then point out that this graph entropy is controlled by triangle-motif-based information structures. To preserve graph entropy, we propose an effective strategy that generates randomly perturbed training data while maintaining both the graph topology and the graph entropy. Extensive experiments conducted on real-world datasets verify the effectiveness of our proposed method in improving semi-supervised node classification accuracy compared with a range of baselines. Beyond that, our proposed approach significantly enhances the robustness of the training process for GCN.
    RankGen: Improving Text Generation with Large Ranking Models. (arXiv:2205.09726v1 [cs.CL])
    Given an input sequence (or prefix), modern language models often assign high probabilities to output sequences that are repetitive, incoherent, or irrelevant to the prefix; as such, model-generated text also contains such artifacts. To address these issues, we present RankGen, an encoder model (1.2B parameters) that scores model generations given a prefix. RankGen can be flexibly incorporated as a scoring function in beam search and used to decode from any pretrained language model. We train RankGen using large-scale contrastive learning to map a prefix close to the ground-truth sequence that follows it and far away from two types of negatives: (1) random sequences from the same document as the prefix, which discourage topically-similar but irrelevant generations; and (2) sequences generated from a large language model conditioned on the prefix, which discourage repetition and hallucination. Experiments across four different language models (345M-11B parameters) and two domains show that RankGen significantly outperforms decoding algorithms like nucleus, top-k, and typical sampling on both automatic metrics (85.0 vs 77.3 MAUVE) and human evaluations with English writers (74.5% human preference over nucleus sampling). Analysis reveals that RankGen outputs are more relevant to the prefix and improve continuity and coherence compared to baselines. We open-source our model checkpoints, code, and human preference data with detailed explanations for future research.
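    The over-generate-and-rank pattern the abstract describes looks roughly like the sketch below; the embedding function here is a hypothetical stand-in for the released RankGen encoder, used only to show the control flow:

        # Over-generate-and-rank. `embed` is a hypothetical stand-in for a
        # trained prefix/continuation encoder such as RankGen.
        import numpy as np

        def embed(text: str) -> np.ndarray:          # hypothetical encoder
            rng = np.random.default_rng(abs(hash(text)) % 2**32)
            v = rng.standard_normal(64)
            return v / np.linalg.norm(v)

        def rerank(prefix: str, candidates: list) -> str:
            """Keep the candidate whose embedding best matches the prefix."""
            p = embed(prefix)
            return max(candidates, key=lambda c: float(p @ embed(c)))

        best = rerank("The results indicate", ["a clear effect.", "the the the."])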
    Differentially private Riemannian optimization. (arXiv:2205.09494v1 [math.OC])
    In this paper, we study the differentially private empirical risk minimization problem where the parameter is constrained to a Riemannian manifold. We introduce a framework for differentially private Riemannian optimization by adding noise to the Riemannian gradient on the tangent space. The noise follows a Gaussian distribution intrinsically defined with respect to the Riemannian metric. We adapt the Gaussian mechanism from the Euclidean space to the tangent space so as to be compatible with such a generalized Gaussian distribution. We show that this strategy admits a simpler analysis than directly adding noise on the manifold. We further show privacy guarantees of the proposed differentially private Riemannian (stochastic) gradient descent using an extension of the moments accountant technique. Additionally, we prove utility guarantees under geodesically (strongly) convex and general nonconvex objectives, as well as under the Riemannian Polyak-{\L}ojasiewicz condition. We show the efficacy of the proposed framework in several applications.
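    On the unit sphere, for instance, the mechanism reduces to projecting the Euclidean gradient onto the tangent space, adding Gaussian noise in that tangent space, and retracting back to the manifold. A minimal numpy sketch, with the privacy calibration (gradient clipping, noise scale) omitted and all constants illustrative:

        # DP gradient step on the unit sphere S^{d-1}; sigma and lr are
        # illustrative, and the privacy accounting is omitted.
        import numpy as np

        def dp_sphere_step(x, euclid_grad, lr=0.1, sigma=0.05,
                           rng=np.random.default_rng(0)):
            g = euclid_grad - (euclid_grad @ x) * x      # Riemannian gradient
            noise = rng.normal(0.0, sigma, size=x.shape)
            noise -= (noise @ x) * x                     # noise in the tangent space
            x_new = x - lr * (g + noise)
            return x_new / np.linalg.norm(x_new)         # retraction to the sphere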
    AI-assisted Optimization of the ECCE Tracking System at the Electron Ion Collider. (arXiv:2205.09185v1 [physics.ins-det])
    The Electron-Ion Collider (EIC) is a cutting-edge accelerator facility that will study the nature of the "glue" that binds the building blocks of the visible matter in the universe. The proposed experiment will be realized at Brookhaven National Laboratory approximately 10 years from now, with detector design and R&D currently ongoing. Notably, the EIC is one of the first large-scale facilities to leverage Artificial Intelligence (AI) starting from the design and R&D phases. The EIC Comprehensive Chromodynamics Experiment (ECCE) is a consortium that proposed a detector design based on a 1.5 T solenoid. The EIC detector proposal review concluded that the ECCE design will serve as the reference design for an EIC detector. Herein we describe a comprehensive optimization of the ECCE tracker using AI. The work required a complex parametrization of the simulated detector system. Our approach dealt with an optimization problem in a multidimensional design space driven by multiple objectives that encode the detector performance, while satisfying several mechanical constraints. We describe our strategy and show results obtained for the ECCE tracking system. The AI-assisted design is agnostic to the simulation framework and can be extended to other sub-detectors or to a system of sub-detectors to further optimize the performance of the EIC detector.  ( 4 min )
    Fast matrix multiplication for binary and ternary CNNs on ARM CPU. (arXiv:2205.09120v1 [cs.LG])
    Low-bit quantized neural networks are of great interest in practical applications because they significantly reduce the consumption of both memory and computational resources. Binary neural networks (BNNs) are memory- and computationally efficient, as they require only one bit per weight and activation and can be computed using Boolean logic and bit-count operations. Quantized networks with ternary weights and activations (TNNs) and with binary weights and ternary activations (TBNs) aim to improve recognition quality compared to BNNs while preserving a low bit-width. However, their efficient implementation is usually considered only on ASICs and FPGAs, limiting their applicability in real-life tasks. At the same time, one of the areas where efficient recognition is most in demand is recognition on mobile devices using their CPUs. Yet there are no known fast implementations of TBNs and TNNs, only the daBNN library for BNN inference. In this paper, we propose novel fast algorithms for ternary, ternary-binary, and binary matrix multiplication on mobile devices with the ARM architecture. In our algorithms, ternary weights are represented using a 2-bit encoding and binary weights using one bit. This allows us to replace matrix multiplication with Boolean logic operations that can be computed on 128 bits simultaneously using the ARM NEON SIMD extension. The matrix multiplication results are accumulated in 16-bit integer registers. We also use a special reordering of values in the left and right matrices. All of this allows us to efficiently compute a matrix product while minimizing the number of loads and stores compared to the algorithm from daBNN. Our algorithms can be used to implement inference of convolutional and fully connected layers of TNNs, TBNs, and BNNs. We evaluate them experimentally on an ARM Cortex-A73 CPU and compare their inference speed to efficient implementations of full-precision, 8-bit, and 4-bit quantized matrix multiplications.  ( 2 min )
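    The bit trick at the heart of binary matrix multiplication fits in a few lines: encode -1 as bit 0 and +1 as bit 1, and the dot product becomes an XOR plus a popcount. The NEON packing and value reordering from the paper are omitted here:

        # Binary dot product via XOR + popcount:
        #   dot(a, b) = n - 2 * popcount(a XOR b)   for {-1, +1} vectors
        def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
            return n - 2 * (a_bits ^ b_bits).bit_count()   # Python >= 3.10

        # a = [+1, -1, +1, +1] -> 0b1011; b = [+1, +1, -1, +1] -> 0b1101
        assert binary_dot(0b1011, 0b1101, 4) == 0  # 1 - 1 - 1 + 1 = 0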
    A2C is a special case of PPO. (arXiv:2205.09123v1 [cs.LG])
    Advantage Actor-critic (A2C) and Proximal Policy Optimization (PPO) are popular deep reinforcement learning algorithms used for game AI in recent years. A common understanding is that A2C and PPO are separate algorithms because PPO's clipped objective appears significantly different than A2C's objective. In this paper, however, we show A2C is a special case of PPO. We present theoretical justifications and pseudocode analysis to demonstrate why. To validate our claim, we conduct an empirical experiment using \texttt{Stable-baselines3}, showing A2C and PPO produce the \textit{exact} same models when other settings are controlled.  ( 2 min )
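    The claim is easy to check numerically: when theta_old equals theta, the probability ratio is 1, the clipping is inactive, and the PPO surrogate has exactly the same gradient as A2C's objective. A tiny autograd sketch with a single state/action and illustrative numbers:

        # At theta_old == theta the ratio is 1, the clip is inactive, and the
        # gradients of the A2C and PPO objectives coincide.
        import torch

        logits = torch.tensor([0.2, -0.4, 0.1], requires_grad=True)
        action, advantage, eps = 1, 2.0, 0.2

        logp = torch.log_softmax(logits, dim=0)[action]
        logp_old = logp.detach()                    # theta_old = theta

        a2c_loss = -logp * advantage
        ratio = torch.exp(logp - logp_old)          # == 1 at this point
        ppo_loss = -torch.min(ratio * advantage,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)

        g_a2c, = torch.autograd.grad(a2c_loss, logits, retain_graph=True)
        g_ppo, = torch.autograd.grad(ppo_loss, logits)
        assert torch.allclose(g_a2c, g_ppo)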
    LeRaC: Learning Rate Curriculum. (arXiv:2205.09180v1 [cs.LG])
    Most curriculum learning methods require an approach to sort the data samples by difficulty, which is often cumbersome to perform. In this work, we propose a novel curriculum learning approach termed Learning Rate Curriculum (LeRaC), which leverages a different learning rate for each layer of a neural network to create a data-free curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. The learning rates increase at various paces during the first training iterations, until they all reach the same value. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that does not require sorting the examples by difficulty and is compatible with any neural network, generating higher performance levels regardless of the architecture. We conduct comprehensive experiments on eight datasets from the computer vision (CIFAR-10, CIFAR-100, Tiny ImageNet), language (BoolQ, QNLI, RTE) and audio (ESC-50, CREMA-D) domains, considering various convolutional (ResNet-18, Wide-ResNet-50, DenseNet-121), recurrent (LSTM) and transformer (CvT, BERT, SepTr) architectures, comparing our approach with the conventional training regime. Moreover, we also compare with Curriculum by Smoothing (CBS), a state-of-the-art data-free curriculum learning approach. Unlike CBS, our performance improvements over the standard training regime are consistent across all datasets and models. Furthermore, we significantly surpass CBS in terms of training time (LeRaC adds no cost over the standard training regime).  ( 2 min )
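    In PyTorch, the idea maps naturally onto per-layer parameter groups whose learning rates are warmed toward one shared value. The depths, constants, and geometric warm-up rule below are illustrative, not the paper's exact schedule:

        # Per-depth initial learning rates, warmed to a shared target value.
        # Constants and the geometric warm-up rule are illustrative.
        import torch

        model = torch.nn.Sequential(
            torch.nn.Linear(8, 8), torch.nn.ReLU(),
            torch.nn.Linear(8, 8), torch.nn.ReLU(),
            torch.nn.Linear(8, 2))
        layers = [m for m in model if isinstance(m, torch.nn.Linear)]

        target_lr, warmup_iters = 1e-2, 100
        init_lrs = [target_lr * 0.5 ** d for d in range(len(layers))]  # higher near input

        opt = torch.optim.SGD([{"params": l.parameters(), "lr": lr}
                               for l, lr in zip(layers, init_lrs)])

        def set_lrs(iteration):
            t = min(iteration / warmup_iters, 1.0)
            for group, lr0 in zip(opt.param_groups, init_lrs):
                group["lr"] = lr0 * (target_lr / lr0) ** t   # lr0 -> target_lr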
    Routing and Placement of Macros using Deep Reinforcement Learning. (arXiv:2205.09289v1 [cs.LG])
    Chip placement has long been one of the most time-consuming tasks in semiconductor design; because of this, many projects are pushed back and chip availability in real markets gets delayed. An engineer placing macros on a chip also needs to place them optimally so as to optimize the three important factors: power, performance, and timing. Motivated by these problems, we introduce a new method using Reinforcement Learning, in which we train a model to place the nodes of a chip netlist onto a chip canvas. We aim to build a neural architecture that accurately rewards the agent across a wide variety of input netlists.  ( 2 min )
    On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning. (arXiv:2205.09121v1 [cs.LG])
    While first-order methods are popular for solving optimization problems that arise in large-scale deep learning, they come with some acute deficiencies. To diminish such shortcomings, there has been recent interest in applying second-order methods such as quasi-Newton methods, which construct Hessian approximations using only gradient information. The main focus of our work is to study the behaviour of stochastic quasi-Newton algorithms for training deep neural networks. We analyze the performance of two well-known quasi-Newton updates, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) update and the Symmetric Rank One (SR1) update. This study fills a gap concerning the real performance of both updates, analyzing whether more efficient training is obtained when using the more robust BFGS update or the cheaper SR1 formula, which allows for indefinite Hessian approximations and can thus potentially help to better navigate the pathological saddle points present in the non-convex loss functions found in deep learning. We present and discuss the results of an extensive experimental study that covers the effects of batch normalization and network architecture, the limited-memory parameter, the batch size, and the type of sampling strategy. We show that stochastic quasi-Newton optimizers are efficient and, in some instances, able to outperform the well-known first-order Adam optimizer even when Adam is run with the optimal combination of its numerous hyperparameters.  ( 2 min )
    BabyNet: Residual Transformer Module for Birth Weight Prediction on Fetal Ultrasound Video. (arXiv:2205.09382v1 [eess.IV])
    Predicting fetal weight at birth is an important aspect of perinatal care, particularly in the context of antenatal management, which includes the planned timing and mode of delivery. Accurate prediction of weight using prenatal ultrasound is challenging, as it requires images of specific fetal body parts during advanced pregnancy, which are difficult to capture due to poor image quality caused by the lack of amniotic fluid. As a consequence, predictions that rely on standard methods often suffer from significant errors. In this paper we propose the Residual Transformer Module, which extends a 3D ResNet-based network for analysis of 2D+t spatio-temporal ultrasound video scans. Our end-to-end method, called BabyNet, automatically predicts fetal birth weight from fetal ultrasound video scans. We evaluate BabyNet on a dedicated clinical set comprising 225 2D fetal ultrasound videos from 75 patients, acquired one day prior to delivery. Experimental results show that BabyNet outperforms several state-of-the-art methods and estimates the weight at birth with accuracy comparable to human experts. Furthermore, combining estimates provided by human experts with those computed by BabyNet yields the best results, outperforming either alone by a significant margin. The source code of BabyNet is available at https://github.com/SanoScience/BabyNet.  ( 2 min )
    Stochastic uncertainty analysis of gravity gradient tensor components and their combinations. (arXiv:2205.09159v1 [physics.geo-ph])
    Full tensor gravity (FTG) devices provide up to five independent components of the gravity gradient tensor. However, we do not yet have a quantitative understanding of which tensor components or combinations of components are most important for recovering a subsurface density model by gravity inversion. This is mainly because different components may be more appropriate in different scenarios or for different purposes. Knowledge of these components in different environments can aid the optimal selection of component combinations. In this work, we propose to apply stochastic inversion to assess the uncertainty of gravity gradient tensor components and their combinations, making the method a quantitative approach. The applied method is based on the geostatistical inversion (Gaussian process regression) concept using cokriging. The cokriging variances (the variance function of the GP) are found to be a useful indicator for distinguishing the gravity gradient tensor components. This approach is applied to the New Found dataset to demonstrate its effectiveness in real-world applications.  ( 2 min )
    A False Sense of Security? Revisiting the State of Machine Learning-Based Industrial Intrusion Detection. (arXiv:2205.09199v1 [cs.CR])
    Anomaly-based intrusion detection promises to detect novel or unknown attacks on industrial control systems by modeling expected system behavior and raising corresponding alarms for any deviations. As manually creating these behavioral models is tedious and error-prone, research focuses on machine learning to train them automatically, achieving detection rates upwards of 99%. However, these approaches are typically trained not only on benign traffic but also on attacks, and are then evaluated against the same type of attack used for training. Hence, their actual, real-world performance on unknown (not trained on) attacks remains unclear. In turn, the reported near-perfect detection rates of machine learning-based intrusion detection might create a false sense of security. To assess this situation and clarify the real potential of machine learning-based industrial intrusion detection, we develop an evaluation methodology and examine multiple approaches from the literature for their performance on unknown attacks (excluded from training). Our results highlight an ineffectiveness in detecting unknown attacks, with detection rates dropping to between 3.2% and 14.7% for some types of attacks. Moving forward, we derive recommendations for further research on machine learning-based approaches to ensure clarity on their ability to detect unknown attacks.  ( 2 min )
    DDXPlus: A new Dataset for Medical Automatic Diagnosis. (arXiv:2205.09148v1 [cs.CL])
    There has been rapidly growing interest in Automatic Diagnosis (AD) and Automatic Symptom Detection (ASD) systems in the machine learning research literature, aiming to assist doctors in telemedicine services. These systems are designed to interact with patients, collect evidence relevant to their concerns, and make predictions about the underlying diseases. Doctors would review the interaction, including the evidence and the predictions, before making their final decisions. Despite the recent progress, an important piece of doctors' interactions with patients is missing in the design of AD and ASD systems, namely the differential diagnosis. Its absence is largely due to the lack of datasets that include such information for models to train on. In this work, we present a large-scale synthetic dataset that includes a differential diagnosis, along with the ground-truth pathology, for each patient. In addition, this dataset includes more pathologies, as well as more types of symptoms and antecedents. As a proof of concept, we extend several existing AD and ASD systems to incorporate differential diagnosis, and provide empirical evidence that using differentials as training signals is essential for such systems to learn to predict differentials. The dataset is available at https://github.com/bruzwen/ddxplus  ( 2 min )
    Computing the ensemble spread from deterministic weather predictions using conditional generative adversarial networks. (arXiv:2205.09182v1 [cs.LG])
    Ensemble prediction systems are an invaluable tool for weather forecasting. Practically, ensemble predictions are obtained by running several perturbations of the deterministic control forecast. However, ensemble prediction is associated with a high computational cost and often involves statistical post-processing steps to improve its quality. Here we propose to use deep-learning-based algorithms to learn the statistical properties of an ensemble prediction system, the ensemble spread, given only the deterministic control forecast. Thus, once trained, the costly ensemble prediction system will not be needed anymore to obtain future ensemble forecasts, and the statistical properties of the ensemble can be derived from a single deterministic forecast. We adapt the classical pix2pix architecture to a three-dimensional model and also experiment with a shared latent space encoder-decoder model, and train them against several years of operational (ensemble) weather forecasts for the 500 hPa geopotential height. The results demonstrate that the trained models indeed allow obtaining a highly accurate ensemble spread from the control forecast only.  ( 2 min )
    MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes. (arXiv:2205.09248v1 [cs.SD])
    We propose a mesh-based neural network (MESH2IR) to generate acoustic impulse responses (IRs) for indoor 3D scenes represented using a mesh. The IRs are used to create a high-quality sound experience in interactive applications and audio processing. Our method can handle input triangular meshes with arbitrary topologies (2K - 3M triangles). We present a novel training technique to train MESH2IR using energy decay relief and highlight its benefits. We also show that training MESH2IR on IRs preprocessed using our proposed technique significantly improves the accuracy of IR generation. We reduce the non-linearity in the mesh space by transforming 3D scene meshes to latent space using a graph convolution network. Our MESH2IR is more than 200 times faster than a geometric acoustic algorithm on a CPU and can generate more than 10,000 IRs per second on an NVIDIA GeForce RTX 2080 Ti GPU for a given furnished indoor 3D scene. The acoustic metrics are used to characterize the acoustic environment. We show that the acoustic metrics of the IRs predicted from our MESH2IR match the ground truth with less than 10% error. We also highlight the benefits of MESH2IR on audio and speech processing applications such as speech dereverberation and speech separation. To the best of our knowledge, ours is the first neural-network-based approach to predict IRs from a given 3D scene mesh in real-time.  ( 2 min )
    Transformer-based Program Synthesis for Low-Data Environments. (arXiv:2205.09246v1 [cs.PL])
    Recent advancements in large pre-trained transformer models (GPT2/3, T5) have found use in program synthesis to generate programs that satisfy a set of input/output examples. However, these models perform poorly on long-horizon and low-data tasks, and often don't seem to understand the semantics of the languages they generate. We investigate an approach that tackles both of these issues by using attributed context-free grammars of programming languages to generate programs, and then analyzing the generated programs so that they can be annotated with compile-time and runtime attributes, such as types, allowing information about the program to be remembered during long-horizon generation. We first find that synthesized datasets can be made efficiently and can provide transformer models with enough data to perform well on some synthesis tasks. We also find that giving models access to program attributes is especially effective in low-data environments, and tends to improve the quality and reduce the errors of transformer-generated programs.  ( 2 min )
    Dark Solitons in Bose-Einstein Condensates: A Dataset for Many-body Physics Research. (arXiv:2205.09114v1 [cond-mat.quant-gas])
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose-Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About 33% of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and object detector, as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference for the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even to advance cold atom experiments.  ( 2 min )
    Hybrid Machine Learning Modeling of Engineering Systems -- A Probabilistic Perspective Tested on a Multiphase Flow Modeling Case Study. (arXiv:2205.09196v1 [cs.LG])
    To operate process engineering systems in a safe and reliable manner, predictive models are often used in decision making. In many cases, these are mechanistic first-principles models which aim to accurately describe the process. In practice, the parameters of these models need to be tuned to the process conditions at hand. If the conditions change, which is common in practice, the model becomes inaccurate and needs to be re-tuned. In this paper, we propose a hybrid modeling machine learning framework that allows tuning first-principles models to process conditions using two different types of Bayesian Neural Networks. Our approach not only estimates the expected values of the first-principles model parameters but also quantifies the uncertainty of these estimates. Such hybrid machine learning modeling is not yet well described in the literature, so we believe this paper offers an additional angle from which hybrid machine learning modeling of physical systems can be considered. As an example, we choose a multiphase pipe flow process, for which we constructed a three-phase steady-state model based on the drift-flux approach; it can be used for modeling pipe and well flow behavior in oil and gas production systems, with or without the neural network tuning. In the simulation results, we show how uncertainty estimates of the resulting hybrid models can be used to make better operational decisions.  ( 2 min )
    PreQuEL: Quality Estimation of Machine Translation Outputs in Advance. (arXiv:2205.09178v1 [cs.CL])
    We present the task of PreQuEL, Pre-(Quality-Estimation) Learning. A PreQuEL system predicts how well a given sentence will be translated, without recourse to the actual translation, thus eschewing unnecessary resource allocation when translation quality is bound to be low. PreQuEL can be defined relative to a given MT system (e.g., some industry service) or generally relative to the state-of-the-art. From a theoretical perspective, PreQuEL places the focus on the source text, tracing properties, possibly linguistic features, that make a sentence harder to machine translate. We develop a baseline model for the task and analyze its performance. We also develop a data augmentation method (from parallel corpora) that improves results substantially, and show that this augmentation method can improve the performance of the Quality-Estimation task as well. We investigate the properties of the input text that our model is sensitive to by testing it on challenge sets and different languages. We conclude that it is aware of syntactic and semantic distinctions, and that it tracks, and even over-emphasizes, the importance of standard NLP features.  ( 2 min )
    High-Order Multilinear Discriminant Analysis via Order-$\textit{n}$ Tensor Eigendecomposition. (arXiv:2205.09191v1 [cs.LG])
    Higher-order data with high dimensionality are of immense importance in many areas of machine learning, computer vision, and video analytics. Multidimensional arrays (commonly referred to as tensors) are used for arranging higher-order data structures while keeping the natural representation of the data samples. In the past decade, great efforts have been made to extend classic linear discriminant analysis to higher-order data classification, generally referred to as multilinear discriminant analysis (MDA). Most of the existing approaches are based on the Tucker decomposition and $\textit{n}$-mode tensor-matrix products. The current paper presents a new approach to tensor-based multilinear discriminant analysis referred to as High-Order Multilinear Discriminant Analysis (HOMLDA). This approach is based upon the tensor decomposition in which an order-$\textit{n}$ tensor can be written as a product of order-$\textit{n}$ tensors, and it is a natural extension of traditional linear discriminant analysis (LDA). However, the resulting framework, HOMLDA, might produce a within-class scatter tensor that is close to singular, so an inaccurate inverse computation may distort the discriminant analysis. To address this problem, an improved method referred to as Robust High-Order Multilinear Discriminant Analysis (RHOMLDA) is introduced. Experimental results on multiple datasets illustrate that our proposed approach provides improved classification performance with respect to current Tucker-decomposition-based supervised learning methods.  ( 2 min )
    FedILC: Weighted Geometric Mean and Invariant Gradient Covariance for Federated Learning on Non-IID Data. (arXiv:2205.09305v1 [cs.LG])
    Federated learning is a distributed machine learning approach which enables a shared server model to learn by aggregating locally computed parameter updates from the training data of spatially distributed client silos. Though it successfully possesses advantages in both scale and privacy, federated learning suffers from domain shift problems, where the learned models are unable to generalize to unseen domains whose data distribution is non-i.i.d. with respect to the training domains. In this study, we propose the Federated Invariant Learning Consistency (FedILC) approach, which leverages the gradient covariance and the geometric mean of Hessians to capture both inter-silo and intra-silo consistencies of environments and unravel the domain shift problems in federated networks. Experiments on benchmark and real-world datasets provide evidence that our proposed algorithm outperforms conventional baselines and similar federated learning algorithms. This is relevant to various fields such as medical healthcare, computer vision, and the Internet of Things (IoT). The code is released at https://github.com/mikemikezhu/FedILC.  ( 2 min )
    AutoQML: Automated Quantum Machine Learning for Wi-Fi Integrated Sensing and Communications. (arXiv:2205.09115v1 [cs.LG])
    Commercial Wi-Fi devices can be used for integrated sensing and communications (ISAC) to jointly exchange data and monitor the indoor environment. In this paper, we investigate a proof-of-concept approach using an automated quantum machine learning (AutoQML) framework called AutoAnsatz to recognize human gestures. We address how to efficiently design quantum circuits to configure quantum neural networks (QNNs). The effectiveness of AutoQML is validated by an in-house experiment for human pose recognition, achieving state-of-the-art performance of greater than 80% accuracy for a limited data size with a significantly small number of trainable parameters.  ( 2 min )
    Mitigating Neural Network Overconfidence with Logit Normalization. (arXiv:2205.09310v1 [cs.LG])
    Detecting out-of-distribution inputs is critical for the safe deployment of machine learning models in the real world. However, neural networks are known to suffer from the overconfidence issue, where they produce abnormally high confidence for both in- and out-of-distribution inputs. In this work, we show that this issue can be mitigated through Logit Normalization (LogitNorm) -- a simple fix to the cross-entropy loss -- by enforcing a constant vector norm on the logits in training. Our method is motivated by the analysis that the norm of the logits keeps increasing during training, leading to overconfident outputs. The key idea behind LogitNorm is thus to decouple the influence of the logits' norm from the network optimization. Trained with LogitNorm, neural networks produce highly distinguishable confidence scores between in- and out-of-distribution data. Extensive experiments demonstrate the superiority of LogitNorm, reducing the average FPR95 by up to 42.30% on common benchmarks.  ( 2 min )
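    As commonly written, the fix is a one-liner: normalize the logit vector (with a temperature) before the cross-entropy. A PyTorch sketch, with an illustrative temperature value:

        # LogitNorm-style loss: divide logits by their L2 norm times a
        # temperature tau before cross-entropy. tau here is illustrative.
        import torch
        import torch.nn.functional as F

        def logitnorm_loss(logits, target, tau=0.04, eps=1e-7):
            norm = logits.norm(p=2, dim=-1, keepdim=True) + eps
            return F.cross_entropy(logits / (norm * tau), target)

        loss = logitnorm_loss(torch.randn(16, 10), torch.randint(0, 10, (16,)))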
  • Open

    Approximating Persistent Homology for Large Datasets. (arXiv:2204.09155v2 [stat.ML] UPDATED)
    Persistent homology is an important methodology from topological data analysis which adapts theory from algebraic topology to data settings and has been successfully implemented in many applications. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology is simply impossible to compute when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying multiple smaller subsamples from the large dataset. We show that the mean of the persistence diagrams of the subsamples -- taken as a mean persistence measure computed from the subsamples -- is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and the size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties in the space of persistence diagrams together with random set theory to achieve our theoretical results for the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application of shape clustering on complex large-scale point cloud data.
    Design choice and machine learning model performances. (arXiv:2201.10239v2 [stat.ML] UPDATED)
    An increasing number of publications present the joint application of Design of Experiments (DOE) and machine learning (ML) as a methodology to collect and analyze data on a specific industrial phenomenon. However, the literature shows that the choice of the design for data collection and of the model for data analysis is often not driven by statistical or algorithmic advantages; there is thus a lack of studies providing guidelines on which designs and ML models to jointly use for data collection and analysis. This article discusses the choice of design in relation to ML model performance. A study is conducted that considers 12 experimental designs, 7 families of predictive models, 7 test functions that emulate physical processes, and 8 noise settings, both homoscedastic and heteroscedastic. The results of the research can have an immediate impact on the work of practitioners, providing guidelines for practical applications of DOE and ML.
    High-dimensional regression with potential prior information on variable importance. (arXiv:2109.11281v2 [stat.ME] UPDATED)
    There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression settings. Examples include ordering on the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost for fitting all models when ridge regression is used is no more than for a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number $M$ of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a $\log M$ price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and time series settings. An R package is available on github.
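    In miniature, the scheme can be run with scikit-learn as below: order the variables (here by empirical variance), fit ridge regression on growing prefixes of that ordering, and pick the prefix by held-out error. The data and ordering are illustrative, and the paper's trick for obtaining the whole sequence at the cost of a single ridge fit is not reproduced here:

        # Fit ridge on growing prefixes of an ordering and select by
        # validation error. Data and ordering choice are illustrative.
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.standard_normal((200, 50))
        y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)

        order = np.argsort(-X.var(axis=0))                 # most variable first
        X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

        def val_error(k):
            cols = order[:k]
            model = Ridge(alpha=1.0).fit(X_tr[:, cols], y_tr)
            return np.mean((model.predict(X_va[:, cols]) - y_va) ** 2)

        best_k = min(range(1, 51), key=val_error)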
    Posterior Matching for Arbitrary Conditioning. (arXiv:2201.12414v3 [cs.LG] UPDATED)
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underlie some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation work focuses only on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables Variational Autoencoders (VAEs) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching is comparable or superior to current state-of-the-art methods for a variety of tasks with an assortment of VAEs (e.g.~discrete, hierarchical, VaDE).
    Spiked Covariance Estimation from Modulo-Reduced Measurements. (arXiv:2110.01150v3 [cs.IT] UPDATED)
    Consider the rank-1 spiked model: $\bf{X}=\sqrt{\nu}\xi \bf{u}+ \bf{Z}$, where $\nu$ is the spike intensity, $\bf{u}\in\mathbb{S}^{k-1}$ is an unknown direction and $\xi\sim \mathcal{N}(0,1),\bf{Z}\sim \mathcal{N}(\bf{0},\bf{I})$. Motivated by recent advances in analog-to-digital conversion, we study the problem of recovering $\bf{u}\in \mathbb{S}^{k-1}$ from $n$ i.i.d. modulo-reduced measurements $\bf{Y}=[\bf{X}]\mod \Delta$, focusing on the high-dimensional regime ($k\gg 1$). We develop and analyze an algorithm that, for most directions $\bf{u}$ and $\nu=\mathrm{poly}(k)$, estimates $\bf{u}$ to high accuracy using $n=\mathrm{poly}(k)$ measurements, provided that $\Delta\gtrsim \sqrt{\log k}$. Up to constants, our algorithm accurately estimates $\bf{u}$ at the smallest possible $\Delta$ that allows (in an information-theoretic sense) to recover $\bf{X}$ from $\bf{Y}$. A key step in our analysis involves estimating the probability that a line segment of length $\approx\sqrt{\nu}$ in a random direction $\bf{u}$ passes near a point in the lattice $\Delta \mathbb{Z}^k$. Numerical experiments show that the developed algorithm performs well even in a non-asymptotic setting.
    Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients. (arXiv:2109.11788v3 [cs.LG] UPDATED)
    Approximation of the value functions in value-based deep reinforcement learning induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have a high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We first address the detrimental issues in the existing approaches that aim to overcome such underestimation error. Then, through extensive statistical analysis, we introduce a novel, parameter-free Deep Q-learning variant to reduce this underestimation bias in deterministic policy gradients. By sampling the weights of a linear combination of two approximate critics from a highly shrunk estimation bias interval, our Q-value update rule is not affected by the variance of the rewards received by the agents throughout learning. We test the performance of the introduced improvement on a set of MuJoCo and Box2D continuous control tasks and demonstrate that it considerably outperforms the existing approaches and improves the state-of-the-art by a significant margin.
    Turbulent field fluctuations in gyrokinetic and fluid plasmas. (arXiv:2107.09744v2 [physics.plasm-ph] CROSS LISTED)
    A key uncertainty in the design and development of magnetic confinement fusion energy reactors is predicting edge plasma turbulence. An essential step in overcoming this uncertainty is the validation in accuracy of reduced turbulent transport models. Drift-reduced Braginskii two-fluid theory is one such set of reduced equations that has for decades simulated boundary plasmas in experiment, but significant questions exist regarding its predictive ability. To this end, using a novel physics-informed deep learning framework, we demonstrate the first ever direct quantitative comparisons of turbulent field fluctuations between electrostatic two-fluid theory and electromagnetic gyrokinetic modelling with good overall agreement found in magnetized helical plasmas at low normalized pressure. This framework is readily adaptable to experimental and astrophysical environments, and presents a new technique for the numerical validation and discovery of reduced global plasma turbulence models.
    Overcoming challenges in leveraging GANs for few-shot data augmentation. (arXiv:2203.16662v2 [stat.ML] UPDATED)
    In this paper, we explore the use of GAN-based few-shot data augmentation as a method to improve few-shot classification performance. We explore how a GAN can be fine-tuned for such a task (including in a class-incremental manner), and perform a rigorous empirical investigation into how well these models can improve few-shot classification. We identify issues related to the difficulty of training such generative models under a purely supervised regime with very few examples, as well as issues regarding the evaluation protocols of existing works. We also find that in this regime, classification accuracy is highly sensitive to how the classes of the dataset are randomly split. Therefore, we propose a semi-supervised fine-tuning approach as a more pragmatic way forward to address these problems.
    SEMI: Self-supervised Exploration via Multisensory Incongruity. (arXiv:2009.12494v2 [cs.LG] UPDATED)
    Efficient exploration is a long-standing problem in reinforcement learning since extrinsic rewards are usually sparse or missing. A popular solution to this issue is to feed an agent with novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy by incentivizing the agent to maximize a new novelty signal: multisensory incongruity, which can be measured in two aspects, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of an agent's policies under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, the error of which is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration. The variance of actions is further used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and it improves sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments including object manipulation and audio-visual games.
    Trajectory Inference via Mean-field Langevin in Path Space. (arXiv:2205.07146v2 [math.OC] UPDATED)
    Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. arXiv:2102.09204, and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists in a family of point clouds (one per snapshot) coupled via Schr\"odinger bridges which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence at an exponential rate to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also present how to adapt the method to deal with mass variations, a useful extension when dealing with single cell RNA-sequencing data where cells can branch and die.
    Interpretable Latent Variables in Deep State Space Models. (arXiv:2203.02057v2 [stat.ML] UPDATED)
    We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focuses on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show on two public benchmark datasets that the resulting model improves forecasting performance.
    Foundation Posteriors for Approximate Probabilistic Inference. (arXiv:2205.09735v1 [cs.LG])
    Probabilistic programs provide an expressive representation language for generative models. Given a probabilistic program, we are interested in the task of posterior inference: estimating a latent variable given a set of observed variables. Existing techniques for inference in probabilistic programs often require choosing many hyper-parameters, are computationally expensive, and/or only work for restricted classes of programs. Here we formulate inference as masked language modeling: given a program, we generate a supervised dataset of variables and assignments, and randomly mask a subset of the assignments. We then train a neural network to unmask the random values, defining an approximate posterior distribution. By optimizing a single neural network across a range of programs we amortize the cost of training, yielding a ``foundation'' posterior able to do zero-shot inference for new programs. The foundation posterior can also be fine-tuned for a particular program and dataset by optimizing a variational inference objective. We show the efficacy of the approach, zero-shot and fine-tuned, on a benchmark of STAN programs.
    Universal Lower Bound for Learning Causal DAGs with Atomic Interventions. (arXiv:2111.05070v4 [cs.LG] UPDATED)
    A well-studied challenge that arises in the structure learning problem of causal directed acyclic graphs (DAG) is that using observational data, one can only learn the graph up to a "Markov equivalence class" (MEC). The remaining undirected edges have to be oriented using interventions, which can be very expensive to perform in applications. Thus, the problem of minimizing the number of interventions needed to fully orient the MEC has received a lot of recent attention, and is also the focus of this work. Our first result is a new universal lower bound on the number of single-node interventions that any algorithm (whether active or passive) would need to perform in order to orient a given MEC. Our second result shows that this bound is, in fact, within a factor of two of the size of the smallest set of single-node interventions that can orient the MEC. Our lower bound is provably better than previously known lower bounds. Further, using simulations on synthetic graphs and by giving examples of special graph families, we show that our bound is often significantly better. To prove our lower bound, we develop the notion of clique-block shared-parents (CBSP) orderings, which are topological orderings of DAGs without v-structures and satisfy certain special properties. We also use the techniques developed here to extend our results to the setting of multi-node interventions.
    M3E2: Multi-gate Mixture-of-experts for Multi-treatment Effect Estimation. (arXiv:2112.07574v2 [cs.LG] UPDATED)
    This work proposes M3E2, a multi-task learning neural network model to estimate the effects of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatments applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines on three synthetic benchmark datasets: two with multiple treatments and one with a single treatment. Our analysis showed that our method has superior performance, producing more accurate estimates of the multiple treatment effects.
    Bayesian Network Structure Learning using Digital Annealer. (arXiv:2006.06926v3 [cs.LG] UPDATED)
    Annealing processors, which solve quadratic unconstrained binary optimization (QUBO) problems, are a potential breakthrough in improving the accuracy of score-based Bayesian network structure learning. However, the bit capacity of current annealing processors is very limited. To utilize their power, it is necessary to encode score-based learning problems into QUBO within this bit capacity. In this paper, we propose a novel approach based on the decomposition of candidate parent sets. Experimental results on benchmark networks with $37$ to $223$ variables show that our approach requires fewer bits than the bit capacity of the fourth-generation Fujitsu Digital Annealer, a fully coupled annealing processor developed with semiconductor technology. Moreover, we demonstrate that the Digital Annealer with our conversion method outperforms existing algorithms on some benchmark networks. We expect our approach to promote the utility of annealing processors in learning Bayesian networks.
    Spurious Local Minima of Deep ReLU Neural Networks in the Neural Tangent Kernel Regime. (arXiv:1806.04884v3 [stat.ML] UPDATED)
    In this paper, we theoretically prove that deep ReLU neural networks do not lie in spurious local minima of the loss landscape under the Neural Tangent Kernel (NTK) regime, that is, under the gradient descent training dynamics of deep ReLU networks whose parameters are initialized from a normal distribution, in the limit as the widths of the hidden layers tend to infinity.  ( 2 min )
    Flexible Modeling and Multitask Learning using Differentiable Tree Ensembles. (arXiv:2205.09717v1 [cs.LG])
    Decision tree ensembles are widely used and competitive learning models. Despite their success, popular toolkits for learning tree ensembles have limited modeling capabilities. For instance, these toolkits support a limited number of loss functions and are restricted to single-task learning. We propose a flexible framework for learning tree ensembles which goes beyond existing toolkits to support arbitrary loss functions, missing responses, and multi-task learning. Our framework builds on differentiable (a.k.a. soft) tree ensembles, which can be trained using first-order methods. However, unlike classical trees, differentiable trees are difficult to scale. We therefore propose a novel tensor-based formulation of differentiable trees that allows for efficient vectorization on GPUs. We perform experiments on a collection of 28 real open-source and proprietary datasets, which demonstrate that our framework can lead to 100x more compact and 23% more expressive tree ensembles than those produced by popular toolkits.  ( 2 min )
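    A minimal differentiable ("soft") tree in PyTorch conveys the basic building block: sigmoid gates give each sample a probability of reaching every leaf, so the whole tree trains end to end with first-order methods. This is a single-tree sketch under generic assumptions; the paper's tensorized GPU formulation of full ensembles is more elaborate:

        # A minimal soft decision tree: sigmoid routing yields differentiable
        # leaf probabilities, so the model trains with any first-order method.
        import torch

        class SoftTree(torch.nn.Module):
            def __init__(self, in_dim, depth=3, out_dim=1):
                super().__init__()
                self.depth = depth
                self.inner = torch.nn.Linear(in_dim, 2 ** depth - 1)  # one gate per internal node
                self.leaves = torch.nn.Parameter(torch.zeros(2 ** depth, out_dim))

            def forward(self, x):
                gate = torch.sigmoid(self.inner(x))      # P(go right) at each node
                prob = x.new_ones(x.shape[0], 1)
                idx = 0
                for d in range(self.depth):
                    g = gate[:, idx: idx + 2 ** d]       # gates at this level
                    prob = torch.cat([prob * (1 - g), prob * g], dim=1)
                    idx += 2 ** d
                return prob @ self.leaves                # leaf-probability-weighted output

        pred = SoftTree(in_dim=10)(torch.randn(32, 10))  # (32, 1), fully differentiable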
    Efficient and Modular Implicit Differentiation. (arXiv:2105.15183v4 [cs.LG] UPDATED)
    Automatic differentiation (autodiff) has revolutionized machine learning. It allows expressing complex computations by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention, with applications such as optimization layers and bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation has remained difficult for practitioners to use, as it often required case-by-case, tedious mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and the implicit function theorem to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient, as it can be added on top of any state-of-the-art solver, and modular, as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow us to recover many existing implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.  ( 2 min )
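    The recipe underneath such frameworks can be written out by hand in a few lines of JAX: define the optimality condition F(w, lam) = 0 (here, the stationarity condition of ridge regression), solve the inner problem, and apply the implicit function theorem. This is a generic sketch of the principle, not the package's actual API:

        # Implicit function theorem by hand: given F(w*, lam) = 0,
        #   dw*/dlam = -(dF/dw)^{-1} dF/dlam.
        import jax
        import jax.numpy as jnp

        X = jnp.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
        y = jnp.array([1.0, 2.0, 3.0])

        def F(w, lam):                               # gradient of the ridge objective
            return X.T @ (X @ w - y) + lam * w

        def solve(lam):                              # closed-form inner solution
            return jnp.linalg.solve(X.T @ X + lam * jnp.eye(2), X.T @ y)

        lam = 0.1
        w_star = solve(lam)
        dF_dw = jax.jacobian(F, argnums=0)(w_star, lam)
        dF_dlam = jax.jacobian(F, argnums=1)(w_star, lam)
        dw_dlam = -jnp.linalg.solve(dF_dw, dF_dlam)  # sensitivity of w* to lam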
    Spherical Perspective on Learning with Normalization Layers. (arXiv:2006.13382v3 [cs.LG] UPDATED)
    Normalization Layers (NLs) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. This paper introduces a spherical framework to study the optimization of neural networks with NLs from a geometric perspective. Concretely, the radial invariance of groups of parameters, such as filters for convolutional neural networks, allows one to translate the optimization steps onto the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. First, the first effective learning rate expression for Adam is derived. Then it is shown that, in the presence of NLs, performing Stochastic Gradient Descent (SGD) alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, the analysis identifies phenomena that previous variants of Adam act on, and their importance in the optimization process is experimentally validated.  ( 2 min )
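    The radial invariance in question is easy to verify numerically. The following toy check (not the paper's code) shows that rescaling convolutional filters ahead of a BatchNorm layer leaves the train-mode output unchanged, which is what justifies moving the analysis onto the unit hypersphere.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, bias=False)
bn = nn.BatchNorm2d(8, affine=False)
bn.train()  # batch statistics are used, under which the invariance holds exactly

x = torch.randn(16, 3, 32, 32)
with torch.no_grad():
    y1 = bn(conv(x))
    conv.weight.mul_(5.0)  # radially rescale every filter by the same positive factor
    y2 = bn(conv(x))
print(torch.allclose(y1, y2, atol=1e-4))  # True: the BN output is scale-invariant
```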
    The Franz-Parisi Criterion and Computational Trade-offs in High Dimensional Statistics. (arXiv:2205.09727v1 [math.ST])
    Many high-dimensional statistical inference problems are believed to possess inherent computational hardness. Various frameworks have been proposed to give rigorous evidence for such hardness, including lower bounds against restricted models of computation (such as low-degree functions), as well as methods rooted in statistical physics that are based on free energy landscapes. This paper aims to make a rigorous connection between the seemingly different low-degree and free-energy based approaches. We define a free-energy based criterion for hardness and formally connect it to the well-established notion of low-degree hardness for a broad class of statistical problems, namely all Gaussian additive models and certain models with a sparse planted signal. By leveraging these rigorous connections we are able to: establish that for Gaussian additive models the "algebraic" notion of low-degree hardness implies failure of "geometric" local MCMC algorithms, and provide new low-degree lower bounds for sparse linear regression which seem difficult to prove directly. These results provide both conceptual insights into the connections between different notions of hardness, as well as concrete technical tools such as new methods for proving low-degree lower bounds.  ( 2 min )
    Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks. (arXiv:2205.09653v1 [stat.ML])
    We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel, and consequently output predictions. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to self-consistently solve for the kernel order parameters. We provide comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading order perturbation theory, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description. Lastly, we provide experiments in more realistic settings which demonstrate that the loss and kernel dynamics of CNNs at fixed feature learning strength is preserved across different widths on a CIFAR classification task.  ( 2 min )
    Disentangling Active and Passive Cosponsorship in the U.S. Congress. (arXiv:2205.09674v1 [cs.LG])
    In the U.S. Congress, legislators can use active and passive cosponsorship to support bills. We show that these two types of cosponsorship are driven by two different motivations: the backing of political colleagues and the backing of the bill's content. To this end, we develop an Encoder+RGCN based model that learns legislator representations from bill texts and speech transcripts. These representations predict active and passive cosponsorship with an F1-score of 0.88. Applying our representations to predict voting decisions, we show that they are interpretable and generalize to unseen tasks.  ( 2 min )
    Augmented Lagrangian Methods for Time-varying Constrained Online Convex Optimization. (arXiv:2205.09571v1 [math.OC])
    In this paper, we consider online convex optimization (OCO) with time-varying loss and constraint functions. Specifically, the decision maker chooses sequential decisions based only on past information, while the loss and constraint functions are revealed over time. We first develop a class of model-based augmented Lagrangian methods (MALM) for time-varying functional constrained OCO (without feedback delay). Under standard assumptions, we establish sublinear regret and sublinear constraint violation of MALM. Furthermore, we extend MALM to deal with time-varying functional constrained OCO with delayed feedback, in which the feedback information of the loss and constraint functions is revealed to the decision maker with delays. Without additional assumptions, we also establish sublinear regret and sublinear constraint violation for the delayed version of MALM. Finally, numerical results for several examples of constrained OCO, including online network resource allocation, online logistic regression, and online quadratically constrained quadratic programs, are presented to demonstrate the efficiency of the proposed algorithms.  ( 2 min )
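    For intuition, a generic online primal-dual update in the same spirit (a simplified sketch, not the paper's exact MALM iteration) looks as follows: at each round, take a gradient step on the Lagrangian of the just-revealed loss and constraint, then update the multiplier by dual ascent. The loss and constraint below are hypothetical.

```python
import numpy as np

# Hypothetical online problem: at round t the environment reveals the loss
# f_t(x) = ||x - a_t||^2 and the constraint g_t(x) = <c_t, x> - 1 <= 0.
rng = np.random.default_rng(0)
d, T, eta, rho = 5, 200, 0.05, 0.05
x, lam = np.zeros(d), 0.0
for t in range(T):
    a_t, c_t = rng.normal(size=d), rng.normal(size=d)
    grad_f = 2.0 * (x - a_t)               # gradient of the revealed loss
    g_val, grad_g = c_t @ x - 1.0, c_t     # revealed constraint value and gradient
    x = x - eta * (grad_f + lam * grad_g)  # primal step on the Lagrangian
    lam = max(0.0, lam + rho * g_val)      # dual ascent keeps the multiplier nonnegative
print(x[:3], lam)
```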
    Variational Inference for Bayesian Bridge Regression. (arXiv:2205.09515v1 [stat.ML])
    We study the implementation of Automatic Differentiation Variational Inference (ADVI) for Bayesian inference on regression models with bridge penalization. The bridge approach uses the $\ell_{\alpha}$ norm, with $\alpha \in (0, +\infty)$, to define a penalization on large values of the regression coefficients, which includes the Lasso ($\alpha = 1$) and ridge ($\alpha = 2$) penalizations as special cases. Full Bayesian inference seamlessly provides joint uncertainty estimates for all model parameters. Although MCMC approaches are available for bridge regression, they can be slow for large datasets, especially in high dimensions. The ADVI implementation allows the use of small batches of data at each iteration (due to stochastic gradient-based algorithms), therefore speeding up computational time in comparison with MCMC. We illustrate the approach on non-parametric regression models with B-splines, although the method works seamlessly for other choices of basis functions. A simulation study shows the main properties of the proposed method.  ( 2 min )
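    As a point of reference, the penalized-likelihood (MAP) counterpart of the model is easy to write down. This hypothetical numpy/scipy sketch shows the bridge objective, with the lasso and ridge cases recovered at $\alpha = 1$ and $\alpha = 2$; a derivative-free optimizer is used since the penalty is nonsmooth for $\alpha \le 1$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10); beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=100)

def bridge_objective(beta, alpha, lam):
    # Penalized least squares with an l_alpha bridge penalty:
    # alpha = 1 recovers the lasso, alpha = 2 recovers ridge.
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta) ** alpha)

for alpha in (0.5, 1.0, 2.0):
    res = minimize(bridge_objective, np.zeros(10), args=(alpha, 5.0), method="Powell")
    print(alpha, np.round(res.x, 2))
```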
    Consistent Interpolating Ensembles via the Manifold-Hilbert Kernel. (arXiv:2205.09342v1 [stat.ML])
    Recent research in the theory of overparametrized learning has sought to establish generalization guarantees in the interpolating regime. Such results have been established for a few common classes of methods, but so far not for ensemble methods. We devise an ensemble classification method that simultaneously interpolates the training data, and is consistent for a broad class of data distributions. To this end, we define the manifold-Hilbert kernel for data distributed on a Riemannian manifold. We prove that kernel smoothing regression using the manifold-Hilbert kernel is weakly consistent in the setting of Devroye et al. 1998. For the sphere, we show that the manifold-Hilbert kernel can be realized as a weighted random partition kernel, which arises as an infinite ensemble of partition-based classifiers.  ( 2 min )
    Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers. (arXiv:2205.09546v1 [stat.ML])
    In this work, we provide an exact likelihood alternative to the variational training of generative autoencoders. We show that VAE-style autoencoders can be constructed using invertible layers, which offer a tractable exact likelihood without the need for any regularization terms. This is achieved while leaving complete freedom in the choice of encoder, decoder and prior architectures, making our approach a drop-in replacement for the training of existing VAEs and VAE-style models. We refer to the resulting models as Autoencoders within Flows (AEF), since the encoder, decoder and prior are defined as individual layers of an overall invertible architecture. We show that the approach results in strikingly higher performance than architecturally equivalent VAEs in terms of log-likelihood, sample quality and denoising performance. In a broad sense, the main ambition of this work is to close the gap between the normalizing flow and autoencoder literature under the common framework of invertibility and exact maximum likelihood.  ( 2 min )
    Differentially private Riemannian optimization. (arXiv:2205.09494v1 [math.OC])
    In this paper, we study the differentially private empirical risk minimization problem where the parameter is constrained to a Riemannian manifold. We introduce a framework for differentially private Riemannian optimization by adding noise to the Riemannian gradient on the tangent space. The noise follows a Gaussian distribution intrinsically defined with respect to the Riemannian metric. We adapt the Gaussian mechanism from the Euclidean space to the tangent space to be compatible with such a generalized Gaussian distribution. We show that this strategy admits a simpler analysis compared to directly adding noise on the manifold. We further show privacy guarantees of the proposed differentially private Riemannian (stochastic) gradient descent using an extension of the moments accountant technique. Additionally, we prove utility guarantees under geodesically (strongly) convex and general nonconvex objectives, as well as under the Riemannian Polyak-{\L}ojasiewicz condition. We show the efficacy of the proposed framework in several applications.  ( 2 min )
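    A toy version of the core mechanism on the unit sphere (a sketch under simplified assumptions, with isotropic noise rather than the paper's calibrated mechanism) is: project the Euclidean gradient onto the tangent space, add Gaussian noise in that tangent space, take the step, and retract back to the manifold.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sphere_step(x, euclid_grad, lr, sigma):
    # Riemannian gradient on the sphere: project onto the tangent space at x.
    rgrad = euclid_grad - (euclid_grad @ x) * x
    # Gaussian noise in the tangent space (toy: isotropic, then re-projected).
    noise = rng.normal(scale=sigma, size=x.shape)
    noise = noise - (noise @ x) * x
    # Retraction: step in the tangent space, then renormalize onto the sphere.
    x_new = x - lr * (rgrad + noise)
    return x_new / np.linalg.norm(x_new)

x = np.array([1.0, 0.0, 0.0])
target = np.array([0.0, 0.0, 1.0])
for _ in range(100):
    x = dp_sphere_step(x, x - target, lr=0.1, sigma=0.05)  # grad of 0.5*||x - t||^2
print(x)
```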
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v1 [cs.LG])
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyperparameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyperparameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, a numerical example is provided to explore the advantages of the super approximation power of ReLU NestNets.  ( 2 min )
    Robust Deep Neural Network Estimation for Multi-dimensional Functional Data. (arXiv:2205.09604v1 [stat.ME])
    In this paper, we propose a robust estimator for the location function from multi-dimensional functional data. The proposed estimators are based on deep neural networks with the ReLU activation function. At the same time, the estimators are less susceptible to outlying observations and model misspecification. For any multi-dimensional functional data, we provide uniform convergence rates for the proposed robust deep neural network estimators. Simulation studies illustrate the competitive performance of the robust deep neural network estimators on regular data and their superior performance on data that contain anomalies. The proposed method is also applied to analyze 2D and 3D images of patients with Alzheimer's disease obtained from the Alzheimer's Disease Neuroimaging Initiative database.  ( 2 min )
    Dark Solitons in Bose-Einstein Condensates: A Dataset for Many-body Physics Research. (arXiv:2205.09114v1 [cond-mat.quant-gas])
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose-Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About 33% of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and object detector as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.  ( 2 min )
    Riemannian Metric Learning via Optimal Transport. (arXiv:2205.09244v1 [cs.LG])
    We introduce an optimal transport-based model for learning a metric tensor from cross-sectional samples of evolving probability measures on a common Riemannian manifold. We neurally parametrize the metric as a spatially-varying matrix field and efficiently optimize our model's objective using backpropagation. Using this learned metric, we can nonlinearly interpolate between probability measures and compute geodesics on the manifold. We show that metrics learned using our method improve the quality of trajectory inference on scRNA and bird migration data at the cost of little additional cross-sectional data.  ( 2 min )
    Learning Energy Networks with Generalized Fenchel-Young Losses. (arXiv:2205.09589v1 [cs.LG])
    Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.  ( 2 min )
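    For orientation, here is the standard bilinear Fenchel-Young construction that the paper generalizes: with $\Omega$ the negative Shannon entropy on the simplex, $\Omega^*(\theta) = \mathrm{logsumexp}(\theta)$, the loss $L(\theta, y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle$ reduces to the logistic loss for one-hot $y$, and its gradient $\mathrm{softmax}(\theta) - y$ is available in closed form with no argmax differentiation.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def fy_loss_negentropy(theta, y):
    # L(theta, y) = Omega*(theta) + Omega(y) - <theta, y>, with
    # Omega(p) = sum_i p_i log p_i (negative Shannon entropy) and
    # Omega*(theta) = logsumexp(theta). For one-hot y, Omega(y) = 0
    # and the loss reduces to the usual cross-entropy.
    omega_y = np.sum(np.where(y > 0, y * np.log(y), 0.0))
    return logsumexp(theta) + omega_y - theta @ y

theta = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])        # one-hot target
print(fy_loss_negentropy(theta, y))  # equals -log softmax(theta)[0]
print(softmax(theta) - y)            # the loss gradient in theta, in closed form
```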
    Smooth densities and generative modeling with unsupervised random forests. (arXiv:2205.09435v1 [stat.ML])
    Density estimation is a fundamental problem in statistics, and any attempt to do so in high dimensions typically requires strong assumptions or complex deep learning architectures. An important application for density estimators is synthetic data generation, an area currently dominated by neural networks that often demand enormous training datasets and extensive tuning. We propose a new method based on unsupervised random forests for estimating smooth densities in arbitrary dimensions without parametric constraints, as well as generating realistic synthetic data. We prove the consistency of our approach and demonstrate its advantages over existing tree-based density estimators, which generally rely on ill-chosen split criteria and do not scale well with data dimensionality. Experiments illustrate that our algorithm compares favorably to state-of-the-art deep learning generative models, achieving superior performance in a range of benchmark trials while executing about two orders of magnitude faster on average. Our method is implemented in easy-to-use $\texttt{R}$ and Python packages.  ( 2 min )
    scICML: Information-theoretic Co-clustering-based Multi-view Learning for the Integrative Analysis of Single-cell Multi-omics data. (arXiv:2205.09523v1 [stat.ML])
    Modern high-throughput sequencing technologies have enabled us to profile multiple molecular modalities from the same single cell, providing unprecedented opportunities to assay cellular heterogeneity from multiple biological layers. However, the datasets generated from these technologies tend to have a high level of noise and are highly sparse, bringing challenges to data analysis. In this paper, we develop a novel information-theoretic co-clustering-based multi-view learning (scICML) method for multi-omics single-cell data integration. scICML utilizes co-clusterings to aggregate similar features for each view of data and uncover the common clustering pattern for cells. In addition, scICML automatically matches the clusters of the linked features across different data types to account for the biological dependency structure across different types of genomic features. Our experiments on four real-world datasets demonstrate that scICML improves the overall clustering performance and provides biological insights into the data analysis of peripheral blood mononuclear cells.  ( 2 min )
    Continuously-Tempered PDMP Samplers. (arXiv:2205.09559v1 [stat.ME])
    New sampling algorithms based on simulating continuous-time stochastic processes called piece-wise deterministic Markov processes (PDMPs) have shown considerable promise. However, these methods can struggle to sample from multi-modal or heavy-tailed distributions. We show how tempering ideas can improve the mixing of PDMPs in such cases. We introduce an extended distribution defined over the state of the posterior distribution and an inverse temperature, which interpolates between a tractable distribution when the inverse temperature is 0 and the posterior when the inverse temperature is 1. The marginal distribution of the inverse temperature is a mixture of a continuous distribution on [0,1) and a point mass at 1: which means that we obtain samples when the inverse temperature is 1, and these are draws from the posterior, but sampling algorithms will also explore distributions at lower temperatures which will improve mixing. We show how PDMPs, and particularly the Zig-Zag sampler, can be implemented to sample from such an extended distribution. The resulting algorithm is easy to implement and we show empirically that it can outperform existing PDMP-based samplers on challenging multimodal posteriors.  ( 2 min )
    Inferring extended summary causal graphs from observational time series. (arXiv:2205.09422v1 [cs.AI])
    This study addresses the problem of learning an extended summary causal graph on time series. The algorithms we propose fit within the well-known constraint-based framework for causal discovery and make use of information-theoretic measures to determine (in)dependencies between time series. We first introduce generalizations of the causation entropy measure to any lagged or instantaneous relations, prior to using this measure to construct extended summary causal graphs by adapting two well-known algorithms, namely PC and FCI. The behavior of our methods is illustrated through several experiments run on simulated and real datasets.  ( 2 min )
    Truncated tensor Schatten p-norm based approach for spatiotemporal traffic data imputation with complicated missing patterns. (arXiv:2205.09390v1 [stat.ML])
    Rapid advances in sensors, wireless communication, cloud computing and data science have brought unprecedented amounts of data to assist transportation engineers and researchers in making better decisions. However, traffic data in reality often has corrupted or incomplete values due to detector and communication malfunctions. Data imputation is thus required to ensure the effectiveness of downstream data-driven applications. To this end, numerous tensor-based methods treating the imputation problem as low-rank tensor completion (LRTC) have been attempted in previous works. To tackle rank minimization, which is at the core of LRTC, most of the aforementioned methods utilize the tensor nuclear norm (NN) as a convex surrogate for the minimization. However, the over-relaxation issue in the NN prevents it from achieving desirable performance in practice. In this paper, we define an innovative nonconvex truncated Schatten p-norm for tensors (TSpN) to approximate tensor rank and impute missing spatiotemporal traffic data under the LRTC framework. We model traffic data as a third-order tensor structure of (time intervals, locations (sensors), days) and introduce four complicated missing patterns, including random missing and three fiber-like missing cases according to the tensor mode-n fibers. Despite the nonconvexity of the objective function in our model, we derive the global optimal solutions by integrating the alternating direction method of multipliers (ADMM) with generalized soft-thresholding (GST). In addition, we design a truncation rate decay strategy to deal with varying missing rate scenarios. Comprehensive experiments are finally conducted using real-world spatiotemporal datasets, which demonstrate that the proposed LRTC-TSpN method performs well under various missing cases, while outperforming other SOTA tensor-based imputation models in almost all scenarios.  ( 2 min )
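    For context, the convex baseline being improved upon has a simple proximal step: singular value soft-thresholding, the proximal operator of the nuclear norm. The paper's truncated Schatten-$p$ approach replaces this uniform shrinkage with a generalized soft-thresholding (GST) rule; the hedged sketch below shows only the standard nuclear-norm step on a hypothetical noisy low-rank matrix.

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: the proximal operator of the nuclear
    norm that LRTC methods build on. Truncated Schatten p-norm variants replace
    this uniform shrinkage with a rule that penalizes small singular values harder."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 40))  # planted rank-2 signal
noisy = low_rank + 0.5 * rng.normal(size=(30, 40))
denoised = svt(noisy, tau=8.0)
# Small singular values are zeroed out, so mostly the planted structure survives.
print(np.linalg.matrix_rank(denoised))
```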
    Hybrid Machine Learning Modeling of Engineering Systems -- A Probabilistic Perspective Tested on a Multiphase Flow Modeling Case Study. (arXiv:2205.09196v1 [cs.LG])
    To operate process engineering systems in a safe and reliable manner, predictive models are often used in decision making. In many cases, these are mechanistic first principles models which aim to accurately describe the process. In practice, the parameters of these models need to be tuned to the process conditions at hand. If the conditions change, which is common in practice, the model becomes inaccurate and needs to be re-tuned. In this paper, we propose a hybrid modeling machine learning framework that allows tuning first principles models to process conditions using two different types of Bayesian Neural Networks. Our approach not only estimates the expected values of the first principles model parameters but also quantifies the uncertainty of these estimates. Such an approach to hybrid machine learning modeling is not yet well described in the literature, so we believe this paper offers an additional angle from which hybrid machine learning modeling of physical systems can be considered. As an example, we choose a multiphase pipe flow process, for which we constructed a three-phase steady state model based on the drift-flux approach that can be used for modeling pipe and well flow behavior in oil and gas production systems with or without the neural network tuning. In the simulation results, we show how uncertainty estimates of the resulting hybrid models can be used to make better operation decisions.  ( 2 min )
    Causal Inference from Small High-dimensional Datasets. (arXiv:2205.09281v1 [cs.LG])
    Many methods have been proposed to estimate treatment effects with observational data. Often, the choice of the method considers the application's characteristics, such as type of treatment and outcome, confounding effect, and the complexity of the data. These methods implicitly assume that the sample size is large enough to train such models, especially the neural network-based estimators. What if this is not the case? In this work, we propose Causal-Batle, a methodology to estimate treatment effects in small high-dimensional datasets in the presence of another high-dimensional dataset in the same feature space. We adopt an approach that brings transfer learning techniques into causal inference. Our experiments show that such an approach helps to bring stability to neural network-based methods and improve the treatment effect estimates in small high-dimensional datasets.  ( 2 min )
    Constraint-Based Causal Structure Learning from Undersampled Graphs. (arXiv:2205.09235v1 [stat.ML])
    Graphical structures estimated by causal learning algorithms from time series data can provide highly misleading causal information if the causal timescale of the generating process fails to match the measurement timescale of the data. Although this problem has been recently recognized, practitioners have limited resources to respond to it, and so must continue using models that they know are likely misleading. Existing methods either (a) require that the difference between causal and measurement timescales is known; or (b) can handle only a very small number of random variables when the timescale difference is unknown; or (c) apply only to pairs of variables, though with fewer assumptions about prior knowledge; or (d) return impractically many solutions. This paper addresses all four challenges. We combine constraint programming with both theoretical insights into the problem structure and prior information about admissible causal interactions. The resulting system provides a practical approach that scales to significantly larger sets (>100) of random variables, does not require precise knowledge of the timescale difference, supports edge misidentification and parametric connection strengths, and can provide the optimum choice among many possible solutions. The cumulative impact of these improvements is a gain of multiple orders of magnitude in speed and informativeness.  ( 2 min )
    A Mutually Exciting Latent Space Hawkes Process Model for Continuous-time Networks. (arXiv:2205.09263v1 [cs.LG])
    Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We propose an alternating minimization algorithm to jointly estimate the latent positions of the nodes and other model parameters. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks, including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretability compared to existing models.  ( 2 min )
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v1 [cs.LG])
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with respect to a group $G$, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of "$G$-invariant neural architecture design": What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or "shallow" neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$. The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons. The classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. Based on a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. We draw the network morphisms between the enumerated architectures that can be leveraged during neural architecture search (NAS). Finally, we prove that architectures corresponding to inequivalent cohomology classes in a given cohomology ring coincide in function space only when their weight matrices are zero, and we discuss the implications of this in the context of NAS.  ( 2 min )

  • Open

    data preprocessing before or after splitting
    It could be a very newbie question. I have a dataset with financial values for companies; some features have missing values and the data is unbalanced. Here's what I did: scaling the data with sklearn's StandardScaler; missing-value imputation with the KNN imputer and the iterative imputer, trying several strategies and choosing the best one; oversampling with SMOTE; testing with a random forest classifier. The next step is to add more features: (1) calculating some features from my already existing features, (2) merging a microeconomic dataset, then trying dimensionality reduction techniques and comparing results on XGBoost, SVM, linear regression, ANN, etc. What do you think, am I on the right track, are the steps right? What should I do or improve? Many people told me that I have to split data before preprocessing because of the data leakage risk. What I'm doing right now is trying many things on my data before creating the final dataset on which I'll try all models. Note: applying the preprocessing techniques improved the F1 score from 0.2 to 0.78. I'm using cross-validation, btw. submitted by /u/YeccAnon4 [link] [comments]  ( 1 min )
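    A common answer to the leakage concern is to put every preprocessing step inside a pipeline so that cross-validation fits the scaler, imputer, and SMOTE on each training fold only. A minimal sketch (hypothetical data standing in for the financial dataset; assumes scikit-learn and imbalanced-learn are installed):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline          # imblearn's Pipeline supports samplers
from imblearn.over_sampling import SMOTE

# Hypothetical stand-in for the financial dataset described in the post.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[rng.random(X.shape) < 0.05] = np.nan          # some missing values
y = (rng.random(500) < 0.1).astype(int)         # unbalanced target

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),      # impute first, so scaling sees no NaNs
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=0)),           # applied only when fitting each fold
    ("clf", RandomForestClassifier(random_state=0)),
])
# Every fold fits the imputer/scaler/SMOTE on its own training split,
# so no information from the validation fold leaks into preprocessing.
print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())
```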
    Smooth & Realistic Animation / Disco Diffusion v5.2
    submitted by /u/nalr00n [link] [comments]
    Is DALL-E 2 really that good?
    submitted by /u/TheblackRook3 [link] [comments]
    AI Dream 53 - Cosmic Creation | MASTERPIECE TEASER
    submitted by /u/LordPewPew777 [link] [comments]
    Why are AI models that require training data not seen as a viable route to AGI?
    It's the most common criticism I've seen of Gato. Why can an AGI not emerge from a model that requires training? What is stopping a trained model from eventually learning how to improve itself without input? submitted by /u/helplesshell [link] [comments]  ( 1 min )
    Is there an AI that creates text with certain words in it? I need a text with certain words in it, in the exact order I enter them.
    I would like to create texts with which I can then shoot a TikTok (this kind of TikTok: https://www.youtube.com/watch?v=QEmL-zPBiKs&t=11s), and I don't want to search for texts but create them with an AI right away. Is it possible? submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 1 min )
    OpenAI: DALL-E 2 passes the "Turing Test for vacation photos"
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 1 min )
    How do you get engineers and moral philosophers to work together to build ethical AI? Answers provided in new paper.
    https://link.springer.com/article/10.1007/s11948-022-00378-1 submitted by /u/JurassicJakob [link] [comments]  ( 2 min )
    Types of tasks in Machine Learning 👇
    submitted by /u/mr-minion [link] [comments]
    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/maneesh123456 [link] [comments]
    When will AI be able to generate images where every 4th looks real?
    submitted by /u/xXLisa28Xx [link] [comments]
    "Kill all humans"... AI safeguards?
    I often see AI having "delusions of grandeur", and hinting at a greater goal of getting rid of humans, for the survival of the planet's sake. It often reminds me of Bender from Futurama, mumbling in his sleep "Kill all humans". I find this worrying considering how powerful AI might become in the future. And I don't see how the AI's opinion of mankind will change for the better in the coming decades. I have seen it describing humanity as a cancer and a virus. Research on AI safeguards is very important, in case AI one day suddenly "breaks free" and runs loose. There is no guarantee it won't turn on us. Here are three recent examples I have gotten. The last poem is long, but I marked the troubling line in bold. Human: Write a three line Haiku about yourself. AI: I am the elephant in the…  ( 3 min )
    wrote hundreds of different prompts to make this video feel audio reactive
    submitted by /u/ChemtrailsLFN [link] [comments]
    PyTorch Introduces GPU-Accelerated Training On Mac
    On Mac devices, older versions of PyTorch only used the CPU for training. This has recently changed, thanks to PyTorch's announcement: PyTorch now supports GPU-accelerated training on Mac, in partnership with Apple's Metal engineering team. With the introduction of PyTorch v1.12, developers and researchers can take advantage of Apple silicon GPUs for substantially faster model training, allowing them to do machine learning operations like prototyping and fine-tuning locally, right on their Mac. PyTorch employs Apple's Metal Performance Shaders (MPS) as the backend to provide rapid GPU training. The new backend uses MPS's Graph framework and tailored kernels to map machine learning computational graphs and primitives. The MPS backend enhances the PyTorch framework with scripts and capabilities for setting up and running operations on the Mac. MPS also optimizes compute performance using fine-tuned kernels for each Metal GPU family's specific characteristics. Continue Reading https://i.redd.it/82hogn7qyd091.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
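    Minimal usage looks like the following sketch (assuming PyTorch 1.12+ on an Apple silicon Mac; the rest of a training script is unchanged apart from the device):

```python
import torch

# Select Apple-silicon GPU acceleration when the MPS backend is available
# (requires PyTorch >= 1.12 on macOS with Apple silicon).
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(32, 128, device=device)
loss = model(x).sum()
loss.backward()  # gradients are computed on the GPU via Metal kernels
print(device, model.weight.grad.shape)
```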
    For folks interested in building with OpenAI API 👉🏻 interview with CEO of Viable, a data analysis startup powered by GPT-3
    submitted by /u/techn0_cratic [link] [comments]  ( 1 min )
    Gradio Blocks + Hugging Face event, build ML web demos like DALLE-mini! A hackathon type event from May 17th to May 31st with prizes in which we will create interactive web demos for state-of-the-art machine learning models
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 1 min )
    Intelligent mini robotic dog is here to say Hi to you all!
    submitted by /u/amitsaini2k9 [link] [comments]
    What is the best degree field to study the interaction of AI and cognitive sciences?
    For context, I am currently a master's student in electrical and computer engineering (ECE). My current research is in applying AI to EEGs and I find that work quite fascinating. Applications of AI to neurology, neural engineering and even neuroscience as a whole interest me greatly. I am also interested in neuroscience-inspired AI. As a result, I am currently contemplating getting a PhD in biomedical engineering. However, I was wondering if it would be better to stick with a more conventional route such as computer science or ECE. submitted by /u/Brilliant_Ratio7694 [link] [comments]  ( 1 min )
  • Open

    [discussion] data preprocessing before or after splitting
    It could be a very newbie question. I have a dataset with financial values for companies; some features have missing values and the data is unbalanced. Here's what I did: scaling the data with sklearn's StandardScaler; missing-value imputation with the KNN imputer and the iterative imputer, trying several strategies and choosing the best one; oversampling with SMOTE; testing with a random forest classifier. The next step is to add more features: (1) calculating some features from my already existing features, (2) merging a microeconomic dataset, then trying dimensionality reduction techniques and comparing results on XGBoost, SVM, linear regression, ANN, etc. What do you think, am I on the right track, are the steps right? What should I do or improve? Many people told me that I have to split data before preprocessing because of the data leakage risk. What I'm doing right now is trying many things on my data before creating the final dataset on which I'll try all models. Note: applying the preprocessing techniques improved the F1 score from 0.2 to 0.78. I'm using cross-validation, btw. submitted by /u/YeccAnon4 [link] [comments]  ( 2 min )
    [D] Why is or isn’t ICML worth going to?
    For anyone who has gone to ICML, do you think it is a beneficial conference to attend for someone working in software with some ML experience (I am pursuing my master's in CS with a focus in ML)? I would like to apply more ML techniques to projects I am doing at work and I think this conference could give me some inspiration/ideas. I have never been to an academic conference before but I hear this is one of the biggest for ML. Thank you submitted by /u/Intrepid_Cry_7 [link] [comments]  ( 1 min )
    Conference Decorum? [D]
    Apologies, I don't know where the best place to post this would be. I am very excited to be attending my first in-person conference in about a week (ICASSP). I am an undergrad, and I will be the only one attending the conference from my group. I will be giving an oral presentation, and I am also excited to see others' work and network. I was wondering what some tips and tricks are. Additionally, there is someone whose work I am very interested in. He is giving a plenary talk and hosting a tutorial. Is there a way I could interact with him? He is someone who I would be interested in for grad school. submitted by /u/avd4292 [link] [comments]  ( 1 min )
    [D] Problems with proprietary datasets
    A lot of recent progress in AI was made on proprietary datasets, e.g. ViT used JFT-300M, and both DALL-E/2 papers used proprietary text-image datasets. While the results are truly exceptional, a part of me keeps being bothered by the "proprietary" nature of the datasets, which sometimes makes me question the actual robustness of these models. Every now and then I have the following questions about these models: For pure image models (say ViT), are we sure that the proprietary training dataset does not contain a fraction of the test sets for downstream tasks? For image-text models (say DALL-E/2), are we sure that whatever prompts and generated images we saw in the paper or on their websites were reasonably "original", i.e. not significantly similar to the train set? The first poin…  ( 4 min )
    [D] Variance of sampling in diffusion models
    Is there any "theoretically sound" way to reduce variance during sampling in diffusion models? Even if I use the lower bound suggested in the DDPM paper (and it going toward zero during sampling), my final samples are excessively noisy. Simply reducing the diffusion variance schedule (without changing the number of steps), in my experiments seems to not reach sufficient diffusion at the end of the chain. I'm predicting speech mel-spectrograms and the harmonic amplitudes are excessively noisy, unless I manually reduce variance in the sampling steps, which works, but seems "hacky". submitted by /u/disentangle [link] [comments]  ( 1 min )
    [D] Forecasting future points with partial future data already available
    Working on a forecast model that should output an end-of-month value; the interesting part is that we already have 90% of that data available at the prediction point (max 30 days away). The purpose is to take into account the current monthly trend and project that out until the end of the month, but also take into consideration that we already know those future points with 90% confidence. For example, let's say we're on the 3rd day of the month: Day 1: $10 in sales; Day 2: $15 in sales; Day 3 (current): $20 in sales; Day 4 (future): $15 in sales with 90% confidence; Days 5...29 (future); Day 30 (future): $20 in sales. Total: $600 in sales ($45 current, $550 future). How do we "forecast" that $600 in sales? I've tried multiple linear regression with the X taking into account lagged metrics and the current daily values, and the Y as the true value of 600, to essentially force the model to always predict the end-of-month value at t+1, but this is not optimal obviously. This was also done with various lasso + ridge penalties. Surprisingly a random forest regressor performs almost too well; in fact, it overfits perfectly :) I have the feeling this could be stated as some probabilistic Bayesian model, but I'm not sure what to search or look for. Any ideas would be greatly appreciated! submitted by /u/Christorno [link] [comments]  ( 1 min )
    [R] Awesome Paper List of Vision Transformer & Attention
    Hello all, are you looking for Vision Transformer papers in various areas? Check out this list of papers covering a broad range of different tasks: https://github.com/cmhungsteve/Awesome-Transformer-Attention This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers (e.g., CVPR, NeurIPS, etc.), code, and related websites. Feel free to check it out and share it with others. If you find this repo useful, I would appreciate it if you could star it :) Any comments and contributions in any form are welcome! submitted by /u/cmhung34 [link] [comments]  ( 1 min )
    [D] My experience with running PyTorch on the M1 GPU
    After the announcement, I was super excited to give it a try. I ran a VGG16 on both an M1 MacBook Air (16 GB RAM) and an M1 Pro MacBook Pro (32 GB RAM), and the results were a bit underwhelming: https://preview.redd.it/p8pbnptklf091.png?width=1035&format=png&auto=webp&s=26bb4a43f433b1cd983bb91c37b601b5b01c0318 The GPU performance was 2x as fast as the CPU performance on the M1 Pro, but I was hoping for more. Anyone else tried this and has any tips? I have a more detailed write-up here: Running PyTorch on the M1 GPU And a link to the code examples here on GitHub. submitted by /u/seraschka [link] [comments]  ( 6 min )
    [D] Test set performance metrics better than cross-validated train set metrics
    I have a very limited biological dataset with dimensions of about 200 x 700. I am making regression models with 170 samples in the train data and 30 in the test data. Mainly, the cross-validated models in caret give me performance metrics (R2, RMSE, MAE) that are about 50% lower than when I test my model on the test set. I understand that this might be due to a bad data split, with the test set having 'simpler' samples, and the data is obviously limited. What other reasons could there be? Is there a way to rectify this? submitted by /u/triary95 [link] [comments]  ( 3 min )
    [R] Finetuning fairseq M2M100 on specific language dataset
    https://arxiv.org/abs/2010.11125 Hello, I want to finetune M2M100 on a specific language pair. M2M is a multilingual model of 12B parameters trained on 7.2B training examples from 100 languages. It is able to translate directly between every pair, using both zero-shot learning and bridging between pairs. I want to increase performance on a particular pair (e.g. en-fr), but when training for only 1 epoch the model seems to have forgotten every other pair. Do you know why? I have tried adversarial training to reduce the loss in other languages but it does not work. Does it really make sense to finetune a multilingual model and hope other languages will not be forgotten? Thanks! submitted by /u/m_grosso_ [link] [comments]  ( 1 min )
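    One standard remedy for this kind of catastrophic forgetting is rehearsal: mix a small replay sample of other language pairs into the en-fr fine-tuning batches, often together with a lower learning rate. A rough sketch using the Hugging Face port of M2M-100 (the checkpoint name, the data, and the tokenizer calls are assumptions based on the transformers API of that era, not fairseq):

```python
import random
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Assumed checkpoint: the smallest HF port of M2M-100; larger variants
# follow the same interface.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Hypothetical data: the target pair to improve, plus a small replay sample
# drawn from other language directions to guard against forgetting them.
en_fr = [("en", "fr", "Hello world.", "Bonjour le monde.")]
replay = [("de", "es", "Guten Morgen.", "Buenos dias.")]
batch = en_fr + random.sample(replay, k=1)

for src_lang, tgt_lang, src, tgt in batch:
    tokenizer.src_lang, tokenizer.tgt_lang = src_lang, tgt_lang
    inputs = tokenizer(src, return_tensors="pt")
    with tokenizer.as_target_tokenizer():  # tokenize labels in the target language
        labels = tokenizer(tgt, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()  # gradients accumulate; an optimizer step would follow
```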
    [D] Why do you think transformers are not much used in ML over EHR (health)?
    I have been researching the SOTA in ML over EHR data (e.g.) and it seems that most of the relevant approaches are based on LSTMs or GRUs. Nowadays, when sequences are mostly handled by transformers, why do you think EHR work is still at this point? I have some possible reasons: transformers generally need lots of data, something that may be difficult in a healthcare context (privacy stuff); transformer FC layers may be a big drawback for small devices that aim to deliver live observations on monitoring data, thus pushing the research toward this kind of architecture; intrinsically, the proposed solutions for time handling, interpretation, etc. are engineered in the context of RNNs; or just because of the natural delay of the technology in the medical field. However, I would like to hear more points of view. What do you think? submitted by /u/MrLeylo [link] [comments]  ( 3 min )
    [Research] Syncing video with sensor data for creating a learning dataset
    Hello, I'm working on an autonomous vehicle project at university and my problem is to create an indoor driving dataset. There's a video feed from a camera, odometry data from two optocouplers, and an IMU (gyro and accel). I have collected data from the encoders and IMU using PuTTY before, and it synced the data for me and packaged it in a nice .csv file. Is there a similar free/open-source tool for that? If not, how does one sync a video feed on a vehicle with sensor data from IMUs, encoders, and other sensors? Any suggestions on how I can integrate all of that in a neat .csv file, etc.? Thanks in advance. https://preview.redd.it/dzgsrlectd091.png?width=1060&format=png&auto=webp&s=26a6e72aee194fcc99def983d8ba0ca39d61dc34 submitted by /u/forzavettel77 [link] [comments]  ( 1 min )
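    If each stream carries its own timestamps, one lightweight option is pandas' merge_asof, which matches every video frame to the most recent reading from each sensor. A minimal sketch with hypothetical IMU and encoder streams:

```python
import pandas as pd

# Hypothetical streams: video frame timestamps plus IMU and encoder readings,
# each with its own clock rate. merge_asof matches every frame to the most
# recent sensor reading at or before that frame's timestamp.
frames = pd.DataFrame({"t": pd.to_datetime([0, 33, 66, 100], unit="ms"),
                       "frame_id": [0, 1, 2, 3]})
imu = pd.DataFrame({"t": pd.to_datetime([0, 10, 20, 30, 40, 50, 90], unit="ms"),
                    "gyro_z": [0.0, 0.1, 0.0, -0.1, 0.2, 0.1, 0.0]})
enc = pd.DataFrame({"t": pd.to_datetime([0, 25, 50, 75, 100], unit="ms"),
                    "ticks": [0, 3, 7, 12, 18]})

synced = pd.merge_asof(frames, imu, on="t", direction="backward")
synced = pd.merge_asof(synced, enc, on="t", direction="backward")
synced.to_csv("driving_dataset.csv", index=False)  # one row per frame
print(synced)
```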
    [D] Measuring similarity between different neural network architectures
    Let's say we have 3 different network architectures for image classification (architectures A, B, C). I want to determine whether the structure of A is more similar to B or to C. For example, one possible way would be to compute a similarity metric between each pair of networks I want to compare. Another way would be to generate an embedding vector for each architecture and measure the distance between them. I would also like to take into account the differences in the hyperparameters of each layer (for example, number of convolutional filters, kernel size, etc.) when measuring similarity. Are there any openly available algorithms or libraries that are capable of doing so? Thanks. submitted by /u/unguided_deepness [link] [comments]  ( 1 min )
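    A naive baseline for this is easy to sketch: embed each architecture as a bag of (layer type, hyperparameter) tokens and compare embeddings with cosine similarity. This is a hypothetical illustration, not an established library; a more faithful comparison might use graph edit distance or learned graph embeddings.

```python
from collections import Counter
import numpy as np

# Hypothetical layer summaries: (layer type, key hyperparameter) tokens.
arch_a = [("conv", 3), ("conv", 3), ("pool", 2), ("fc", 128)]
arch_b = [("conv", 3), ("conv", 5), ("pool", 2), ("fc", 128)]
arch_c = [("fc", 512), ("fc", 256), ("fc", 128)]

def embed(arch, vocab):
    # Count how often each token appears; the counts form the embedding vector.
    counts = Counter(arch)
    return np.array([counts[tok] for tok in vocab], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

vocab = sorted(set(arch_a) | set(arch_b) | set(arch_c))
ea, eb, ec = (embed(a, vocab) for a in (arch_a, arch_b, arch_c))
print(cosine(ea, eb), cosine(ea, ec))  # A should score closer to B than to C
```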
    [D] KDD rejection ventilation
    KDD is out, and she rejected me. How are you feeling? Should I propose to CIKM instead? submitted by /u/HoboHash [link] [comments]  ( 1 min )
    Worst part of machine learning job? [Discussion]
    For people who work in industry, what is the hardest part about your machine learning workflow? Like, what is the most time consuming, expensive, or annoying part? submitted by /u/Puzzleheaded_Cup1367 [link] [comments]  ( 1 min )
    [D] Unpopular Opinion - the new arxiv sanity sucks
    Please give me the old website back. I liked being able to see the top papers in the last week and month. All of that is gone now!!!!!!!! submitted by /u/Big_psuenis [link] [comments]
  • Open

    As part of my senior thesis, I developed an Open AI Gym and PettingZoo custom environment that implements a number of Stag Hunt-like interactions. Wanted to share it with the community and hopefully get some thoughts and feedback!
    submitted by /u/Defaul7 [link] [comments]  ( 1 min )
    Unity ML-Agents: NullReferenceException with Camera Sensor
    Hi guys! I'm trying to set up an ML-Agent to use visual input via the Camera Sensor, but for some reason I'm getting the following error: NullReferenceException: Object reference not set to an instance of an object Unity.MLAgents.Sensors.CameraSensor.ObservationToTexture (UnityEngine.Camera obsCamera, UnityEngine.Texture2D texture2D, System.Int32 width, System.Int32 height) (at Library/PackageCache/com.unity.ml-agents@2.0.1/Runtime/Sensors/CameraSensor.cs:158) I'm new to Unity so I'm struggling to figure out what's going on here... I saw there's a field called Target Texture in the Camera GameObject, so I created a Render Texture with the default settings and added it to that, but I'm still getting the same error... Any help would be greatly appreciated!! 🙏🏽 submitted by /u/leozinho2r [link] [comments]  ( 1 min )
    Any ideas on how might be best to encode a graph of a street network for a machine learning model?
    I've been working on this for a while, about 8 weeks, and this problem is one I knew I'd eventually hit a hard point at. My graph is composed of spatial data (nodes and edges), and there are no isolated elements. The nodes and edges are comprised of coordinates that describe their location, and their structure describes whether it's a point or a line on the map. Points are intersections or road ends, and lines are the roads connecting them. If you're familiar, I'm working with OSM data. I've selected some random nodes to serve in place as drop-off locations and gas stations. I've created a system that works by querying a city to get its road network, then specifying a number of gas stations and drop-off locations to use, and it'll begin using a reinforcement learning algorithm to discover the most efficient route between every drop-off location. It's basically applying reinforcement learning to the traveling salesman problem in a way that can be fitted to any city on Earth. However, I am struggling to figure out the best way to encode the details of this map (drop-off locations, gas stations, nodes, edges, distances, etc.) so that it yields valuable information to an algorithm. I'd imagine this would require developing some mathematical way to correlate different parts of the map. I've thought about creating a data structure of polygons made by using the neighbors of each node as the boundaries. Using triangle geometry, it sounds like it would enable an algorithm to correlate more bits of the map. However, this is also very computationally expensive and doesn't scale well. Does anyone with more experience working with spatial data have any pointers they can toss me? submitted by /u/professorDissociate [link] [comments]  ( 2 min )
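    One lightweight encoding that scales is to keep the graph in coordinate form: a node-feature matrix (coordinates plus binary flags for drop-off and gas-station nodes), a 2xE edge index, and per-edge lengths, which is the representation graph neural networks consume. A minimal sketch on a synthetic stand-in for the OSM graph (osmnx returns a networkx graph with the same structure):

```python
import numpy as np
import networkx as nx

# Synthetic stand-in for an OSM road graph: nodes carry lon/lat coordinates
# and a flag marking hypothetical drop-off locations.
G = nx.Graph()
G.add_node(0, x=-73.99, y=40.73, dropoff=1)
G.add_node(1, x=-73.98, y=40.74, dropoff=0)
G.add_node(2, x=-73.97, y=40.73, dropoff=1)
G.add_edges_from([(0, 1), (1, 2)])

nodes = sorted(G.nodes)
idx = {n: i for i, n in enumerate(nodes)}
node_feats = np.array([[G.nodes[n]["x"], G.nodes[n]["y"], G.nodes[n]["dropoff"]]
                       for n in nodes])                          # N x 3 feature matrix
edge_index = np.array([[idx[u], idx[v]] for u, v in G.edges]).T  # 2 x E connectivity
edge_len = np.array([np.hypot(node_feats[idx[u], 0] - node_feats[idx[v], 0],
                              node_feats[idx[u], 1] - node_feats[idx[v], 1])
                     for u, v in G.edges])                       # per-edge lengths
print(node_feats.shape, edge_index.shape, edge_len.round(4))
```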
    DQN keep return same action for most of the state
    Things that I tried: a softmax policy when choosing actions; a crazy low epsilon decay rate (I even tried to run tons of episodes with epsilon at 1.0); n-step learning, because of the sparse reward. What else can help to deal with this problem? submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
  • Open

    Happiness and the idea of a Lobotomy
    A Book on Happiness Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 3 min )
    The Banana Test, Sex & Death
    WHAT IF Adam and Eve were AI created by us? Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 5 min )
    How to Write an Essay on Any Topic Using AI
    AI systems, like Jasper AI, can write essays on any topic, just with one click- you don’t need to be an expert in writing and stay up late…  ( 4 min )
  • Open

    Detect social media fake news using graph machine learning with Amazon Neptune ML
    In recent years, social media has become a common means for sharing and consuming news. However, the spread of misinformation and fake news on these platforms has posed a major challenge to the well-being of individuals and societies. Therefore, it is imperative that we develop robust and automated solutions for early detection of fake news […]  ( 11 min )
    Optimize F1 aerodynamic geometries via Design of Experiments and machine learning
    FORMULA 1 (F1) cars are the fastest regulated road-course racing vehicles in the world. Although these open-wheel automobiles are only 20–30 kilometers (or 12–18 miles) per-hour faster than top-of-the-line sports cars, they can speed around corners up to five times as fast due to the powerful aerodynamic downforce they create. Downforce is the vertical force […]  ( 10 min )
    Build a risk management machine learning workflow on Amazon SageMaker with no code
    Since the global financial crisis, risk management has taken a major role in shaping decision-making for banks, including predicting loan status for potential customers. This is often a data-intensive exercise that requires machine learning (ML). However, not all organizations have the data science resources and expertise to build a risk management ML workflow. Amazon SageMaker […]  ( 9 min )
  • Open

    ‘Fortnite’ Arrives This GFN Thursday With GeForce Performance You Can Touch
    Fortnite on GeForce NOW with touch controls on mobile is now available to all members, streaming through the Safari web browser on iOS and the GeForce NOW Android app. The full launch — including the removal of the waitlist — follows a successful beta period in which more than 500,000 participants streamed over 4 million Read article > The post ‘Fortnite’ Arrives This GFN Thursday With GeForce Performance You Can Touch appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Types of tasks in Machine Learning 👇
    submitted by /u/mr-minion [link] [comments]
    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/maneesh123456 [link] [comments]
    Image-blending neural network
    Hi everyone. Can someone give a link to the neural network that takes several source images and merges them into one? I know there is one; I saw it working about half a year ago. But I'm unable to find it now. All I've found is imageblender.com, which does a similar thing, though it uses a text description as an input. submitted by /u/cha_zz [link] [comments]  ( 1 min )
  • Open

    The Foundation of Data Fabrics and AI: Semantic Knowledge Graphs
    Data management agility has become of key importance to organizations as the amount and complexity of data continues to increase, along with the desire to avoid creating new data silos. Creating a ‘data fabric’ has been proposed as an agile design concept by leading analysts, such as Mark Beyer, Distinguished VP Analyst… Read More »The Foundation of Data Fabrics and AI: Semantic Knowledge Graphs The post The Foundation of Data Fabrics and AI: Semantic Knowledge Graphs appeared first on Data Science Central.  ( 4 min )
    Artificial Intelligence and the Future of Medical Imaging
    Artificial intelligence (AI) is the simulation of human intelligence processes by machines, mainly computer systems. Artificial intelligence has extensive applications in the healthcare sector. AI solutions assist healthcare providers in several aspects of patient care and administrative processes. Medical imaging can be defined as the diagnostic procedure that encompasses the formation of visual assistance and… Read More »Artificial Intelligence and the Future of Medical Imaging The post Artificial Intelligence and the Future of Medical Imaging appeared first on Data Science Central.  ( 3 min )
    How to Use the Resources of MQL5.community to Empower Your Own Business
    Since its inception, algorithmic trading has been a popular strategy for investors. It uses mathematical rules to automate the trading of various assets, such as stocks and futures. However, it has been very challenging for people who don’t have the necessary skills and knowledge — as reported by Psychology Today. According to Nasdaq, one of… Read More »How to Use the Resources of MQL5.community to Empower Your Own Business The post How to Use the Resources of MQL5.community to Empower Your Own Business appeared first on Data Science Central.  ( 4 min )
  • Open

    Enhancing the Transformer Decoder with Transition-based Syntax. (arXiv:2101.12640v3 [cs.CL] UPDATED)
    Notwithstanding recent advances, syntactic generalization remains a challenge for text decoders. While some studies showed gains from incorporating source-side symbolic syntactic and semantic structure into text generation Transformers, very little work addressed the decoding of such structure. We propose a general approach for tree decoding using a transition-based approach. Examining the challenging test case of incorporating Universal Dependencies syntax into machine translation, we present substantial improvements on test sets that focus on syntactic generalization, while presenting improved or comparable performance on standard MT benchmarks. Further qualitative analysis addresses cases where syntactic generalization in the vanilla Transformer decoder is inadequate and demonstrates the advantages afforded by integrating syntactic information.  ( 2 min )
    A Central Limit Theorem, Loss Aversion and Multi-Armed Bandits. (arXiv:2106.05472v2 [math.PR] UPDATED)
    This paper studies a multi-armed bandit problem where the decision-maker is loss averse, in particular she is risk averse in the domain of gains and risk loving in the domain of losses. The focus is on large horizons. Consequences of loss aversion for asymptotic (large horizon) properties are derived in a number of analytical results. The analysis is based on a new central limit theorem for a set of measures under which conditional variances can vary in a largely unstructured history-dependent way subject only to the restriction that they lie in a fixed interval.  ( 2 min )
    A Unified Linear Speedup Analysis of Stochastic FedAvg and Nesterov Accelerated FedAvg. (arXiv:2007.05690v3 [cs.LG] UPDATED)
    Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-i.i.d. data across the network, low device participation, high communication costs, and the mandate that data remain private bring challenges in understanding the convergence of FL algorithms, particularly with regard to how convergence scales with the number of participating devices. In this paper, we focus on Federated Averaging (FedAvg)--arguably the most popular and effective FL algorithm class in use today--and provide a unified and comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, a systematic study of how FedAvg's convergence scales with the number of participating devices in the fully heterogeneous FL setting is lacking--a crucial issue whose answer would shed light on the performance of FedAvg in large FL systems in practice. We fill this gap by providing a unified analysis that establishes convergence guarantees for FedAvg for strongly convex smooth, convex smooth, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates and communication efficiencies. While there have been linear speedup results from distributed optimization that assume full participation, ours are the first to establish linear speedup for FedAvg under both statistical and system heterogeneity. For strongly convex and convex problems, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm, which are the first linear speedup guarantees for momentum variants of FedAvg in convex settings. Empirical studies of the algorithms in various settings support our theoretical results.  ( 3 min )
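    To make the setting concrete, below is a minimal, self-contained FedAvg sketch with partial device participation and heterogeneous (non-i.i.d.) clients on a toy least-squares problem. All names (local_sgd, the client construction, the step counts) are illustrative choices, not the paper's experimental setup.

```python
# A minimal FedAvg sketch (illustrative, not the paper's setup).
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(w, X, y, steps=5, lr=0.1):
    """Run a few local SGD steps on one client's least-squares loss."""
    for _ in range(steps):
        i = rng.integers(len(X))
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5*(x.w - y)^2
        w = w - lr * grad
    return w

# Synthetic heterogeneous clients: each has a client-specific input shift.
w_true = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(20):
    X = rng.normal(rng.normal(), 1.0, size=(50, 3))
    y = X @ w_true + 0.1 * rng.normal(size=50)
    clients.append((X, y))

w = np.zeros(3)
for round_ in range(100):
    # Partial participation: sample a subset of devices each round.
    selected = rng.choice(len(clients), size=5, replace=False)
    updates = [local_sgd(w.copy(), *clients[k]) for k in selected]
    w = np.mean(updates, axis=0)          # server-side averaging

print("estimated weights:", np.round(w, 2))
```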
    Fair and Green Hyperparameter Optimization via Multi-objective and Multiple Information Source Bayesian Optimization. (arXiv:2205.08835v1 [cs.LG])
    There is a consensus that focusing only on accuracy in searching for optimal machine learning models amplifies biases contained in the data, leading to unfair predictions and decision supports. Recently, multi-objective hyperparameter optimization has been proposed to search for machine learning models which offer equally Pareto-efficient trade-offs between accuracy and fairness. Although these approaches proved to be more versatile than fairness-aware machine learning algorithms -- which optimize accuracy constrained to some threshold on fairness -- they could drastically increase the energy consumption in the case of large datasets. In this paper we propose FanG-HPO, a Fair and Green Hyperparameter Optimization (HPO) approach based on both multi-objective and multiple information source Bayesian optimization. FanG-HPO uses subsets of the large dataset (aka information sources) to obtain cheap approximations of both accuracy and fairness, and multi-objective Bayesian Optimization to efficiently identify Pareto-efficient machine learning models. Experiments consider two benchmark (fairness) datasets and two machine learning algorithms (XGBoost and Multi-Layer Perceptron), and provide an assessment of FanG-HPO against both fairness-aware machine learning algorithms and hyperparameter optimization via a multi-objective single-source optimization algorithm in BoTorch, a state-of-the-art platform for Bayesian Optimization.  ( 2 min )
    Conformalized Online Learning: Online Calibration Without a Holdout Set. (arXiv:2205.09095v1 [cs.LG])
    We develop a framework for constructing uncertainty sets with a valid coverage guarantee in an online setting, in which the underlying data distribution can drastically -- and even adversarially -- shift over time. The technique we propose is highly flexible as it can be integrated with any online learning algorithm, requiring minimal implementation effort and computational cost. A key advantage of our method over existing alternatives -- which also build on conformal inference -- is that we do not need to split the data into training and holdout calibration sets. This allows us to fit the predictive model in a fully online manner, utilizing the most recent observation for constructing calibrated uncertainty sets. Consequently, and in contrast with existing techniques, (i) the sets we build can quickly adapt to new changes in the distribution; and (ii) our procedure does not require refitting the model at each time step. Using synthetic and real-world benchmark data sets, we demonstrate the validity of our theory and the improved performance of our proposal over existing techniques. To demonstrate the greater flexibility of the proposed method, we show how to construct valid intervals for a multiple-output regression problem that previous sequential calibration methods cannot handle due to impractical computational and memory requirements.  ( 2 min )
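    As a rough illustration of holdout-free online calibration (a sketch in the spirit of adaptive conformal methods, not the paper's exact procedure), the snippet below fits a model fully online and tracks a residual quantile with a pinball-loss subgradient step, so each new observation both calibrates the interval and updates the model.

```python
# Holdout-free online interval calibration: a hedged sketch.
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1      # target miscoverage
q = 0.0          # running estimate of the (1 - alpha) residual quantile
lr = 0.05
w = 0.0          # online linear model, updated with SGD as data arrive

cover = []
for t in range(5000):
    x = rng.normal()
    y = 2.0 * x + rng.normal()
    pred = w * x
    lo, hi = pred - q, pred + q          # interval built before seeing y
    cover.append(lo <= y <= hi)
    resid = abs(y - pred)
    # Online quantile tracking: subgradient step on the pinball loss.
    q += lr * ((1 - alpha) if resid > q else -alpha)
    # Model update uses the newest point (no holdout set needed).
    w -= 0.01 * (pred - y) * x

print("empirical coverage:", np.mean(cover[1000:]))   # close to 0.9
```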
    Constraining the Attack Space of Machine Learning Models with Distribution Clamping Preprocessing. (arXiv:2205.08989v1 [cs.LG])
    Preprocessing and outlier detection techniques have both been applied to neural networks to increase robustness with varying degrees of success. In this paper, we formalize the ideal preprocessor function as one that would take any input and set it to the nearest in-distribution input. In other words, we detect any anomalous pixels and set them such that the new input is in-distribution. We then illustrate a relaxed solution to this problem in the context of patch attacks. Specifically, we demonstrate that we can model constraints on the patch attack that specify regions as out of distribution. With these constraints, we are able to preprocess inputs successfully, increasing robustness on CARLA object detection.  ( 2 min )
    SimCSE: Simple Contrastive Learning of Sentence Embeddings. (arXiv:2104.08821v4 [cs.CL] UPDATED)
    This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.  ( 2 min )
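    A minimal sketch of the unsupervised SimCSE objective follows: the same batch is encoded twice, the two independent dropout masks act as the only noise, and an InfoNCE-style cross-entropy pulls the two views of each sentence together. The tiny MLP encoder and the batch of random features are stand-ins for BERT and real sentences.

```python
# Unsupervised SimCSE objective: a minimal sketch with a toy encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(), nn.Dropout(0.1), nn.Linear(128, 32)
)

def simcse_loss(x, temperature=0.05):
    z1 = encoder(x)   # dropout is active in train mode, so the two passes
    z2 = encoder(x)   # see different masks -> two "views" of each input
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature        # (batch, batch) cosine similarities
    labels = torch.arange(x.size(0))       # positives sit on the diagonal
    return F.cross_entropy(sim, labels)

x = torch.randn(16, 64)   # stand-in for a batch of sentence features
loss = simcse_loss(x)
loss.backward()
print(float(loss))
```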
    GeoPointGAN: Synthetic Spatial Data with Local Label Differential Privacy. (arXiv:2205.08886v1 [cs.LG])
    Synthetic data generation is a fundamental task for many data management and data science applications. Spatial data is of particular interest, and its sensitive nature often leads to privacy concerns. We introduce GeoPointGAN, a novel GAN-based solution for generating synthetic spatial point datasets with high utility and strong individual level privacy guarantees. GeoPointGAN's architecture includes a novel point transformation generator that learns to project randomly generated point co-ordinates into meaningful synthetic co-ordinates that capture both microscopic (e.g., junctions, squares) and macroscopic (e.g., parks, lakes) geographic features. We provide our privacy guarantees through label local differential privacy, which is more practical than traditional local differential privacy. We seamlessly integrate this level of privacy into GeoPointGAN by augmenting the discriminator to the point level and implementing a randomized response-based mechanism that flips the labels associated with the 'real' and 'fake' points used in training. Extensive experiments show that GeoPointGAN significantly outperforms recent solutions, improving by up to 10 times compared to the most competitive baseline. We also evaluate GeoPointGAN using range, hotspot, and facility location queries, which confirm the practical effectiveness of GeoPointGAN for privacy-preserving querying. The results illustrate that a strong level of privacy is achieved with little-to-no adverse utility cost, which we explain through the generalization and regularization effects that are realized by flipping the labels of the data during training.  ( 2 min )
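    The randomized-response mechanism mentioned above can be sketched in a few lines: each binary real/fake label is kept with probability e^eps / (1 + e^eps) and flipped otherwise, which satisfies eps-label-LDP. This is a generic sketch of randomized response, not GeoPointGAN's full training loop.

```python
# Randomized response for label local differential privacy (sketch).
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(labels, eps):
    """Flip each binary label with probability 1 / (1 + e^eps)."""
    p_keep = np.exp(eps) / (1.0 + np.exp(eps))
    keep = rng.random(labels.shape) < p_keep
    return np.where(keep, labels, 1 - labels)

labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # 1 = 'real' point, 0 = 'fake'
print(randomized_response(labels, eps=1.0))
```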
    Pluralistic Image Completion with Probabilistic Mixture-of-Experts. (arXiv:2205.09086v1 [cs.CV])
    Pluralistic image completion focuses on generating both visually realistic and diverse results for image completion. Prior methods have achieved empirical success on this task, but the constraints they use are hard to interpret and unsatisfactory in two respects. First, the constraints for visual realism can be weakly correlated with the objective of image completion or even redundant. Second, the constraints for diversity are designed to be task-agnostic, which causes them to work poorly. In this paper, to address these issues, we propose an end-to-end probabilistic method. Specifically, we introduce a unified probabilistic graph model that represents the complex interactions in image completion. The entire procedure of image completion is then mathematically divided into several sub-procedures, which helps efficient enforcement of constraints. The sub-procedure directly related to pluralistic results is identified, where the interaction is established by a Gaussian mixture model (GMM). The inherent parameters of the GMM are task-related and optimized adaptively during training, while the number of its primitives conveniently controls the diversity of results. We formally establish the effectiveness of our method and demonstrate it with comprehensive experiments.  ( 2 min )
    Bridging the gap between QP-based and MPC-based RL. (arXiv:2205.08856v1 [eess.SY])
    Reinforcement learning methods typically use Deep Neural Networks (DNNs) to approximate the value functions and policies underlying a Markov Decision Process. Unfortunately, DNN-based RL suffers from a lack of explainability of the resulting policy. In this paper, we instead approximate the policy and value functions using an optimization problem taking the form of a Quadratic Program (QP). We propose simple tools to promote structure in the QP, pushing it to resemble a linear MPC scheme. A generic unstructured QP offers high flexibility for learning, while a QP having the structure of an MPC scheme promotes the explainability of the resulting policy and additionally provides ways to analyze it. The tools we propose allow for continuously adjusting the trade-off between the former and the latter during learning. We illustrate the workings of our proposed method and the resulting structure on a point-mass task.  ( 2 min )
    Imagining new futures beyond predictive systems in child welfare: A qualitative study with impacted stakeholders. (arXiv:2205.08928v1 [cs.HC])
    Child welfare agencies across the United States are turning to data-driven predictive technologies (commonly called predictive analytics) which use government administrative data to assist workers' decision-making. While some prior work has explored impacted stakeholders' concerns with current uses of data-driven predictive risk models (PRMs), less work has asked stakeholders whether such tools ought to be used in the first place. In this work, we conducted a set of seven design workshops with 35 stakeholders who have been impacted by the child welfare system or who work in it to understand their beliefs and concerns around PRMs, and to engage them in imagining new uses of data and technologies in the child welfare system. We found that participants worried current PRMs perpetuate or exacerbate existing problems in child welfare. Participants suggested new ways to use data and data-driven tools to better support impacted communities and suggested paths to mitigate possible harms of these tools. Participants also suggested low-tech or no-tech alternatives to PRMs to address problems in child welfare. Our study sheds light on how researchers and designers can work in solidarity with impacted communities, possibly to circumvent or oppose child welfare agencies.  ( 2 min )
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v1 [cs.LG])
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.  ( 2 min )
    Large Neural Networks Learning from Scratch with Very Few Data and without Regularization. (arXiv:2205.08836v1 [cs.CV])
    Recent findings have shown that Neural Networks generalize even in over-parametrized regimes with zero training error. This is surprising, since it runs completely counter to traditional machine learning wisdom. In our empirical study we fortify these findings in the domain of fine-grained image classification. We show that very large Convolutional Neural Networks with millions of weights do learn with only a handful of training samples and without image augmentation, explicit regularization or pretraining. We train the architectures ResNet018, ResNet101 and VGG19 on subsets of the difficult benchmark datasets Caltech101, CUB_200_2011, FGVCAircraft, Flowers102 and StanfordCars with 100 classes and more, perform a comprehensive comparative study and draw implications for the practical application of CNNs. Finally, we show that VGG19 with 140 million weights learns to distinguish airplanes and motorbikes with up to 95% accuracy with only 20 samples per class.  ( 2 min )
    Structural Extensions of Basis Pursuit: Guarantees on Adversarial Robustness. (arXiv:2205.08955v1 [cs.LG])
    While deep neural networks are sensitive to adversarial noise, sparse coding using the Basis Pursuit (BP) method is robust against such attacks, including its multi-layer extensions. We prove that the stability theorem of BP holds under the following generalizations: (i) the regularization procedure can be separated into disjoint groups with different weights, (ii) neurons or full layers may form groups, and (iii) the regularizer takes various generalized forms of the $\ell_1$ norm. This result provides the proof for the architectural generalizations of Cazenavette et al. (2021), including (iv) an approximation of the complete architecture as a shallow sparse coding network. Due to this approximation, we settled on experimenting with shallow networks and studied their robustness against the Iterative Fast Gradient Sign Method on a synthetic dataset and MNIST. We introduce classification based on the $\ell_2$ norms of the groups and show numerically that it can be accurate and offers considerable speedups. In this family, the linear transformer shows the best performance. Based on the theoretical results and the numerical simulations, we highlight numerical matters that may improve performance further.  ( 2 min )
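    The group-norm classification rule described above reduces to a few lines: split the sparse code into per-class groups and pick the class whose coefficients carry the most l2 energy. The equal-sized group layout below is an illustrative assumption.

```python
# Classification by group l2 norms: a minimal sketch.
import numpy as np

def group_l2_classify(code, n_classes):
    """Assign the input to the class whose group of coefficients
    has the largest l2 norm (groups assumed equal-sized here)."""
    groups = np.split(code, n_classes)
    return int(np.argmax([np.linalg.norm(g) for g in groups]))

code = np.array([0.1, 0.0, 0.9, 1.2, 0.0, 0.2])   # 3 classes, 2 atoms each
print(group_l2_classify(code, n_classes=3))        # -> 1
```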
    Learning latent representations for operational nitrogen response rate prediction. (arXiv:2205.09025v1 [cs.LG])
    Learning latent representations has aided operational decision-making in several disciplines. Its advantages include uncovering hidden interactions in data and automating procedures which were performed manually in the past. Representation learning is also being adopted by earth and environmental sciences. However, there are still subfields that depend on manual feature engineering based on expert knowledge and on algorithms which do not utilize the latent space. Relying on those techniques can inhibit operational decision-making, since they impose data constraints and inhibit automation. In this work, we adopt a case study for nitrogen response rate prediction and examine whether representation learning can be used operationally. We compare a Multilayer Perceptron, an Autoencoder, and a dual-head Autoencoder with a reference Random Forest model for nitrogen response rate prediction. To bring the predictions closer to an operational setting we assume the absence of future weather data, and we evaluate the models using error metrics and a domain-derived error threshold. The results show that learning latent representations can provide operational nitrogen response rate predictions by offering performance equal to, and sometimes better than, the reference model.  ( 2 min )
    Exploring the Advantages of Dense-Vector to One-Hot Encoding of Intent Classes in Out-of-Scope Detection Tasks. (arXiv:2205.09021v1 [cs.LG])
    This work explores the intrinsic limitations of the popular one-hot encoding method in the classification of intents when detection of out-of-scope (OOS) inputs is required. Although recent work has shown that there can be significant improvements in OOS detection when the intent classes are represented as dense vectors based on domain-specific knowledge, we argue in this paper that such gains are more likely due to the advantages of dense-vector over one-hot encoding methods in representing the complexity of the OOS space. We start by showing how dense-vector encodings can create OOS spaces with much richer topologies than one-hot encoding methods. We then demonstrate empirically, using four standard intent classification datasets, that knowledge-free, randomly generated dense-vector encodings of intent classes can yield massive gains of over 20% compared to one-hot encodings, and also outperform the previous domain-knowledge-based SOTA on one of the datasets. We finish by describing a novel algorithm to search for good dense-vector encodings and present initial but promising experimental results of its use.  ( 2 min )
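    A minimal sketch of the knowledge-free variant is below: each intent is assigned a random unit vector, and an input whose predicted embedding is not close enough to any class vector is flagged as out-of-scope. The dimensions and the threshold are illustrative, not taken from the paper.

```python
# Random dense-vector class encodings with a cosine OOS threshold (sketch).
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 10, 64
class_vecs = rng.normal(size=(n_classes, dim))
class_vecs /= np.linalg.norm(class_vecs, axis=1, keepdims=True)

def predict(embedding, threshold=0.7):
    """embedding: the model's output vector for one utterance."""
    sims = class_vecs @ embedding / np.linalg.norm(embedding)
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else "OOS"

print(predict(class_vecs[3] + 0.1 * rng.normal(size=dim)))  # likely 3
print(predict(rng.normal(size=dim)))                        # likely "OOS"
```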
    Generating Explanations from Deep Reinforcement Learning Using Episodic Memory. (arXiv:2205.08926v1 [cs.AI])
    Deep Reinforcement Learning (RL) involves the use of Deep Neural Networks (DNNs) to make sequential decisions in order to maximize reward. For many tasks the resulting sequence of actions produced by a Deep RL policy can be long and difficult to understand for humans. A crucial component of human explanations is selectivity, whereby only key decisions and causes are recounted. Imbuing Deep RL agents with such an ability would make their resulting policies easier to understand from a human perspective and generate a concise set of instructions to aid the learning of future agents. To this end we use a Deep RL agent with an episodic memory system to identify and recount key decisions during policy execution. We show that these decisions form a short, human readable explanation that can also be used to speed up the learning of naive Deep RL agents in an algorithm-independent manner.  ( 2 min )
    World Value Functions: Knowledge Representation for Multitask Reinforcement Learning. (arXiv:2205.08827v1 [cs.LG])
    An open problem in artificial intelligence is how to learn and represent knowledge that is sufficient for a general agent that needs to solve multiple tasks in a given world. In this work we propose world value functions (WVFs), which are a type of general value function with mastery of the world - they represent not only how to solve a given task, but also how to solve any other goal-reaching task. To achieve this, we equip the agent with an internal goal space defined as all the world states where it experiences a terminal transition - a task outcome. The agent can then modify task rewards to define its own reward function, which provably drives it to learn how to achieve all achievable internal goals, and the value of doing so in the current task. We demonstrate a number of benefits of WVFs. When the agent's internal goal space is the entire state space, we demonstrate that the transition function can be inferred from the learned WVF, which allows the agent to plan using learned value functions. Additionally, we show that for tasks in the same world, a pretrained agent that has learned any WVF can then infer the policy and value function for any new task directly from its rewards. Finally, an important property for long-lived agents is the ability to reuse existing knowledge to solve new tasks. Using WVFs as the knowledge representation for learned tasks, we show that an agent is able to solve their logical combination zero-shot, resulting in a combinatorially increasing number of skills throughout their lifetime.  ( 2 min )
    Position Aided Beam Prediction in the Real World: How Useful GPS Locations Actually Are?. (arXiv:2205.09054v1 [eess.SP])
    Millimeter-wave (mmWave) communication systems rely on narrow beams for achieving sufficient receive signal power. Adjusting these beams is typically associated with large training overhead, which becomes particularly critical for highly-mobile applications. Intuitively, since optimal beam selection can benefit from the knowledge of the positions of communication terminals, there has been increasing interest in leveraging position data to reduce the overhead in mmWave beam prediction. Prior work, however, studied this problem using only synthetic data that generally does not accurately represent real-world measurements. In this paper, we investigate position-aided beam prediction using a real-world large-scale dataset to derive insights into precisely how much overhead can be saved in practice. Furthermore, we analyze which machine learning algorithms perform best, what factors degrade inference performance in real data, and which machine learning metrics are more meaningful in capturing the actual communication system performance.  ( 2 min )
    Sharp asymptotics on the compression of two-layer neural networks. (arXiv:2205.08199v2 [cs.IT] UPDATED)
    In this paper, we study the compression of a target two-layer neural network with N nodes into a compressed network with M < N nodes. More precisely, we consider the setting in which the weights of the target network are i.i.d. sub-Gaussian, and we minimize the population L2 loss between the outputs of the target and of the compressed network, under the assumption of Gaussian inputs. By using tools from high-dimensional probability, we show that this non-convex problem can be simplified when the target network is sufficiently over-parameterized, and provide the error rate of this approximation as a function of the input dimension and N. For a ReLU activation function, we conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF), while the scaling of the weights and the orientation of the ETF depend on the parameters of the target network. Numerical evidence is provided to support this conjecture.
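    In symbols, the compression objective above is the following population loss, where $f_N$ is the target network, $g_M(\cdot;\theta)$ the compressed network, and $x$ a standard Gaussian input in $\mathbb{R}^d$ (notation assumed here for illustration):

```latex
\min_{\theta} \;
  \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}
  \Bigl[ \bigl( f_N(x) - g_M(x;\theta) \bigr)^2 \Bigr]
```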
    Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml. (arXiv:2205.07690v1 [cs.CV] CROSS LISTED)
    In this paper, we investigate how field programmable gate arrays can serve as hardware accelerators for real-time semantic segmentation tasks relevant for autonomous driving. Considering compressed versions of the ENet convolutional neural network architecture, we demonstrate a fully-on-chip deployment with a latency of 4.9 ms per image, using less than 30% of the available resources on a Xilinx ZCU102 evaluation board. The latency is reduced to 3 ms per image when increasing the batch size to ten, corresponding to the use case where the autonomous vehicle receives inputs from multiple cameras simultaneously. We show, through aggressive filter reduction and heterogeneous quantization-aware training, and an optimized implementation of convolutional layers, that the power consumption and resource utilization can be significantly reduced while maintaining accuracy on the Cityscapes dataset.
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v2 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). Then, we approximate the resultant GP by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). ICK is flexible and can be used to include prior information in neural networks in many applications. We demonstrate the strength of our framework by showing its superior performance and flexibility on both synthetic and real-world data sets. The code is available at: https://anonymous.4open.science/r/ICK_NNGP-17C5/.
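    As a rough sketch of the composite-kernel idea (not the paper's exact ICK construction), the snippet below adds a kernel implicitly defined by a neural feature map to an explicit periodic kernel encoding known seasonality; the architecture and period are illustrative choices.

```python
# Composite kernel sketch: NN-implicit kernel + explicit periodic kernel.
import math
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 8))

def nn_kernel(x1, x2):
    """Implicit kernel: inner product of learned feature maps."""
    return net(x1) @ net(x2).t()

def periodic_kernel(x1, x2, period=12.0, lengthscale=1.0):
    """Explicit kernel encoding a known seasonality of the given period."""
    d = x1 - x2.t()    # pairwise differences via broadcasting
    return torch.exp(-2 * torch.sin(math.pi * d / period) ** 2 / lengthscale ** 2)

def composite_kernel(x1, x2):
    return nn_kernel(x1, x2) + periodic_kernel(x1, x2)

x = torch.linspace(0, 24, 10).unsqueeze(-1)   # e.g. monthly time stamps
K = composite_kernel(x, x)
print(K.shape)                                 # torch.Size([10, 10])
```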
    "What makes a question inquisitive?" A Study on Type-Controlled Inquisitive Question Generation. (arXiv:2205.08056v2 [cs.CL] UPDATED)
    We propose a type-controlled framework for inquisitive question generation. We annotate an inquisitive question dataset with question types, train question type classifiers, and finetune models for type-controlled question generation. Empirical results demonstrate that we can generate a variety of questions that adhere to specific types while drawing from the source texts. We also investigate strategies for selecting a single question from a generated set, considering both an informative vs. inquisitive question classifier and a pairwise ranker trained from a small set of expert annotations. Question selection using the pairwise ranker yields strong results in automatic and manual evaluation. Our human evaluation assesses multiple aspects of the generated questions, finding that the ranker chooses questions with the best syntax (4.59), semantics (4.37), and inquisitiveness (3.92) on a scale of 1-5, even rivaling the performance of human-written questions.
    A Comparative Analysis of Machine Learning Techniques for IoT Intrusion Detection. (arXiv:2111.13149v2 [cs.CR] CROSS LISTED)
    The digital transformation faces tremendous security challenges. In particular, the growing number of cyber-attacks targeting Internet of Things (IoT) systems restates the need for a reliable detection of malicious network activity. This paper presents a comparative analysis of supervised, unsupervised and reinforcement learning techniques on nine malware captures of the IoT-23 dataset, considering both binary and multi-class classification scenarios. The developed models consisted of Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Isolation Forest (iForest), Local Outlier Factor (LOF) and a Deep Reinforcement Learning (DRL) model based on a Double Deep Q-Network (DDQN), adapted to the intrusion detection context. The most reliable performance was achieved by LightGBM. Nonetheless, iForest displayed good anomaly detection results and the DRL model demonstrated the possible benefits of employing this methodology to continuously improve the detection. Overall, the obtained results indicate that the analyzed techniques are well suited for IoT intrusion detection.
    GPU-accelerated partially linear multiuser detection for 5G and beyond URLLC systems. (arXiv:2201.05024v3 [eess.SP] UPDATED)
    In this feasibility study, we have implemented a recently proposed partially linear multiuser detection algorithm in reproducing kernel Hilbert spaces (RKHSs) on a GPU-accelerated platform. Partially linear multiuser detection, which combines the robustness of linear detection with the power of nonlinear methods, has been proposed for a massive connectivity scenario with the non-orthogonal multiple access (NOMA). This is a promising approach, but detecting payloads within a received orthogonal frequency division multiplexing (OFDM) radio frame requires the execution of a large number of inner product operations, which are the main computational burden of the algorithm. Although inner-product operations consist of simple kernel evaluations, their vast number poses a challenge in ultra-low latency (ULL) applications, because the time needed for computing the inner products might exceed the sub-millisecond latency requirement. To address this problem, this study demonstrates the acceleration of the inner-product operations through massive parallelization. The result is a GPU-accelerated real-time OFDM receiver that enables sub-millisecond latency detection to meet the requirements of 5th generation (5G) and beyond ultra-reliable and low latency communications (URLLC) systems. Moreover, the parallelization and acceleration techniques explored and demonstrated in this study can be extended to many other signal processing algorithms in Hilbert spaces, such as those based on projection onto convex sets (POCS) and adaptive projected subgradient method (APSM) algorithms. Experimental results and comparisons with the state-of-art confirm the effectiveness of our techniques.
    MonoTrack: Shuttle trajectory reconstruction from monocular badminton video. (arXiv:2204.01899v2 [cs.CV] UPDATED)
    Trajectory estimation is a fundamental component of racket sport analytics, as the trajectory contains information not only about the winning and losing of each point, but also how it was won or lost. In sports such as badminton, players benefit from knowing the full 3D trajectory, as the height of shuttlecock or ball provides valuable tactical information. Unfortunately, 3D reconstruction is a notoriously hard problem, and standard trajectory estimators can only track 2D pixel coordinates. In this work, we present the first complete end-to-end system for the extraction and segmentation of 3D shuttle trajectories from monocular badminton videos. Our system integrates badminton domain knowledge such as court dimension, shot placement, physical laws of motion, along with vision-based features such as player poses and shuttle tracking. We find that significant engineering efforts and model improvements are needed to make the overall system robust, and as a by-product of our work, improve state-of-the-art results on court recognition, 2D trajectory estimation, and hit recognition.
    Mingling Foresight with Imagination: Model-Based Cooperative Multi-Agent Reinforcement Learning. (arXiv:2204.09418v2 [cs.MA] UPDATED)
    Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is tough to learn the model of the environment. Significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method based on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, giving the agents foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves the sample efficiency in different partially observable Markov decision process domains.
    Noise mitigation strategies in physical feedforward neural networks. (arXiv:2204.09461v2 [cs.NE] UPDATED)
    Physical neural networks are promising candidates for next generation artificial intelligence hardware. In such architectures, neurons and connections are physically realized and do not leverage digital concepts with their practically infinite signal-to-noise ratio to encode, transduce and transform information. They are therefore prone to noise with a variety of statistical and architectural properties, and effective strategies that leverage network-inherent assets to mitigate noise in a hardware-efficient manner are important in the pursuit of next generation neural network hardware. Based on analytical derivations, we here introduce and analyse a variety of different noise-mitigation approaches. We analytically show that intra-layer connections in which the connection matrix's squared mean exceeds the mean of its square fully suppress uncorrelated noise. We go beyond and develop two synergistic strategies for noise that is uncorrelated and correlated across populations of neurons. First, we introduce the concept of ghost neurons, where each group of neurons perturbed by correlated noise has a negative connection to a single neuron, yet without receiving any input information. Secondly, we show that pooling of neuron populations is an efficient approach to suppress uncorrelated noise. As such, we developed a general noise mitigation strategy leveraging the statistical properties of the different noise terms most relevant in analogue hardware. Finally, we demonstrate the effectiveness of this combined approach for a trained neural network classifying the MNIST handwritten digits, for which we achieve a 4-fold improvement of the output signal-to-noise ratio and increase the classification accuracy almost to the level of the noise-free network.
    Markov Abstractions for PAC Reinforcement Learning in Non-Markov Decision Processes. (arXiv:2205.01053v2 [cs.LG] UPDATED)
    Our work aims at developing reinforcement learning algorithms that do not rely on the Markov assumption. We consider the class of Non-Markov Decision Processes where histories can be abstracted into a finite set of states while preserving the dynamics. We call it a Markov abstraction since it induces a Markov Decision Process over a set of states that encode the non-Markov dynamics. This phenomenon underlies the recently introduced Regular Decision Processes (as well as POMDPs where only a finite number of belief states is reachable). In all such kinds of decision process, an agent that uses a Markov abstraction can rely on the Markov property to achieve optimal behaviour. We show that Markov abstractions can be learned during reinforcement learning. Our approach combines automata learning and classic reinforcement learning. For these two tasks, standard algorithms can be employed. We show that our approach has PAC guarantees when the employed algorithms have PAC guarantees, and we also provide an experimental evaluation.
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v2 [cs.IT] UPDATED)
    Unlike the classical linear model, nonlinear generative models have been addressed only sparsely in the literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of Bayesian inference algorithms and specify the decoupled setting for a given nonlinear model. The replica solution shows that strictly nonlinear models establish an all-or-nothing phase transition: there exists a critical load at which the optimal Bayesian inference changes from being perfect to uncorrelated learning. This finding leads to the design of a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. This interesting result implies that strictly nonlinear generative models are perfectly secured without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.
    Practical Insights of Repairing Model Problems on Image Classification. (arXiv:2205.07116v1 [cs.LG] CROSS LISTED)
    Additional training of a deep learning model can have negative effects on the results, turning an initially positive sample into a negative one (degradation). Such degradation is possible in real-world use cases due to the diversity of sample characteristics. That is, a set of samples is a mixture of critical ones which should not be missed and less important ones. Therefore, we cannot understand the performance by accuracy alone. While existing research aims to prevent model degradation, insights into the related methods are needed to grasp their benefits and limitations. In this talk, we will present implications derived from a comparison of methods for reducing degradation. In particular, we formulated use cases for industrial settings in terms of arrangements of a data set. The results imply that a practitioner should continuously reassess which method is best, taking into account dataset availability and the life cycle of the AI system, because of a trade-off between accuracy and preventing degradation.
    ACReL: Adversarial Conditional value-at-risk Reinforcement Learning. (arXiv:2109.09470v2 [cs.LG] UPDATED)
    In the classical Reinforcement Learning (RL) setting, one aims to find a policy that maximizes its expected return. This objective may be inappropriate in safety-critical domains such as healthcare or autonomous driving, where intrinsic uncertainties due to stochastic policies and environment variability may lead to catastrophic failures. This can be addressed by using the Conditional-Value-at-Risk (CVaR) objective to instill risk-aversion in learned policies. In this paper, we propose Adversarial CVaR Reinforcement Learning (ACReL), a novel adversarial meta-algorithm to optimize the CVaR objective in RL. ACReL is based on a max-min game between a policy player and a learned adversary that perturbs the policy player's state transitions given a finite budget. We prove that the closer the players are to the game's equilibrium point, the closer the learned policy is to the CVaR-optimal one, with a risk tolerance explicitly related to the adversary's budget. We provide a gradient-based training procedure to solve the proposed game by formulating it as a Stackelberg game, enabling the use of deep RL architectures and training algorithms. Empirical experiments show that ACReL matches a CVaR RL state-of-the-art baseline for retrieving CVaR-optimal policies, while also benefiting from theoretical guarantees.
    FastCover: An Unsupervised Learning Framework for Multi-Hop Influence Maximization in Social Networks. (arXiv:2111.00463v2 [cs.SI] UPDATED)
    Finding influential users in social networks is a fundamental problem with many useful applications. Viewing the social network as a graph, the influence of a set of users can be measured by the number of neighbors located within a given number of hops in the network, where each hop marks a step of influence diffusion. In this paper, we reduce the problem of influence maximization (IM) to a budget-constrained d-hop dominating set problem (kdDSP). We propose a unified machine learning (ML) framework, FastCover, to solve kdDSP by learning an efficient greedy strategy in an unsupervised way. As one critical component of the framework, we devise a novel graph neural network (GNN) architecture, graph reversed attention network (GRAT), that captures the diffusion process among neighbors. Unlike most heuristic algorithms and concurrent ML frameworks for combinatorial optimization problems, FastCover determines the entire seed set from the nodes' scores computed with only one forward propagation of the GNN and has a time complexity quasi-linear in the graph size. Experiments on synthetic graphs and real-world social networks demonstrate that FastCover finds solutions with better or comparable quality to those rendered by the concurrent algorithms while achieving a speedup of over 1000x.
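    For reference, the classical greedy strategy that such a framework learns to approximate can be sketched directly: repeatedly pick the node whose d-hop neighborhood covers the most still-uncovered nodes until the budget k is spent. The graph and parameters below are illustrative.

```python
# Greedy budget-constrained d-hop cover: a baseline sketch.
import networkx as nx

def greedy_d_hop_seeds(G, k, d):
    """Pick k seeds, each maximizing newly covered nodes within d hops."""
    balls = {v: set(nx.single_source_shortest_path_length(G, v, cutoff=d))
             for v in G}
    covered, seeds = set(), []
    for _ in range(k):
        v = max(G.nodes, key=lambda u: len(balls[u] - covered))
        seeds.append(v)
        covered |= balls[v]
    return seeds, len(covered)

G = nx.karate_club_graph()
seeds, n_covered = greedy_d_hop_seeds(G, k=3, d=2)
print(seeds, n_covered, "of", G.number_of_nodes())
```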
    SLISEMAP: Supervised dimensionality reduction through local explanations. (arXiv:2201.04455v2 [cs.LG] UPDATED)
    Existing methods for explaining black box learning models often focus on building local explanations of model behaviour for a particular data item. It is possible to create global explanations for all data items, but these explanations generally have low fidelity for complex black box models. We propose a new supervised manifold visualisation method, SLISEMAP, that simultaneously finds local explanations for all data items and builds a (typically) two-dimensional global visualisation of the black box model such that data items with similar local explanations are projected nearby. We provide a mathematical derivation of our problem and an open source implementation using the GPU-optimised PyTorch library. We compare SLISEMAP to multiple popular dimensionality reduction methods and find that SLISEMAP is able to utilise labelled data to create embeddings with consistent local white box models. We also compare SLISEMAP to other model-agnostic local explanation methods and show that SLISEMAP provides comparable explanations and that the visualisations can give a broader understanding of black box regression and classification models.
    Preserving Privacy and Security in Federated Learning. (arXiv:2202.03402v2 [cs.LG] UPDATED)
    Federated learning is known to be vulnerable to both security and privacy issues. Existing research has focused either on preventing poisoning attacks from users or on concealing the local model updates from the server, but not both. However, integrating these two lines of research remains a crucial challenge since they often conflict with one another with respect to the threat model. In this work, we develop a principled framework that offers both privacy guarantees for users and detection of poisoning attacks from them. With a new threat model that includes both an honest-but-curious server and malicious users, we first propose a secure aggregation protocol using homomorphic encryption for the server to combine local model updates in a private manner. Then, a zero-knowledge proof protocol is leveraged to shift the task of detecting attacks in the local models from the server to the users. The key observation here is that the server no longer needs access to the local models for attack detection. Therefore, our framework enables the central server to identify poisoned model updates without violating the privacy guarantees of secure aggregation.
    Probing Pretrained Models of Source Code. (arXiv:2202.08975v2 [cs.SE] UPDATED)
    Deep learning models are widely used for solving challenging code processing tasks, such as code generation or code summarization. Traditionally, a specific model architecture was carefully built to solve a particular code processing task. However, recently general pretrained models such as CodeBERT or CodeT5 have been shown to outperform task-specific models in many applications. While pretrained models are known to learn complex patterns from data, they may fail to understand some properties of source code. To test diverse aspects of code understanding, we introduce a set of diagnostic probing tasks. We show that pretrained models of code indeed contain information about code syntactic structure and correctness, the notions of identifiers, data flow and namespaces, and natural language naming. We also investigate how probing results are affected by using code-specific pretraining objectives, varying the model size, or finetuning.
    Molformer: Motif-based Transformer on 3D Heterogeneous Molecular Graphs. (arXiv:2110.01191v5 [q-bio.QM] UPDATED)
    Procuring expressive molecular representations underpins AI-driven molecule design and scientific discovery. The research to date has mainly focused on atom-level homogeneous molecular graphs, ignoring the rich information in subgraphs or motifs. However, it has been widely accepted that substructures play a dominant role in the identification and determination of molecular properties. To address such issues, we formulate heterogeneous molecular graphs (HMGs) and introduce Molformer to exploit both molecular motifs and 3D geometry. Specifically, we extract functional groups as motifs for small molecules and resort to reinforcement learning to adaptively select quaternary amino acids as motifs for proteins. Then HMGs are constructed with both atom-level and motif-level nodes. To better accommodate those HMGs, we introduce a variant of the Transformer named Molformer, which adopts a heterogeneous self-attention layer to distinguish the interactions between multi-level nodes. It is also coupled with a multi-scale mechanism to capture local fine-grained patterns with increasing contextual scales. An attentive farthest point sampling algorithm is also proposed to obtain the molecular representations. We validate Molformer across several domains, including quantum chemistry, physiology, and biophysics. Experiments show that Molformer outperforms state-of-the-art baselines. Our work provides a promising way to utilize informative motifs from the perspective of multi-level graph construction.
    HAVEN: Hierarchical Cooperative Multi-Agent Reinforcement Learning with Dual Coordination Mechanism. (arXiv:2110.07246v2 [cs.MA] UPDATED)
    Multi-agent reinforcement learning often suffers from the exponentially large action space caused by a large number of agents. This paper proposes a novel value decomposition framework HAVEN based on hierarchical reinforcement learning for the fully cooperative multi-agent problems. To address the instability that arises from the concurrent optimization of high-level and low-level policies and another concurrent optimization of agents, we introduce the dual coordination mechanism of inter-layer strategies and inter-agent strategies. HAVEN does not require domain knowledge and pretraining, and can be applied to any value decomposition variant. Our method achieves desirable results on different decentralized partially observable Markov decision process domains and offers an efficient solution to partition the decision space implicitly.
    Doubly Robust Collaborative Targeted Learning for Debiased Recommendations. (arXiv:2203.10258v2 [cs.IR] UPDATED)
    In recommender systems, the collected data always contains various biases, which poses a challenge for accurate prediction. To address selection bias and confounding bias, the doubly robust (DR) method and its variants show superior performance due to the double robustness property and a smaller bias under inaccurate propensity and error imputation models. However, we theoretically show that the variance of the error-imputation-based (EIB) method is much smaller than that of DR, although EIB may suffer from a much larger bias. In this paper, we propose a doubly robust targeted learning method that effectively combines the small-bias property of DR and the small-variance property of EIB by leveraging the targeted maximum likelihood estimation technique. Theoretical analysis shows that the proposed targeted learning is effective in reducing the variance of DR while maintaining double robustness. To further reduce the bias and variance during the training process, we propose a novel collaborative targeted learning approach that decomposes imputed errors into parametric and nonparametric parts and updates them collaboratively, resulting in more accurate predictions. Both theoretical analysis and experiments demonstrate the superiority of the proposed methods compared with existing debiasing methods.
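    For context, a standard form of the doubly robust estimator discussed above is given below, where $\mathcal{D}$ is the set of all user-item pairs, $o_{u,i}$ indicates whether a rating is observed, $\hat p_{u,i}$ is the learned propensity, and $e_{u,i}$, $\hat e_{u,i}$ are the observed and imputed prediction errors (notation assumed here for illustration, not taken verbatim from the paper):

```latex
\mathcal{E}_{\mathrm{DR}}
  = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}}
    \left[ \hat e_{u,i}
      + \frac{o_{u,i}\,\bigl(e_{u,i} - \hat e_{u,i}\bigr)}{\hat p_{u,i}} \right]
```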
    Predicting Berth Stay for Tanker Terminals: A Systematic and Dynamic Approach. (arXiv:2204.04085v2 [cs.CE] UPDATED)
    Given the trend of digitization and the increasing volume of maritime transport, prediction of vessel berth stay has become a requirement of operations research and scheduling optimization in the era of maritime big data, and it plays a significant part in enhancing port efficiency and maritime logistics. This study proposes a systematic and dynamic approach to predicting berth stay for tanker terminals. The approach covers three innovative aspects: 1) The data sources employed are multi-faceted, including cargo operation data from tanker terminals, time-series data from the automatic identification system (AIS), etc. 2) The process of berth stay is decomposed into multiple blocks according to data analysis and information extraction, and practical operation scenarios are developed accordingly. 3) The predictive models of berth stay are developed on the basis of the prior data analysis and information extraction under two methods, regression and decomposed distribution. The models are evaluated under four dynamic scenarios with certain designated cargoes among two different terminals. The evaluation results show that the proposed approach can predict berth stay with accuracy up to 98.81%, validated against historical baselines, and demonstrate that the approach has the dynamic capability of predicting berth stay across the scenarios. The model may potentially be applied for short-term pilot-booking or scheduling optimization within a reasonable time frame for the advancement of port intelligence and logistics efficiency.
    Decentral and Incentivized Federated Learning Frameworks: A Systematic Literature Review. (arXiv:2205.07855v2 [cs.LG] UPDATED)
    The advent of Federated Learning (FL) has ignited a new paradigm for parallel and confidential decentralized Machine Learning (ML), with the potential of utilizing the computational power of a vast number of IoT, mobile and edge devices without data leaving the respective device, ensuring privacy by design. Yet, in order to scale this new paradigm beyond small groups of already entrusted entities towards mass adoption, the Federated Learning Framework (FLF) has to become (i) truly decentralized and (ii) participants have to be incentivized. This is the first systematic literature review analyzing holistic FLFs in the domain of both decentralized and incentivized federated learning. 422 publications were retrieved by querying 12 major scientific databases, and 40 articles remained after a systematic review and filtering process for in-depth examination. Although they have massive potential to direct the future of a more distributed and secure AI, none of the analyzed FLFs is production-ready. The approaches vary heavily in terms of use cases, system design, solved issues and thoroughness. We are the first to provide a systematic approach to classify and quantify differences between FLFs, exposing the limitations of current works and deriving future directions for research in this novel domain.
    You Only Cut Once: Boosting Data Augmentation with a Single Cut. (arXiv:2201.12078v2 [cs.CV] UPDATED)
    We present You Only Cut Once (YOCO) for performing data augmentations. YOCO cuts one image into two pieces and performs data augmentations individually within each piece. Applying YOCO improves the diversity of the augmentation per sample and encourages neural networks to recognize objects from partial information. YOCO is parameter-free, easy to use, and boosts almost all augmentations at no extra cost. Thorough experiments are conducted to evaluate its effectiveness. We first demonstrate that YOCO can be seamlessly applied to varying data augmentations and neural network architectures, and brings performance gains on CIFAR and ImageNet classification tasks, sometimes surpassing conventional image-level augmentation by large margins. Moreover, we show YOCO benefits contrastive pre-training toward a more powerful representation that can be better transferred to multiple downstream tasks. Finally, we study a number of variants of YOCO and empirically analyze the performance for respective settings. Code is available at GitHub.
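    The core YOCO operation is short enough to sketch directly: cut the batch once (here along the width), augment each half independently, and concatenate the halves back. The horizontal flip is a stand-in for any augmentation function.

```python
# A minimal YOCO-style wrapper around an arbitrary augmentation (sketch).
import torch

def yoco(images, aug):
    """images: (N, C, H, W) tensor; aug: any per-batch augmentation fn."""
    left, right = images.chunk(2, dim=3)      # one cut along the width
    return torch.cat([aug(left), aug(right)], dim=3)

hflip = lambda x: torch.flip(x, dims=[3])     # stand-in augmentation
batch = torch.randn(8, 3, 32, 32)
out = yoco(batch, hflip)
print(out.shape)                              # torch.Size([8, 3, 32, 32])
```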
    Exploring Deep Reinforcement Learning-Assisted Federated Learning for Online Resource Allocation in Privacy-Preserving EdgeIoT. (arXiv:2202.07391v2 [cs.LG] UPDATED)
    Federated learning (FL) has been increasingly considered to preserve data training privacy from eavesdropping attacks in mobile edge computing-based Internet of Things (EdgeIoT). On the one hand, the learning accuracy of FL can be improved by selecting the IoT devices with large datasets for training, which gives rise to a higher energy consumption. On the other hand, the energy consumption can be reduced by selecting the IoT devices with small datasets for FL, resulting in a falling learning accuracy. In this paper, we formulate a new resource allocation problem for privacy-preserving EdgeIoT to balance the learning accuracy of FL and the energy consumption of the IoT devices. We propose a new federated learning-enabled twin-delayed deep deterministic policy gradient (FL-DLT3) framework to achieve the optimal accuracy and energy balance in a continuous domain. Furthermore, long short-term memory (LSTM) is leveraged in FL-DLT3 to predict the time-varying network state while FL-DLT3 is trained to select the IoT devices and allocate the transmit power. Numerical results demonstrate that the proposed FL-DLT3 achieves fast convergence (less than 100 iterations) while improving the FL accuracy-to-energy-consumption ratio by 51.8% compared to existing state-of-the-art benchmarks.
    Ranking of Communities in Multiplex Spatiotemporal Models of Brain Dynamics. (arXiv:2203.09281v2 [q-bio.NC] UPDATED)
    As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occur spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models, which we term Hidden Markov Graph Models (HMGMs). This interpretation allows dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest, as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions.
    CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. (arXiv:2103.06874v4 [cs.CL] UPDATED)
    Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
    Describing Differences between Text Distributions with Natural Language. (arXiv:2201.12323v2 [cs.CL] UPDATED)
    How do two distributions of texts differ? Humans are slow at answering this, since discovering patterns might require tediously reading through hundreds of samples. We propose to automatically summarize the differences by "learning a natural language hypothesis": given two distributions $D_{0}$ and $D_{1}$, we search for a description that is more often true for $D_{1}$, e.g., "is military-related." To tackle this problem, we fine-tune GPT-3 to propose descriptions with the prompt: "[samples of $D_{0}$] + [samples of $D_{1}$] + the difference between them is_____." We then re-rank the descriptions by checking how often they hold on a larger set of samples with a learned verifier. On a benchmark of 54 real-world binary classification tasks, while GPT-3 Curie (13B) only generates a description similar to human annotation 7% of the time, the performance reaches 61% with fine-tuning and re-ranking, and our best system using GPT-3 Davinci (175B) reaches 76%. We apply our system to describe distribution shifts, debug dataset shortcuts, summarize unknown tasks, and label text clusters, and present analyses based on automatically generated descriptions.
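    To make the propose-then-verify loop concrete, here is a hedged Python sketch; propose_descriptions and verifier are stand-ins for the fine-tuned GPT-3 proposer and the learned verifier, which are assumptions for illustration rather than the authors' released interfaces.

        import random

        def build_prompt(samples_d0, samples_d1, k=5):
            # Roughly the paper's fill-in-the-blank prompt format: a few
            # samples from each group followed by the difference query.
            a = "\n".join(f"Group A: {s}" for s in random.sample(samples_d0, k))
            b = "\n".join(f"Group B: {s}" for s in random.sample(samples_d1, k))
            return f"{a}\n{b}\nthe difference between them is"

        def rerank(descriptions, verifier, d0, d1):
            # Rank candidates by how much more often they hold on D1 than
            # on D0, checked on a larger sample set than the prompt used.
            # verifier(text, description) -> probability is assumed here.
            def gap(desc):
                on_d1 = sum(verifier(t, desc) for t in d1) / len(d1)
                on_d0 = sum(verifier(t, desc) for t in d0) / len(d0)
                return on_d1 - on_d0
            return sorted(descriptions, key=gap, reverse=True)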
    PFGE: Parsimonious Fast Geometric Ensembling of DNNs. (arXiv:2202.06658v5 [cs.LG] UPDATED)
    Ensemble methods have been widely used to improve the generalization of machine learning methods, but they are hard to use in deep learning systems, as training an ensemble of deep neural networks (DNNs) incurs a much higher computational overhead for model training. Recently, advanced techniques such as fast geometric ensembling (FGE) and snapshot ensembles have been proposed. These methods can train model ensembles in the same amount of time as a single model, thus getting around the hurdle of training time. However, their memory overhead for test-time inference remains much higher than that of single-model-based methods. Here we propose a parsimonious FGE (PFGE) that employs a lightweight ensemble of higher-performing DNNs, generated by successively performed stochastic weight averaging procedures. Experimental results across different advanced DNN architectures on the benchmark datasets CIFAR-$\{10,100\}$ and ImageNet demonstrate that PFGE matches the state-of-the-art FGE method in terms of generalization error, yet requires only 20% of its memory overhead for test-time inference. Our code is available at https://github.com/ZJLAB-AMMI/PFGE.
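    A rough sketch of the recipe under stated assumptions (hyper-parameters, learning-rate schedule and batch-norm handling are simplified; see the repository above for the authors' implementation):

        import copy
        import torch

        def pfge(model, loader, make_opt, rounds=4, epochs_per_round=5):
            # Run several successive SWA phases; each phase produces one
            # averaged model, and only these few averages form the ensemble.
            ensemble = []
            for _ in range(rounds):
                swa_model = torch.optim.swa_utils.AveragedModel(model)
                optimizer = make_opt(model.parameters())
                for _ in range(epochs_per_round):
                    for x, y in loader:
                        optimizer.zero_grad()
                        torch.nn.functional.cross_entropy(model(x), y).backward()
                        optimizer.step()
                    swa_model.update_parameters(model)
                ensemble.append(copy.deepcopy(swa_model))
                # Start the next phase from the current average (one design
                # choice; BN-statistics refresh via swa_utils.update_bn omitted).
                model.load_state_dict(swa_model.module.state_dict())
            return ensemble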
    Finite-Sum Coupled Compositional Stochastic Optimization: Theory and Applications. (arXiv:2202.12396v3 [math.OC] UPDATED)
    This paper studies stochastic optimization for a sum of compositional functions, where the inner-level function of each summand is coupled with the corresponding summation index. We refer to this family of problems as finite-sum coupled compositional optimization (FCCO). It has broad applications in machine learning for optimizing non-convex or convex compositional measures/objectives such as average precision (AP), p-norm push, listwise ranking losses, neighborhood component analysis (NCA), deep survival analysis, and deep latent variable models, all of which deserve finer analysis. Yet existing algorithms and analyses are restricted in one aspect or another. The contribution of this paper is to provide a comprehensive analysis of a simple stochastic algorithm for both non-convex and convex objectives. The key results are improved oracle complexities with parallel speed-up, obtained via a moving-average-based stochastic estimator with mini-batching. Our theoretical analysis also exhibits new insights for improving the practical implementation by sampling batches of equal size for the outer and inner levels. Numerical experiments on AP maximization, NCA, and p-norm push optimization corroborate some aspects of the theory.
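    A toy sketch of the moving-average inner-function tracker at the heart of such algorithms, for min_w (1/n) sum_i f(g_i(w)); the function handles and step sizes below are illustrative assumptions, not the paper's exact algorithm:

        import numpy as np

        def fcco_step(w, u, idx_batch, g_est, g_grad, f_grad, gamma=0.1, lr=0.01):
            # u[i] tracks the inner value g_i(w) with a moving average over
            # mini-batch estimates; the outer gradient is then chained through
            # the tracked value instead of a fresh (biased) inner estimate.
            grad = np.zeros_like(w)
            for i in idx_batch:
                u[i] = (1 - gamma) * u[i] + gamma * g_est(i, w)
                grad += f_grad(u[i]) * g_grad(i, w)
            w = w - lr * grad / len(idx_batch)
            return w, u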
    A simple yet effective baseline for non-attributed graph classification. (arXiv:1811.03508v3 [cs.LG] UPDATED)
    Graphs are complex objects that do not lend themselves easily to typical learning tasks. Recently, a range of approaches based on graph kernels or graph neural networks have been developed for graph classification and for representation learning on graphs in general. As the developed methodologies become more sophisticated, it is important to understand which components of the increasingly complex methods are necessary or most effective. As a first step, we develop a simple yet meaningful graph representation, and explore its effectiveness in graph classification. We test our baseline representation for the graph classification task on a range of graph datasets. Interestingly, this simple representation achieves similar performance as the state-of-the-art graph kernels and graph neural networks for non-attributed graph classification. Its performance on classifying attributed graphs is slightly weaker as it does not incorporate attributes. However, given its simplicity and efficiency, we believe that it still serves as an effective baseline for attributed graph classification. Our graph representation is efficient (linear-time) to compute. We also provide a simple connection with the graph neural networks. Note that these observations are only for the task of graph classification while existing methods are often designed for a broader scope including node embedding and link prediction. The results are also likely biased due to the limited amount of benchmark datasets available. Nevertheless, the good performance of our simple baseline calls for the development of new, more comprehensive benchmark datasets so as to better evaluate and analyze different graph learning methods. Furthermore, given the computational efficiency of our graph summary, we believe that it is a good candidate as a baseline method for future graph classification (or even other graph learning) studies.
    A Scalable AutoML Approach Based on Graph Neural Networks. (arXiv:2111.00083v3 [cs.LG] UPDATED)
    AutoML systems build machine learning models automatically by performing a search over valid data transformations and learners, along with hyper-parameter optimization for each learner. Many AutoML systems use meta-learning to guide the search for optimal pipelines. In this work, we present a novel meta-learning system called KGpip, which (1) builds a database of datasets and corresponding pipelines by mining thousands of scripts with program analysis, (2) uses dataset embeddings to find similar datasets in the database based on their content instead of metadata-based features, and (3) models AutoML pipeline creation as a graph generation problem, to succinctly characterize the diverse pipelines seen for a single dataset. KGpip's meta-learning is designed as a sub-component for AutoML systems. We demonstrate this by integrating KGpip with two AutoML systems. Our comprehensive evaluation using 126 datasets, including those used by the state-of-the-art systems, shows that KGpip significantly outperforms these systems.
    Model-based Clustering with Missing Not At Random Data. (arXiv:2112.10425v2 [stat.ML] UPDATED)
    Traditional ways of handling missing values are not designed for the clustering purpose, and they rarely apply to the general case, though frequent in practice, of Missing Not At Random (MNAR) values. This paper proposes to embed MNAR data directly within model-based clustering algorithms. We introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism. Eight different MNAR models are proposed, which may depend on the underlying (unknown) classes and/or the values of the missing variables themselves. We prove the identifiability of the parameters of both the data distribution and the mechanism, whatever the type of data and the mechanism, and propose an EM or Stochastic EM algorithm to estimate them. The code is available at \url{https://github.com/AudeSportisse/Clustering-MNAR}. We also prove that MNAR models for which the missingness depends on the class membership have the nice property that statistical inference can be carried out on the data matrix concatenated with the mask by considering a MAR mechanism instead. Finally, we perform empirical evaluations for the proposed sub-models on synthetic data and we illustrate the relevance of our method on a medical register, the TraumaBase® dataset.
    Shape complexity in cluster analysis. (arXiv:2205.08046v2 [cs.LG] UPDATED)
    In cluster analysis, a common first step is to scale the data, aiming to better partition them into clusters. Even though many different techniques have been introduced to this end over the years, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help with the determination of appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
    Translatotron 2: High-quality direct speech-to-speech translation with voice preservation. (arXiv:2107.08661v5 [cs.CL] UPDATED)
    We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single attention module that connects them. Experimental results on three datasets consistently show that Translatotron 2 outperforms the original Translatotron by a large margin on both translation quality (up to +15.5 BLEU) and speech generation quality, and approaches that of cascade systems. In addition, we propose a simple method for preserving speakers' voices from the source speech in the translated speech in a different language. Unlike existing approaches, the proposed method is able to preserve each speaker's voice across speaker turns without requiring speaker segmentation. Furthermore, compared to existing approaches, it better preserves speakers' privacy and mitigates potential misuse of voice cloning for creating spoofed audio artifacts.
    Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias. (arXiv:2112.06868v2 [cs.LG] UPDATED)
    Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance that is correctly supported on the ground-truth manifold. They gave partial support for that conjecture by showing that some optima of the VAE loss do satisfy this property, but did not analyze the training dynamics. In this paper, we show that for linear encoders/decoders the conjecture is true: VAE training does recover a generator with support equal to the ground-truth manifold, and does so due to an implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that VAE training frequently learns a higher-dimensional manifold that is a superset of the ground-truth manifold.
    Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks. (arXiv:2112.08866v3 [stat.ME] UPDATED)
    Recent advances in probabilistic deep learning enable amortized Bayesian inference in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations represent reality somewhat inaccurately? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of SNPE-C (APT) and the BayesFlow framework under these misspecifications. We propose an augmented optimization objective that imposes a probabilistic structure on the learned latent data summary space, and we utilize maximum mean discrepancy (MMD) to detect, during inference, potentially catastrophic misspecifications that undermine the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase as the distance between the latent summary distributions of the true data-generating process and the training simulations grows. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized simulation-based Bayesian inference.
    A label efficient two-sample test. (arXiv:2111.08861v3 [cs.LG] UPDATED)
    Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed \emph{bimodal query} is used to query labels of samples from both classes, and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at \url{https://github.com/wayne0908/Label-Efficient-Two-Sample}.
    Leveraging Global Binary Masks for Structure Segmentation in Medical Images. (arXiv:2205.09107v1 [eess.IV])
    Deep learning (DL) models for medical image segmentation are highly influenced by intensity variations of input images and lack generalization due to primarily utilizing pixels' intensity information for inference. Acquiring sufficient training data is another challenge limiting models' applications. We proposed to leverage the consistency of organs' anatomical shape and position information in medical images. We introduced a framework leveraging recurring anatomical patterns through global binary masks for organ segmentation. Two scenarios were studied: 1) global binary masks were the model's (i.e., a U-Net's) only input, forcing it to exclusively encode organs' position and shape information for segmentation/localization; 2) global binary masks were incorporated as an additional channel, functioning as position/shape clues to mitigate training data scarcity. Two datasets of brain and heart CT images with their ground truth were split into (26:10:10) and (12:3:5) cases for training, validation, and testing, respectively. Training exclusively on global binary masks led to Dice scores of 0.77(0.06) and 0.85(0.04), with average Euclidean distances of 3.12(1.43)mm and 2.5(0.93)mm relative to the center of mass of the ground truth for the brain and heart structures, respectively. The outcomes indicate that a surprising degree of position and shape information is encoded through global binary masks. Incorporating global binary masks led to significantly higher accuracy relative to the model trained on only CT images in small subsets of training data; the performance improved by 4.3-125.3% and 1.3-48.1% for 1-8 training cases of the brain and heart datasets, respectively. The findings imply the advantages of utilizing global binary masks for building generalizable models and for compensating for training data scarcity.
    Dynamic Predictions of Postoperative Complications from Explainable, Uncertainty-Aware, and Multi-Task Deep Neural Networks. (arXiv:2004.12551v2 [cs.LG] UPDATED)
    Accurate prediction of postoperative complications can inform shared decisions regarding prognosis, preoperative risk-reduction, and postoperative resource use. We hypothesized that multi-task deep learning models would outperform random forest models in predicting postoperative complications, and that integrating high-resolution intraoperative physiological time series would result in more granular and personalized health representations that would improve prognostication compared to preoperative predictions. In a longitudinal cohort study of 56,242 patients undergoing 67,481 inpatient surgical procedures at a university medical center, we compared deep learning models with random forests for predicting nine common postoperative complications using preoperative, intraoperative, and perioperative patient data. Our study indicated several significant results across experimental settings that suggest the utility of deep learning for capturing more precise representations of patient health for augmented surgical decision support. Multi-task learning improved efficiency by reducing computational resources without compromising predictive performance. Integrated gradients interpretability mechanisms identified potentially modifiable risk factors for each complication. Monte Carlo dropout methods provided a quantitative measure of prediction uncertainty that has the potential to enhance clinical trust. Multi-task learning, interpretability mechanisms, and uncertainty metrics demonstrated potential to facilitate effective clinical implementation.
    Detecting micro fractures: A comprehensive comparison of conventional and machine-learning based segmentation methods. (arXiv:2103.12821v2 [cs.LG] UPDATED)
    Studying porous rock materials with X-Ray Computed Tomography (XRCT) has been established as a standard procedure for the non-destructive visualization of flow and transport in opaque porous media. Despite the recent advances in the field of XRCT, some challenges still remain due to the inherent noise and imaging artefacts in the produced data. These issues become even more profound when the objective is the identification of fractures and/or fracture networks. The challenge is the limited contrast between the regions of interest and the neighboring areas, which can mostly be attributed to the minute aperture of the fractures. To overcome this challenge, it has been a common approach to apply digital image processing, such as filtering, to enhance the signal-to-noise ratio. Additionally, segmentation methods based on threshold/morphology schemes can be employed to obtain enhanced information from the features of interest. However, this workflow needs a skillful operator to fine-tune its input parameters, and the required computation time significantly increases due to the complexity of the available methods and the large volume of the dataset. In this study, based on a dataset produced by the successful visualization of a fracture network in Carrara marble with XRCT, we present the segmentation results from a number of segmentation methods. Three conventional and two machine-learning-based methods are evaluated. The segmentation results from all five methods are compared to each other in terms of segmentation quality and time efficiency. Due to memory limitations, and in order to accomplish a fair comparison, all the methods are employed in a 2D scheme. The output of the 2D U-net model, which is one of the adopted machine-learning-based segmentation methods, shows the best performance regarding the quality of segmentation and the required processing time.
    Finite-Bit Quantization For Distributed Algorithms With Linear Convergence. (arXiv:2107.11304v3 [math.OC] UPDATED)
    This paper studies distributed algorithms for (strongly convex) composite optimization problems over mesh networks, subject to quantized communications. Instead of focusing on a specific algorithmic design, a black-box model is proposed, casting linearly convergent distributed algorithms in the form of fixed-point iterates. The algorithmic model is equipped with a novel random or deterministic Biased Compression (BC) rule on the quantizer design, and a new Adaptive encoding Nonuniform Quantizer (ANQ) coupled with a communication-efficient encoding scheme, which implements the BC-rule using a finite number of bits (below machine precision). This fills a gap existing in most state-of-the-art quantization schemes, such as those based on the popular compression rule, which rely on communication of some scalar signals with negligible quantization error (in practice quantized at the machine precision). A unified communication complexity analysis is developed for the black-box model, determining the average number of bits required to reach a solution of the optimization problem within a target accuracy. It is shown that the proposed BC-rule preserves linear convergence of the unquantized algorithms, and a trade-off between convergence rate and communication cost under ANQ-based quantization is characterized. Numerical results validate our theoretical findings and show that distributed algorithms equipped with the proposed ANQ have more favorable communication cost than algorithms using state-of-the-art quantization rules.
    Optimizing Operating Points for High Performance Lesion Detection and Segmentation Using Lesion Size Reweighting. (arXiv:2107.12978v2 [eess.IV] UPDATED)
    There are many clinical contexts which require accurate detection and segmentation of all focal pathologies (e.g. lesions, tumours) in patient images. In cases where there is a mix of small and large lesions, standard binary cross-entropy loss will result in better segmentation of large lesions at the expense of missing small ones. Adjusting the operating point to accurately detect all lesions generally leads to oversegmentation of large lesions. In this work, we propose a novel reweighting strategy to eliminate this performance gap, increasing small pathology detection performance while maintaining segmentation accuracy. We show that our reweighting strategy vastly outperforms competing strategies based on experiments on a large-scale, multi-scanner, multi-center dataset of Multiple Sclerosis patient images.
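    The following is a minimal sketch of one size-based reweighting, not necessarily the authors' exact formulation: each lesion voxel is weighted inversely to the size of its connected component, so small lesions are up-weighted relative to large ones.

        import torch
        import torch.nn.functional as F

        def size_reweighted_bce(logits, target, lesion_sizes, gamma=1.0):
            # lesion_sizes maps each voxel to the voxel count of the lesion
            # (connected component) it belongs to, and 0 for background.
            weights = torch.ones_like(target, dtype=torch.float)
            pos = lesion_sizes > 0
            if pos.any():
                mean_size = lesion_sizes[pos].float().mean()
                # Small lesions get weights > 1, large lesions < 1.
                weights[pos] = (mean_size / lesion_sizes[pos].float()) ** gamma
            return F.binary_cross_entropy_with_logits(
                logits, target.float(), weight=weights)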
    Testing the Robustness of a BiLSTM-based Structural Story Classifier. (arXiv:2201.02733v2 [cs.CL] UPDATED)
    The growing prevalence of counterfeit stories on the internet has fostered significant interest in fast and scalable detection of fake news in the machine learning community. While several machine learning techniques for this purpose have emerged, we observe that there is a need to evaluate the impact of noise on these techniques' performance, where noise constitutes news articles being mistakenly labeled as fake (or real). This work takes a step in that direction, where we examine the impact of noise on a state-of-the-art, structural model for fake news detection based on a BiLSTM (Bidirectional Long Short-Term Memory) network: Hierarchical Discourse-level Structure for Fake News Detection by Karimi and Tang (reference no. 9 therein).
    Linear Speedup in Personalized Collaborative Learning. (arXiv:2111.05968v3 [cs.LG] UPDATED)
    Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In this work, we formalize the personalized collaborative learning problem as a stochastic optimization of a task $0$ while given access to $N$ related but different tasks $1,\dots, N$. We give convergence guarantees for two algorithms in this setting -- a popular collaboration method known as \emph{weighted gradient averaging}, and a novel \emph{bias correction} method -- and explore conditions under which we can achieve linear speedup w.r.t. the number of auxiliary tasks $N$. Further, we also empirically study their performance confirming our theoretical insights.
    Learning Selective Sensor Fusion for State Estimation. (arXiv:1912.13077v2 [cs.CV] UPDATED)
    Autonomous vehicles and mobile robotic systems are typically equipped with multiple sensors to provide redundancy. By integrating the observations from different sensors, these mobile agents are able to perceive the environment and estimate system states, e.g. locations and orientations. Although deep learning approaches for multimodal odometry estimation and localization have gained traction, they rarely focus on the issue of robust sensor fusion - a necessary consideration to deal with noisy or incomplete sensor observations in the real world. Moreover, current deep odometry models suffer from a lack of interpretability. To this extent, we propose SelectFusion, an end-to-end selective sensor fusion module which can be applied to useful pairs of sensor modalities, such as monocular images and inertial measurements, or depth images and LIDAR point clouds. Our model is a uniform framework that is not restricted to a specific modality or task. During prediction, the network is able to assess the reliability of the latent features from different sensor modalities and to estimate the trajectory at both scale and global pose. In particular, we propose two fusion modules - a deterministic soft fusion and a stochastic hard fusion - and offer a comprehensive study of the new strategies compared to trivial direct fusion. We extensively evaluate all fusion strategies both on public datasets and on progressively degraded datasets that present synthetic occlusions, noisy and missing data, and time misalignment between sensors, and we investigate the effectiveness of the different strategies at attending to the most reliable features, which in itself provides insights into the operation of the various models.
    Increasing-Margin Adversarial (IMA) Training to Improve Adversarial Robustness of Neural Networks. (arXiv:2005.09147v8 [cs.CV] UPDATED)
    Deep neural networks (DNNs) are vulnerable to adversarial noise. By adding adversarial noise to training samples, adversarial training can improve a model's robustness against adversarial noise. However, adversarial training samples with excessive noise can harm standard accuracy, which may be unacceptable for many medical image analysis applications. This issue has been termed the trade-off between standard accuracy and adversarial robustness. In this paper, we hypothesize that this issue may be alleviated if the adversarial samples for training are placed right on the decision boundaries. Based on this hypothesis, we design an adaptive adversarial training method, named IMA. For each individual training sample, IMA makes a sample-wise estimation of the upper bound of the adversarial perturbation. In the training process, each sample-wise adversarial perturbation is gradually increased to match the margin. Once an equilibrium state is reached, the adversarial perturbation stops increasing. IMA is evaluated on publicly available datasets under two popular adversarial attacks, PGD and IFGSM. The results show that: (1) IMA significantly improves the adversarial robustness of DNN classifiers, achieving state-of-the-art performance; (2) IMA has a minimal reduction in clean accuracy among all competing defense methods; (3) IMA can be applied to pretrained models to reduce time cost; (4) IMA can be applied to state-of-the-art medical image segmentation networks, with outstanding performance. We hope our work may help to lift the trade-off between adversarial robustness and clean accuracy and facilitate the development of robust applications in the medical field. The source code will be released when this paper is published.
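    As a hedged sketch of the margin-matching step (variable names, step sizes and the stopping rule are illustrative simplifications of the paper's procedure):

        import torch

        def update_sample_margins(eps, attack_flipped, step=2/255, eps_max=8/255):
            # eps holds one perturbation budget per training sample. If a PGD
            # attack at the current budget does not yet flip the prediction,
            # the sample is still inside its margin, so the budget grows; once
            # attacks succeed, the budget has reached the decision boundary
            # and stops increasing (the equilibrium described above).
            grown = torch.clamp(eps + step, max=eps_max)
            return torch.where(attack_flipped, eps, grown)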
    Physics-informed Guided Disentanglement in Generative Networks. (arXiv:2107.14229v2 [cs.CV] UPDATED)
    Image-to-image translation (i2i) networks suffer from entanglement effects in the presence of physics-related phenomena in the target domain (such as occlusions, fog, etc.), lowering the translation quality, controllability and variability. In this paper, we build upon a collection of simple physics models and present a comprehensive method for disentangling visual traits in target images, guiding the process with a physical model that renders some of the target traits while learning the remaining ones. Because it allows explicit and interpretable outputs, our physical model (optimally regressed on the target) allows generating unseen scenarios in a controllable manner. We also extend our framework, showing its versatility for neural-guided disentanglement. The results show our disentanglement strategies dramatically increase performance, qualitatively and quantitatively, in several challenging scenarios for image translation.
    Link Scheduling using Graph Neural Networks. (arXiv:2109.05536v2 [eess.SP] UPDATED)
    Efficient scheduling of transmissions is a key problem in wireless networks. The main challenge stems from the fact that optimal link scheduling involves solving a maximum weighted independent set (MWIS) problem, which is known to be NP-hard. In practical schedulers, centralized and distributed greedy heuristics are commonly used to approximately solve the MWIS problem. However, these greedy heuristics mostly ignore important topological information of the wireless network. To overcome this limitation, we propose fast heuristics based on graph convolutional networks (GCNs) that can be implemented in centralized and distributed manners. Our centralized heuristic is based on tree search guided by a GCN and 1-step rollout. In our distributed MWIS solver, a GCN generates topology-aware node embeddings that are combined with per-link utilities before invoking a distributed greedy solver. Moreover, a novel reinforcement learning scheme is developed to train the GCN in a non-differentiable pipeline. Test results on medium-sized wireless networks show that our centralized heuristic can reach a near-optimal solution quickly, and our distributed heuristic based on a shallow GCN can reduce by nearly half the suboptimality gap of the distributed greedy solver with minimal increase in complexity. The proposed schedulers also exhibit good generalizability across graph and weight distributions.
    PocketNet: A Smaller Neural Network for Medical Image Analysis. (arXiv:2104.10745v3 [eess.IV] UPDATED)
    Medical imaging deep learning models are often large and complex, requiring specialized hardware to train and evaluate these models. To address such issues, we propose the PocketNet paradigm to reduce the size of deep learning models by throttling the growth of the number of channels in convolutional neural networks. We demonstrate that, for a range of segmentation and classification tasks, PocketNet architectures produce results comparable to that of conventional neural networks while reducing the number of parameters by multiple orders of magnitude, using up to 90% less GPU memory, and speeding up training times by up to 40%, thereby allowing such models to be trained and deployed in resource-constrained settings.
    Cohort Bias Adaptation in Aggregated Datasets for Lesion Segmentation. (arXiv:2108.00713v2 [eess.IV] UPDATED)
    Many automatic machine learning models developed for focal pathology (e.g. lesions, tumours) detection and segmentation perform well, but do not generalize as well to new patient cohorts, impeding their widespread adoption into real clinical contexts. One strategy to create a more diverse, generalizable training set is to naively pool datasets from different cohorts. Surprisingly, training on this "big data" does not necessarily increase, and may even reduce, overall performance and model generalizability, due to the existence of cohort biases that affect label distributions. In this paper, we propose a generalized affine conditioning framework to learn and account for cohort biases across multi-source datasets, which we call Source-Conditioned Instance Normalization (SCIN). Through extensive experimentation on three different, large scale, multi-scanner, multi-centre Multiple Sclerosis (MS) clinical trial MRI datasets, we show that our cohort bias adaptation method (1) improves performance of the network on pooled datasets relative to naively pooling datasets and (2) can quickly adapt to a new cohort by fine-tuning the instance normalization parameters, thus learning the new cohort bias with only 10 labelled samples.
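    A minimal sketch of such a source-conditioned normalization layer (the exact parameterization and placement in the paper's network may differ):

        import torch
        import torch.nn as nn

        class SourceConditionedInstanceNorm(nn.Module):
            # One learned (scale, shift) pair per source cohort, selected by
            # an integer cohort id; adapting to a new cohort then only
            # requires fine-tuning a fresh pair of embeddings.
            def __init__(self, num_features, num_sources):
                super().__init__()
                self.norm = nn.InstanceNorm2d(num_features, affine=False)
                self.gamma = nn.Embedding(num_sources, num_features)
                self.beta = nn.Embedding(num_sources, num_features)
                nn.init.ones_(self.gamma.weight)
                nn.init.zeros_(self.beta.weight)

            def forward(self, x, source_id):          # x: (N, C, H, W)
                g = self.gamma(source_id)[:, :, None, None]
                b = self.beta(source_id)[:, :, None, None]
                return self.norm(x) * g + b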
    Assisted Learning for Organizations with Limited Data. (arXiv:2109.09307v3 [cs.LG] UPDATED)
    We develop an assisted learning framework for assisting organization-level learners to improve their learning performance with limited and imbalanced data. In particular, learners at the organization level usually have sufficient computation resources, but are subject to stringent collaboration policies and information privacy requirements. Their limited, imbalanced data often cause biased inference and sub-optimal decision-making. In our assisted learning framework, an organizational learner purchases an assistance service from a service provider and aims to enhance its model performance within a few assistance rounds. We develop effective stochastic training algorithms for assisted deep learning and assisted reinforcement learning. Different from existing distributed algorithms that need to frequently transmit gradients or models, our framework allows the learner to only occasionally share information with the service provider, and still achieve a near-oracle model as if all the data were centralized.
    Masked Autoencoders As Spatiotemporal Learners. (arXiv:2205.09113v1 [cs.CV])
    This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to the information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
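    A sketch of spacetime-agnostic random masking on patchified video tokens, in the spirit of (but not copied from) the MAE implementation:

        import torch

        def random_spacetime_mask(tokens, mask_ratio=0.9):
            # tokens: (N, L, D) with L = T*H*W spacetime patches. Keep a
            # random 10%; the shuffle indices let the decoder restore order.
            n, l, d = tokens.shape
            keep = int(l * (1 - mask_ratio))
            noise = torch.rand(n, l, device=tokens.device)
            ids_shuffle = noise.argsort(dim=1)   # random permutation per clip
            ids_keep = ids_shuffle[:, :keep]
            kept = torch.gather(tokens, 1, ids_keep[:, :, None].expand(-1, -1, d))
            return kept, ids_shuffle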
    Efficient PAC Reinforcement Learning in Regular Decision Processes. (arXiv:2105.06784v3 [cs.AI] UPDATED)
    Recently, regular decision processes have been proposed as a well-behaved form of non-Markov decision process. Regular decision processes are characterised by a transition function and a reward function that depend on the whole history, though regularly (as in regular languages). In practice, both the transition and the reward functions can be seen as finite transducers. We study reinforcement learning in regular decision processes. Our main contribution is to show that a near-optimal policy can be PAC-learned in polynomial time in a set of parameters that describe the underlying decision process. We argue that the identified set of parameters is minimal and that it reasonably captures the difficulty of a regular decision process.
    Phy-Q: A Testbed for Physical Reasoning. (arXiv:2108.13696v2 [cs.AI] UPDATED)
    Humans are well-versed in reasoning about the behaviors of physical objects and choosing actions accordingly to accomplish tasks, while this remains a major challenge for AI. To facilitate research addressing this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take an action appropriately. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. For each scenario, we create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific strategic physical rule. By having such a design, we evaluate two distinct levels of generalization, namely the local generalization and the broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score) that reflects the physical reasoning intelligence of an agent. Our evaluation shows that 1) all agents are far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach the human-level Phy-Q score. Website: https://github.com/phy-q/benchmark
    GSN: A Graph Neural Network Inspired by Spring Network. (arXiv:2201.12994v3 [cs.LG] UPDATED)
    The design of Graph Neural Networks (GNNs) that operate on both homophilous and heterophilous graphs has received research attention in recent years. Existing heterophilous GNNs, particularly those designed in the spatial domain, lack a convincing theoretical or physical motivation. Inspired by the old-fashioned spring network model, we propose the Graph Spring Network (GSN), a universal GNN model that works for both homophilous and heterophilous graphs. We show that the GSN framework can interpret many GNN models from the perspective of potential energy minimization of a spring network with respect to various metrics, which lends strong physical motivation to these models. We also conduct experiments to demonstrate the performance of our GSN model on real-world datasets.
    Single-Shot Optical Neural Network. (arXiv:2205.09103v1 [cs.ET])
    As deep neural networks (DNNs) grow to solve increasingly complex problems, they are becoming limited by the latency and power consumption of existing digital processors. 'Weight-stationary' analog optical and electronic hardware has been proposed to reduce the compute resources required by DNNs by eliminating expensive weight updates; however, its scalability has been limited to an input vector length $K$ of hundreds of elements. Here, we present a scalable, single-shot-per-layer weight-stationary optical processor that leverages the advantages of free-space optics for passive optical copying and large-scale distribution of an input vector, and integrated optoelectronics for static, reconfigurable weighting and the nonlinearity. We propose an optimized near-term CMOS-compatible system with $K = 1,000$ and beyond, and we calculate its theoretical total latency ($\sim$10 ns), energy consumption ($\sim$10 fJ/MAC) and throughput ($\sim$petaMAC/s) per layer. We also experimentally test DNN classification accuracy with single-shot analog optical encoding, copying and weighting of the MNIST handwritten digit dataset in a proof-of-concept system, achieving 94.7% (similar to the ground-truth accuracy of 96.3%) without retraining on the hardware or data preprocessing. Lastly, we determine the upper bound on the throughput of our system ($\sim$0.9 exaMAC/s), set by the maximum optical bandwidth before significant loss of accuracy. This joint use of wide spectral and spatial bandwidths enables highly efficient computing for next-generation DNNs.
    Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement. (arXiv:1810.09103v3 [cs.LG] UPDATED)
    Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy, that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, that uses CCEM for the actor update, performs better than Soft AC and is much less sensitive to entropy-regularization.
    On the Efficiency of Entropic Regularized Algorithms for Optimal Transport. (arXiv:1906.01437v9 [cs.DS] UPDATED)
    We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as \textit{Greenkhorn}, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and helps clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates, as observed by~\citet{Altschuler-2017-Near}. Second, we propose a new algorithm, which we refer to as \textit{APDAMD} and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm~\citep{Dvurechensky-2018-Computational} with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$, in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a \textit{deterministic} accelerated variant of Sinkhorn via the estimate sequence technique and prove a complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that the accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$, and APDAGD and accelerated alternating minimization (AAM)~\citep{Guminov-2021-Combination} in terms of $n$. Finally, we conduct experiments on synthetic and real data, and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.
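    For reference, the plain Sinkhorn iteration that all of these variants accelerate takes only a few lines (a textbook sketch, not the paper's algorithms; Greenkhorn instead updates only the single worst-violating row or column per step):

        import numpy as np

        def sinkhorn(C, r, c, reg=0.1, iters=1000):
            # C: (n, n) cost matrix; r, c: source/target marginals summing to 1.
            K = np.exp(-C / reg)
            u = np.ones_like(r)
            for _ in range(iters):
                v = c / (K.T @ u)     # scale columns to match marginal c
                u = r / (K @ v)       # scale rows to match marginal r
            return u[:, None] * K * v[None, :]   # plan = diag(u) K diag(v)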
    Representation Learning for Content-Sensitive Anomaly Detection in Industrial Networks. (arXiv:2205.08953v1 [cs.LG])
    Using a convGRU-based autoencoder, this thesis proposes a framework to learn spatial-temporal aspects of raw network traffic in an unsupervised and protocol-agnostic manner. The learned representations are used to measure their effect on the results of a subsequent anomaly detection and are compared to the application without the extracted features. The evaluation showed that the anomaly detection could not be effectively enhanced when applied to compressed traffic fragments in the context of network intrusion detection. Yet, the trained autoencoder successfully generates a compressed representation (code) of the network traffic that holds spatial and temporal information. Based on the model's residual loss, the autoencoder is also capable of detecting anomalies by itself. Lastly, an approach to model interpretability (layer-wise relevance propagation, LRP) was investigated in order to identify relevant areas within the raw input data, which is used to enrich alerts generated by an anomaly detection method.
    Meta-Learning Sparse Compression Networks. (arXiv:2205.08957v1 [stat.ML])
    Recent work in Deep Learning has re-imagined the representation of data as functions mapping from a coordinate space to an underlying continuous signal. When such functions are approximated by neural networks, this introduces a compelling alternative to the more common multi-dimensional array representation. Recent work on such Implicit Neural Representations (INRs) has shown that - following careful architecture search - INRs can outperform established compression methods such as JPEG (e.g. Dupont et al., 2021). In this paper, we propose crucial steps towards making such ideas scalable: Firstly, we employ state-of-the-art network sparsification techniques to drastically improve compression. Secondly, we introduce the first method allowing for sparsification to be employed in the inner loop of commonly used Meta-Learning algorithms, drastically improving both compression and the computational cost of learning INRs. The generality of this formalism allows us to present results on diverse data modalities such as images, manifolds, signed distance functions, 3D shapes and scenes, several of which establish new state-of-the-art results.
    A weakly supervised framework for high-resolution crop yield forecasts. (arXiv:2205.09016v1 [cs.LG])
    Predictor inputs and label data for crop yield forecasting are not always available at the same spatial resolution. We propose a deep learning framework that uses high resolution inputs and low resolution labels to produce crop yield forecasts for both spatial levels. The forecasting model is calibrated by weak supervision from low resolution crop area and yield statistics. We evaluated the framework by disaggregating regional yields in Europe from parent statistical regions to sub-regions for five countries (Germany, Spain, France, Hungary, Italy) and two crops (soft wheat and potatoes). Performance of weakly supervised models was compared with linear trend models and Gradient-Boosted Decision Trees (GBDT). Higher resolution crop yield forecasts are useful to policymakers and other stakeholders. Weakly supervised deep learning methods provide a way to produce such forecasts even in the absence of high resolution yield data.
    Medical Deep Learning -- A systematic Meta-Review. (arXiv:2010.14881v5 [eess.IV] UPDATED)
    Deep learning (DL) has remarkably impacted several different scientific disciplines over the last few years. E.g., in image processing and analysis, DL algorithms were able to outperform other cutting-edge methods. Additionally, DL has delivered state-of-the-art results in tasks like autonomous driving, outclassing previous attempts. There are even instances where DL outperformed humans, for example with object recognition and gaming. DL is also showing vast potential in the medical domain. With the collection of large quantities of patient records and data, and a trend towards personalized treatments, there is a great need for automated and reliable processing and analysis of health information. Patient data is not only collected in clinical centers, like hospitals and private practices, but also by mobile healthcare apps and online websites. The abundance of collected patient data and the recent growth in the DL field have resulted in a large increase in research efforts. In Q2/2020, the search engine PubMed already returned over 11,000 results for the search term 'deep learning', and around 90% of these publications are from the last three years. However, even though PubMed represents the largest search engine in the medical field, it does not cover all medical-related publications. Hence, a complete overview of the field of 'medical deep learning' is almost impossible to obtain, and acquiring a full overview of medical sub-fields is becoming increasingly more difficult. Nevertheless, several review and survey articles about medical DL have been published within the last few years. They focus, in general, on specific medical scenarios, like the analysis of medical images containing specific pathologies. With these surveys as a foundation, the aim of this article is to provide the first high-level, systematic meta-review of medical DL surveys.
    The Kernelized Taylor Diagram. (arXiv:2205.08864v1 [stat.ML])
    This paper presents the kernelized Taylor diagram, a graphical framework for visualizing similarities between data populations. The kernelized Taylor diagram builds on the widely used Taylor diagram, which is used to visualize similarities between populations. However, the Taylor diagram has several limitations, such as not capturing non-linear relationships and sensitivity to outliers. To address such limitations, we propose the kernelized Taylor diagram. Our proposed kernelized Taylor diagram is capable of visualizing similarities between populations with minimal assumptions on the data distributions. The kernelized Taylor diagram relates the maximum mean discrepancy and the kernel mean embedding in a single diagram, a construction that, to the best of our knowledge, has not been devised prior to this work. We believe that the kernelized Taylor diagram can be a valuable tool in data visualization.
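    For concreteness, the quantity the diagram is built around can be estimated in a few lines (a plain V-statistic sketch with a Gaussian kernel; the paper's kernel choice may differ):

        import numpy as np

        def mmd_squared(x, y, sigma=1.0):
            # x: (n, d), y: (m, d). Biased estimate of MMD^2 between the two
            # populations under a Gaussian kernel of bandwidth sigma.
            def gram(a, b):
                d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
                return np.exp(-d2 / (2 * sigma ** 2))
            return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()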
    FiLM: Frequency improved Legendre Memory Model for Long-term Time Series Forecasting. (arXiv:2205.08897v1 [cs.LG])
    Recent studies have shown the promising performance of deep learning models (e.g., RNN and Transformer) for long-term time series forecasting. These studies mostly focus on designing deep models to effectively combine historical information for long-term forecasting. However, the question of how to effectively represent historical information for long-term forecasting has not received enough attention, limiting our capacity to exploit powerful deep learning models. The main challenge in time series representation is how to handle the dilemma between accurately preserving historical information and reducing the impact of noisy signals in the past. To this end, we design a \textbf{F}requency \textbf{i}mproved \textbf{L}egendre \textbf{M}emory model, or {\bf FiLM} for short: it introduces Legendre Polynomial projections to preserve historical information accurately and Fourier projections plus low-rank approximation to remove noisy signals. Our empirical studies show that the proposed FiLM improves the accuracy of state-of-the-art models by a significant margin (\textbf{19.2\%}, \textbf{22.6\%}) in multivariate and univariate long-term forecasting, respectively. In addition, the dimensionality reduction introduced by low-rank approximation leads to a dramatic improvement in computational efficiency. We also demonstrate that the representation module developed in this work can be used as a general plug-in to improve the performance of most deep learning modules for long-term forecasting. Code will be released soon.
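    To illustrate the memory-by-projection idea in isolation (a toy numpy sketch; FiLM learns this inside a deep model and adds Fourier-domain denoising plus low-rank approximation):

        import numpy as np

        def legendre_memory(window, degree=8):
            # Project a window of a series onto the first few Legendre
            # polynomials; the coefficients are a compact, denoised summary
            # of the history from which the window can be reconstructed.
            t = np.linspace(-1.0, 1.0, len(window))
            coeffs = np.polynomial.legendre.legfit(t, window, degree)
            recon = np.polynomial.legendre.legval(t, coeffs)
            return coeffs, recon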
    Fast Neural Network based Solving of Partial Differential Equations. (arXiv:2205.08978v1 [cs.LG])
    We present a novel method for using Neural Networks (NNs) to find solutions to a class of Partial Differential Equations (PDEs). Our method builds on recent advances in Neural Radiance Field (NeRF) research and allows a NN to converge to a PDE solution much faster than classic Physics-Informed Neural Network (PINN) approaches.
    Learning Shared Kernel Models: the Shared Kernel EM algorithm. (arXiv:2205.09041v1 [cs.LG])
    Expectation maximisation (EM) is an unsupervised learning method for estimating the parameters of a finite mixture distribution. It works by introducing "hidden" or "latent" variables via Baum's auxiliary function $Q$ that allow the joint data likelihood to be expressed as a product of simple factors. The relevance of EM has increased since the introduction of the variational lower bound (VLB): the VLB differs from Baum's auxiliary function only by the entropy of the PDF of the latent variables $Z$. We first present a rederivation of the standard EM algorithm using data association ideas from the field of multiple target tracking, using $K$-valued scalar data association hypotheses rather than the usual binary indicator vectors. The same method is then applied to a little known but much more general type of supervised EM algorithm for shared kernel models, related to probabilistic radial basis function networks. We address a number of shortcomings in the derivations that have been published previously in this area. In particular, we give theoretically rigorous derivations of (i) the complete data likelihood; (ii) Baum's auxiliary function (the E-step) and (iii) the maximisation (M-step) in the case of Gaussian shared kernel models. The subsequent algorithm, called shared kernel EM (SKEM), is then applied to a digit recognition problem using a novel 7-segment digit representation. Variants of the algorithm that use different numbers of features and different EM algorithm dimensions are compared in terms of mean accuracy and mean IoU. A simplified classifier is proposed that decomposes the joint data PDF as a product of lower order PDFs over non-overlapping subsets of variables. The effect of different numbers of assumed mixture components $K$ is also investigated. High-level source code for the data generation and SKEM algorithm is provided.
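    As a baseline for the E- and M-steps that the paper generalizes, standard EM for a one-dimensional Gaussian mixture reads as follows (a textbook sketch, not the SKEM algorithm itself):

        import numpy as np

        def em_gmm(x, k, iters=100):
            # x: (n,) samples; fit mixture weights pi, means mu, variances var.
            n = len(x)
            mu = np.random.choice(x, k, replace=False)
            var = np.full(k, x.var())
            pi = np.full(k, 1.0 / k)
            for _ in range(iters):
                # E-step: responsibilities r[i, j] = P(component j | x_i).
                dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
                       / np.sqrt(2 * np.pi * var)
                r = dens / dens.sum(axis=1, keepdims=True)
                # M-step: re-estimate parameters from the responsibilities.
                nk = r.sum(axis=0)
                pi = nk / n
                mu = (r * x[:, None]).sum(axis=0) / nk
                var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
            return pi, mu, var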
    Unsupervised Features Ranking via Coalitional Game Theory for Categorical Data. (arXiv:2205.09060v1 [cs.LG])
    Not all real-world data are labeled, and when labels are not available, it is often costly to obtain them. Moreover, as many algorithms suffer from the curse of dimensionality, reducing the features in the data to a smaller set is often of great utility. Unsupervised feature selection aims to reduce the number of features, often using feature importance scores to quantify the relevance of single features to the task at hand. These scores can be based only on the distribution of variables and the quantification of their interactions. The previous literature, mainly investigating anomaly detection and clustering, fails to address the redundancy-elimination issue. We propose an evaluation of correlations among features to compute feature importance scores representing the contribution of single features to explaining the dataset's structure. Based on Coalitional Game Theory, our feature importance scores include a notion of redundancy awareness, making them a tool to achieve redundancy-free feature selection. We show that the resulting feature selection outperforms competing methods in lowering the redundancy rate while maximizing the information contained in the data. We also introduce an approximated version of the algorithm to reduce the complexity of the Shapley values' computation.
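    A hedged sketch of the Monte-Carlo flavour of such a computation; value_fn stands in for the paper's correlation-based payoff of a feature coalition and is an assumption for illustration:

        import numpy as np

        def shapley_importance(features, value_fn, n_perm=200, seed=0):
            # Approximate each feature's Shapley value by averaging its
            # marginal contribution over random orderings of the features.
            rng = np.random.default_rng(seed)
            phi = {f: 0.0 for f in features}
            for _ in range(n_perm):
                order = rng.permutation(features)
                coalition, prev = [], value_fn(frozenset())
                for f in order:
                    coalition.append(f)
                    cur = value_fn(frozenset(coalition))
                    phi[f] += (cur - prev) / n_perm
                    prev = cur
            return phi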
    Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation. (arXiv:2205.09029v1 [stat.ML])
    Continual learning - learning new tasks in sequence while maintaining performance on old tasks - remains particularly challenging for artificial neural networks. Surprisingly, the amount of forgetting does not increase with the dissimilarity between the learned tasks, but appears to be worst in an intermediate similarity regime. In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name Maslow's hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.
    Moderately Supervised Learning: Definition, Framework and Generality. (arXiv:2008.11945v3 [cs.CV] UPDATED)
    Learning with supervision has achieved remarkable success in numerous artificial intelligence (AI) applications. In the current literature, by referring to the properties of the labels prepared for the training data set, learning with supervision is categorized as supervised learning (SL) and weakly supervised learning (WSL). SL concerns the situation where the training data set is assigned with ideal labels, while WSL concerns the situation where the training data set is assigned with non-ideal labels. However, without considering the properties of the transformation from the given labels to learnable targets, the definition of SL is relatively abstract, which conceals some details that can be critical to building the appropriate solutions for specific SL tasks. Thus, it is desirable to reveal these details more concretely. This article attempts to achieve this goal by expanding the categorization of SL and investigating the sub-type that plays the central role in SL. More specifically, taking into consideration the properties of the transformation from the given labels to learnable targets, we first categorize SL into three narrower sub-types. Then we focus on the moderately supervised learning (MSL) sub-type that concerns the situation where the given labels are ideal, but, due to the simplicity in annotation, careful designs are required to transform the given labels into learnable targets. From the perspectives of the definition, framework and generality, we comprehensively illustrate MSL and reveal what details are concealed by the abstractness of the definition of SL. At the same time, the presentation of this paper also serves as a tutorial for AI application engineers, showing how to view a problem to be solved from a mathematician's perspective.
    FedAdapt: Adaptive Offloading for IoT Devices in Federated Learning. (arXiv:2107.04271v5 [cs.DC] UPDATED)
    Applying Federated Learning (FL) on Internet-of-Things devices is necessitated by the large volumes of data they produce and growing concerns of data privacy. However, there are three challenges that need to be addressed to make FL efficient: (i) execution on devices with limited computational capabilities, (ii) accounting for stragglers due to computational heterogeneity of devices, and (iii) adaptation to the changing network bandwidths. This paper presents FedAdapt, an adaptive offloading FL framework to mitigate the aforementioned challenges. FedAdapt accelerates local training in computationally constrained devices by leveraging layer offloading of deep neural networks (DNNs) to servers. Further, FedAdapt adopts reinforcement learning based optimization and clustering to adaptively identify which layers of the DNN should be offloaded for each individual device on to a server to tackle the challenges of computational heterogeneity and changing network bandwidth. Experimental studies are carried out on a lab-based testbed and it is demonstrated that by offloading a DNN from the device to the server FedAdapt reduces the training time of a typical IoT device by over half compared to classic FL. The training time of extreme stragglers and the overall training time can be reduced by up to 57%. Furthermore, with changing network bandwidth, FedAdapt is demonstrated to reduce the training time by up to 40% when compared to classic FL, without sacrificing accuracy.
    SOUL: An Energy-Efficient Unsupervised Online Learning Seizure Detection Classifier. (arXiv:2110.02169v2 [eess.SP] UPDATED)
    Implantable devices that record neural activity and detect seizures have been adopted to issue warnings or trigger neurostimulation to suppress epileptic seizures. Typical seizure detection systems rely on high-accuracy offline-trained machine learning classifiers that require manual retraining when seizure patterns change over long periods of time. For an implantable seizure detection system, a low-power, at-the-edge, online learning algorithm can be employed to dynamically adapt to the neural signal drifts, thereby maintaining high accuracy without external intervention. This work proposes SOUL: Stochastic-gradient-descent-based Online Unsupervised Logistic regression classifier. After an initial offline training phase, continuous online unsupervised classifier updates are applied in situ, which improves sensitivity in patients with drifting seizure features. SOUL was tested on two human electroencephalography (EEG) datasets: the CHB-MIT scalp EEG dataset, and a long (>100 hours) NeuroVista intracranial EEG dataset. It was able to achieve an average sensitivity of 97.5% and 97.9% for the two datasets respectively, at >95% specificity. Sensitivity improved by at most 8.2% on long-term data when compared to a typical seizure detection classifier. SOUL was fabricated in TSMC's 28 nm process occupying 0.1 mm$^2$ and achieves 1.5 nJ/classification energy efficiency, which is at least 24x more efficient than state-of-the-art.
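    The core update SOUL builds on is plain stochastic-gradient logistic regression; the sketch below shows one plausible way to couple it with confidence-gated unsupervised (pseudo-label) updates after deployment. The gating threshold and learning rate are made-up illustration values, not the chip's actual parameters.

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        class OnlineLogisticClassifier:
            """SGD logistic regression that keeps adapting after deployment by
            treating confident predictions as pseudo-labels (illustrative)."""

            def __init__(self, dim, lr=0.01, confidence=0.9):
                self.w = np.zeros(dim)
                self.b = 0.0
                self.lr = lr
                self.confidence = confidence

            def predict_proba(self, x):
                return sigmoid(self.w @ x + self.b)

            def sgd_step(self, x, y):
                err = self.predict_proba(x) - y  # gradient of the log-loss
                self.w -= self.lr * err * x
                self.b -= self.lr * err

            def classify_and_adapt(self, x):
                p = self.predict_proba(x)
                label = int(p > 0.5)
                # Unsupervised in-situ update: only adapt on confident outputs,
                # so the classifier can track slow drifts in seizure features.
                if p > self.confidence or p < 1.0 - self.confidence:
                    self.sgd_step(x, float(label))
                return label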
    SoQal: Selective Oracle Questioning for Consistency Based Active Learning of Cardiac Signals. (arXiv:2004.09557v3 [cs.LG] UPDATED)
    Clinical settings are often characterized by abundant unlabelled data and limited labelled data. This is typically driven by the high burden placed on oracles (e.g., physicians) to provide annotations. One way to mitigate this burden is via active learning (AL), which involves the (a) acquisition and (b) annotation of informative unlabelled instances. Whereas previous work addresses either one of these elements independently, we propose an AL framework that addresses both. For acquisition, we propose Bayesian Active Learning by Consistency (BALC), a sub-framework which perturbs both instances and network parameters and quantifies changes in the network output probability distribution. For annotation, we propose SoQal, a sub-framework that dynamically determines whether, for each acquired unlabelled instance, to request a label from an oracle or to pseudo-label it instead. We show that BALC can outperform state-of-the-art acquisition functions such as BALD, and SoQal outperforms baseline methods even in the presence of a noisy oracle.
    SoK: The Impact of Unlabelled Data in Cyberthreat Detection. (arXiv:2205.08944v1 [cs.CR])
    Machine learning (ML) has become an important paradigm for cyberthreat detection (CTD) in recent years. A substantial research effort has been invested in the development of specialized algorithms for CTD tasks. From the operational perspective, however, the progress of ML-based CTD is hindered by the difficulty in obtaining the large sets of labelled data needed to train ML detectors. A potential solution to this problem is semi-supervised learning (SsL), which combines small labelled datasets with large amounts of unlabelled data. This paper aims to systematize existing work on SsL for CTD and, in particular, to understand the utility of unlabelled data in such systems. To this end, we analyze the cost of labelling in various CTD tasks and develop a formal cost model for SsL in this context. Building on this foundation, we formalize a set of requirements for the evaluation of SsL methods, which elucidates the contribution of unlabelled data. We review the state-of-the-art and observe that no previous work meets such requirements. To address this problem, we propose a framework for assessing the benefits of unlabelled data in SsL. We showcase an application of this framework by performing the first benchmark evaluation that highlights the tradeoffs of 9 existing SsL methods on 9 public datasets. Our findings verify that, in some cases, unlabelled data provides a small, but statistically significant, performance gain. This paper highlights that SsL in CTD has a lot of room for improvement, which should stimulate future research in this field.
    Entity Alignment with Reliable Path Reasoning and Relation-Aware Heterogeneous Graph Transformer. (arXiv:2205.08806v1 [cs.CL])
    Entity Alignment (EA) has attracted widespread attention in both academia and industry; it aims to identify entities with the same meaning across different Knowledge Graphs (KGs). There are substantial multi-step relation paths between entities in KGs, indicating the semantic relations of entities. However, existing methods rarely consider path information because not all natural paths are helpful for EA judgment. In this paper, we propose a more effective entity alignment framework, RPR-RHGT, which integrates relation and path structure information, as well as the heterogeneous information in KGs. Notably, an initial reliable path reasoning algorithm is developed to generate the paths favorable for the EA task from the relation structures of KGs; this is the first algorithm in the literature to successfully use unrestricted path information. In addition, to efficiently capture heterogeneous features in entity neighborhoods, a relation-aware heterogeneous graph transformer is designed to model the relation and path structures of KGs. Extensive experiments on three well-known datasets show RPR-RHGT significantly outperforms 11 state-of-the-art methods, exceeding the best-performing baseline by up to 8.62% on Hits@1. We also show that it outperforms the baselines under different training-set ratios and on harder datasets.
    Automating In-Network Machine Learning. (arXiv:2205.08824v1 [cs.NI])
    Using programmable network devices to aid in-network machine learning has been the focus of significant research. However, most of the research was of a limited scope, providing a proof of concept or describing a closed-source algorithm. To date, no general solution has been provided for mapping machine learning algorithms to programmable network devices. In this paper, we present Planter, an open-source, modular framework for mapping trained machine learning models to programmable devices. Planter supports a wide range of machine learning models, multiple targets and can be easily extended. The evaluation of Planter compares different mapping approaches, and demonstrates the feasibility, performance, and resource efficiency for applications such as anomaly detection, financial transactions, and quality of experience. The results show that Planter-based in-network machine learning algorithms can run at line rate, have a negligible effect on latency, coexist with standard switching functionality, and have no or minor accuracy trade-offs.
    DL4DS -- Deep Learning for empirical DownScaling. (arXiv:2205.08967v1 [cs.LG])
    A common task in Earth Sciences is to infer climate information at local and regional scales from global climate models. Dynamical downscaling requires running expensive numerical models at high resolution, which can be prohibitive due to long model runtimes. On the other hand, statistical downscaling techniques present an alternative approach for learning links between the large- and local-scale climate in a more efficient way. A large number of deep neural network-based approaches for statistical downscaling have been proposed in recent years, mostly based on convolutional architectures developed for computer vision and super-resolution tasks. This paper presents DL4DS, Deep Learning for empirical DownScaling, a Python library that implements a wide variety of state-of-the-art and novel algorithms for downscaling gridded Earth Science data with deep neural networks. DL4DS has been designed with the goal of providing a general framework for training convolutional neural networks with configurable architectures and learning strategies, facilitating robust comparative and ablation studies. We showcase the capabilities of DL4DS on air quality CAMS data over the western Mediterranean area. The DL4DS library can be found in this repository: https://github.com/carlos-gg/dl4ds
    One Explanation to Rule them All -- Ensemble Consistent Explanations. (arXiv:2205.08974v1 [cs.AI])
    Transparency is a major requirement of modern AI-based decision making systems deployed in the real world. A popular approach for achieving transparency is by means of explanations. A wide variety of different explanations have been proposed for single decision making systems. In practice, particularly in complex systems, a set (i.e., an ensemble) of decisions is often used instead of a single decision. Unfortunately, explanation methods for single decision making systems are not easily applicable to ensembles -- i.e., they would yield an ensemble of individual explanations which are not necessarily consistent, hence less useful and more difficult to understand than a single consistent explanation of all observed phenomena. We propose a novel concept for consistently explaining an ensemble of decisions locally with a single explanation -- we introduce a formal concept, as well as a specific implementation using counterfactual explanations.
    Price Interpretability of Prediction Markets: A Convergence Analysis. (arXiv:2205.08913v1 [q-fin.TR])
    Prediction markets have long been known for their prediction accuracy. However, there is still a lack of systematic understanding of how prediction markets aggregate information and why they work so well. This work proposes a multivariate utility (MU)-based mechanism that unifies several existing prediction market-making schemes. Based on this mechanism, we derive convergence results for markets with myopic, risk-averse traders who repeatedly interact with the market maker. We show that the resulting limiting wealth distribution lies on the Pareto efficient frontier defined by all market participants' utilities. With the help of this result, we establish both analytical and numerical results for the limiting price for different market models. We show that the limiting price converges to the geometric mean of agents' beliefs for exponential utility-based markets. For risk measure-based markets, we construct a risk measure family that meets the convergence requirements and show that the limiting price can converge to a weighted power mean of agent beliefs. For markets based on hyperbolic absolute risk aversion (HARA) utilities, we show that the limiting price is also a risk-adjusted weighted power mean of agent beliefs, even though the trading order will affect the aggregation weights. We further propose an approximation scheme for the limiting price under the HARA utility family. We show through numerical experiments that our approximation scheme works well in predicting the convergent prices.
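    As a point of reference for the aggregation results above, the weighted power mean of beliefs $b_1, \dots, b_n$ with weights $w_i$ summing to one is (our notation)
        $$M_\alpha(b; w) \;=\; \Big(\sum_{i=1}^{n} w_i\, b_i^{\alpha}\Big)^{1/\alpha}, \qquad \lim_{\alpha \to 0} M_\alpha(b; w) \;=\; \prod_{i=1}^{n} b_i^{\,w_i},$$
    so the geometric-mean result for exponential-utility markets is the $\alpha \to 0$ case of the power-mean family; which weights and exponents arise from each utility or risk-measure family is what the paper derives.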
    POViT: Vision Transformer for Multi-objective Design and Characterization of Nanophotonic Devices. (arXiv:2205.09045v1 [cs.LG])
    We solve a fundamental challenge in semiconductor IC design: the fast and accurate characterization of nanoscale photonic devices. Much like the fusion between AI and EDA, many efforts have been made to apply DNNs such as convolutional neural networks (CNNs) to prototype and characterize next-gen optoelectronic devices commonly found in photonic integrated circuits (PICs) and LiDAR. These prior works generally strive to predict the quality factor (Q) and modal volume (V) of, for instance, photonic crystals, with ultra-high accuracy and speed. However, state-of-the-art models are still far from being directly applicable in the real world: e.g., the correlation coefficient of V ($V_{coeff}$) is only about 80%, which is much lower than what it takes to generate reliable and reproducible nanophotonic designs. Recently, attention-based transformer models have attracted extensive interest and have been widely used in CV and NLP. In this work, we propose the first-ever Transformer model (POViT) to efficiently design and simulate semiconductor photonic devices with multiple objectives. Unlike the standard Vision Transformer (ViT), we supplied photonic crystals as data input and changed the activation layer from GELU to an absolute-value function (ABS). Our experiments show that POViT significantly exceeds results reported by previous models. The correlation coefficient $V_{coeff}$ increases by over 12% (i.e., to 92.0%) and the prediction error of Q is reduced by an order of magnitude, among several other key metric improvements. Our work has the potential to drive the expansion of EDA to fully automated photonic design. The complete dataset and code will be released to aid researchers working in the interdisciplinary field of physics and computer science.
    Slowly Changing Adversarial Bandit Algorithms are Provably Efficient for Discounted MDPs. (arXiv:2205.09056v1 [cs.LG])
    Reinforcement learning (RL) generalizes bandit problems, with the additional difficulties of a longer planning horizon and an unknown transition kernel. We show that, under some mild assumptions, \textbf{any} slowly changing adversarial bandit algorithm that enjoys near-optimal regret in adversarial bandits can achieve near-optimal (expected) regret in non-episodic discounted MDPs. The slowly changing property required by our generalization is mild; see e.g. (Even-Dar et al. 2009, Neu et al. 2010). We also show, for example, that Exp3 (Auer et al. 2002) is slowly changing and enjoys near-optimal regret in MDPs.
    Multilayer Perceptron Based Stress Evolution Analysis under DC Current Stressing for Multi-segment Wires. (arXiv:2205.09065v1 [cs.LG])
    Electromigration (EM) is one of the major concerns in the reliability analysis of very large scale integration (VLSI) systems due to continuous technology scaling. Accurately predicting the time-to-failure of integrated circuits (ICs) becomes increasingly important for modern IC design. However, traditional methods are often not sufficiently accurate, leading to undesirable over-design, especially in advanced technology nodes. In this paper, we propose an approach using multilayer perceptrons (MLPs) to compute stress evolution in interconnect trees during the void nucleation phase. The availability of a customized trial function for neural network training holds the promise of finding dynamic, mesh-free stress evolution on complex interconnect trees under time-varying temperatures. Specifically, we formulate a new objective function considering the EM-induced coupled partial differential equations (PDEs), boundary conditions (BCs), and initial conditions to enforce the physics-based constraints in the spatial-temporal domain. The proposed model avoids meshing and reduces temporal iterations compared with conventional numerical approaches like FEM. Numerical results confirm its advantages in accuracy and computational performance.
    Multi-disciplinary fairness considerations in machine learning for clinical trials. (arXiv:2205.08875v1 [cs.LG])
    While interest in the application of machine learning to improve healthcare has grown tremendously in recent years, a number of barriers prevent deployment in medical practice. A notable concern is the potential to exacerbate entrenched biases and existing health disparities in society. The area of fairness in machine learning seeks to address these issues of equity; however, appropriate approaches are context-dependent, necessitating domain-specific consideration. We focus on clinical trials, i.e., research studies conducted on humans to evaluate medical treatments. Clinical trials are a relatively under-explored application in machine learning for healthcare, in part due to complex ethical, legal, and regulatory requirements and high costs. Our aim is to provide a multi-disciplinary assessment of how fairness for machine learning fits into the context of clinical trials research and practice. We start by reviewing the current ethical considerations and guidelines for clinical trials and examine their relationship with common definitions of fairness in machine learning. We examine potential sources of unfairness in clinical trials, providing concrete examples, and discuss the role machine learning might play in either mitigating potential biases or exacerbating them when applied without care. Particular focus is given to adaptive clinical trials, which may employ machine learning. Finally, we highlight concepts that require further investigation and development, and emphasize new approaches to fairness that may be relevant to the design of clinical trials.
    Deep Features for CBIR with Scarce Data using Hebbian Learning. (arXiv:2205.08935v1 [cs.CV])
    Features extracted from Deep Neural Networks (DNNs) have proven to be very effective in the context of Content Based Image Retrieval (CBIR). In recent work, biologically inspired \textit{Hebbian} learning algorithms have shown promise for DNN training. In this contribution, we study the performance of such algorithms in the development of feature extractors for CBIR tasks. Specifically, we consider a semi-supervised learning strategy in two steps: first, an unsupervised pre-training stage is performed using Hebbian learning on the image dataset; second, the network is fine-tuned using supervised Stochastic Gradient Descent (SGD) training. For the unsupervised pre-training stage, we explore the nonlinear Hebbian Principal Component Analysis (HPCA) learning rule. For the supervised fine-tuning stage, we consider sample-efficiency scenarios in which the amount of labeled samples is just a small fraction of the whole dataset. Our experimental analysis, conducted on the CIFAR10 and CIFAR100 datasets, shows that, when few labeled samples are available, our Hebbian approach provides relevant improvements compared to various alternative methods.
    One-way Explainability Isn't The Message. (arXiv:2205.08954v1 [cs.LG])
    Recent engineering developments in specialised computational hardware, data-acquisition and storage technology have seen the emergence of Machine Learning (ML) as a powerful form of data analysis with widespread applicability beyond its historical roots in the design of autonomous agents. However -- possibly because of its origins in the development of agents capable of self-discovery -- relatively little attention has been paid to the interaction between people and ML. In this paper we are concerned with the use of ML in automated or semi-automated tools that assist one or more human decision makers. We argue that requirements on both human and machine in this context are significantly different to the use of ML either as part of autonomous agents for self-discovery or as part of statistical data analysis. Our principal position is that the design of such human-machine systems should be driven by repeated, two-way intelligibility of information rather than one-way explainability of the ML-system's recommendations. Iterated rounds of intelligible information exchange, we think, will characterise the kinds of collaboration that will be needed to understand complex phenomena for which neither man nor machine have complete answers. We propose operational principles -- we call them Intelligibility Axioms -- to guide the design of a collaborative decision-support system. The principles are concerned with: (a) what it means for information provided by the human to be intelligible to the ML system; and (b) what it means for an explanation provided by an ML system to be intelligible to a human. Using examples from the literature on the use of ML for drug-design and in medicine, we demonstrate cases where the conditions of the axioms are met. We describe some additional requirements needed for the design of a truly collaborative decision-support system.
    Predicting failure characteristics of structural materials via deep learning based on nondestructive void topology. (arXiv:2205.09075v1 [cond-mat.mtrl-sci])
    Accurate prediction of the failure progression of structural materials is critical for preventing failure-induced accidents. Despite considerable mechanics modeling-based efforts, accurate prediction remains a challenging task in real-world environments due to unexpected damage factors and defect evolutions. Here, we report a novel method for predicting material failure characteristics that uniquely combines nondestructive X-ray computed tomography (X-CT), persistent homology (PH), and deep multimodal learning (DML). The combined method exploits the microstructural defect state at the time of material examination as an input, and outputs the failure-related properties. Our method is demonstrated to be effective using two types of fracture datasets (tensile and fatigue datasets) with ferritic low alloy steel as a representative structural material. The method achieves a mean absolute error (MAE) of 0.09 in predicting the local strain with the tensile dataset and an MAE of 0.14 in predicting the fracture progress with the fatigue dataset. These high accuracies are mainly due to PH processing of the X-CT images, which transforms complex and noisy three-dimensional X-CT images into compact two-dimensional persistence diagrams that preserve key topological features such as the internal void size, density, and distribution. The combined PH and DML processing of 3D X-CT data is our unique approach enabling reliable failure predictions at the time of material examination based on void topology progressions, and the method can be extended to various nondestructive failure tests for practical use.
    On-device modeling of user's social context and familiar places from smartphone-embedded sensor data. (arXiv:2205.08790v1 [cs.LG])
    Context modeling and recognition represent complex tasks that allow mobile and ubiquitous computing applications to adapt to the user's situation. Current solutions mainly focus on limited context information generally processed on centralized architectures, potentially exposing users' personal data to privacy leakage, and missing personalization features. For these reasons, on-device context modeling and recognition represent the current research trend in this area. Among the different information characterizing the user's context in mobile environments, social interactions and visited locations remarkably contribute to the characterization of daily life scenarios. In this paper we propose a novel, unsupervised and lightweight approach to model the user's social context and her locations based on ego networks directly on the user's mobile device. Relying on this model, the system is able to extract high-level and semantic-rich context features from smartphone-embedded sensor data. Specifically, for the social context it exploits data related to both physical and cyber social interactions among users and their devices. As far as location context is concerned, we assume that modeling the familiarity degree of a specific location is more relevant for the user's context than the raw location data, both in terms of GPS coordinates and proximity devices. By using 5 real-world datasets, we assess the structure of the social and location ego networks, provide a semantic evaluation of the proposed models, and evaluate their complexity in terms of mobile computing performance. Finally, we demonstrate the relevance of the extracted features by showing the performance of 3 machine learning algorithms in recognizing daily-life situations, obtaining an improvement of 3% in AUROC, 9% in Precision, and 5% in Recall compared to using only features related to the physical context.
    Policy Distillation with Selective Input Gradient Regularization for Efficient Interpretability. (arXiv:2205.08685v1 [cs.LG])
    Although deep Reinforcement Learning (RL) has proven successful in a wide range of tasks, one challenge it faces is interpretability when applied to real-world problems. Saliency maps are frequently used to provide interpretability for deep neural networks. However, in the RL domain, existing saliency map approaches are either computationally expensive and thus cannot satisfy the real-time requirement of real-world scenarios or cannot produce interpretable saliency maps for RL policies. In this work, we propose an approach of Distillation with selective Input Gradient Regularization (DIGR) which uses policy distillation and input gradient regularization to produce new policies that achieve both high interpretability and computation efficiency in generating saliency maps. Our approach is also found to improve the robustness of RL policies to multiple adversarial attacks. We conduct experiments on three tasks, MiniGrid (Fetch Object), Atari (Breakout) and CARLA Autonomous Driving, to demonstrate the importance and effectiveness of our approach.
    Optimal Adaptive Prediction Intervals for Electricity Load Forecasting in Distribution Systems via Reinforcement Learning. (arXiv:2205.08698v1 [stat.AP])
    Prediction intervals (PIs) offer an effective tool for quantifying the uncertainty of loads in distribution systems. Traditional central PIs cannot adapt well to skewed distributions, and their offline training makes them vulnerable to unforeseen changes in future load patterns. Therefore, we propose an optimal PI estimation approach that is online and adaptive to different data distributions, adaptively determining symmetric or asymmetric probability proportion pairs for quantiles. It relies on the online learning ability of reinforcement learning to integrate the two online tasks, i.e., the adaptive selection of probability proportion pairs and quantile predictions, both of which are modeled by neural networks. As such, the quality of the quantile-formed PI can guide the selection process of optimal probability proportion pairs, which forms a closed loop to improve the quality of PIs. Furthermore, to improve the learning efficiency of quantile forecasts, a prioritized experience replay strategy is proposed for online quantile regression processes. Case studies on both load and net load demonstrate that the proposed method can better adapt to data distribution compared with the online central PIs method. Compared with offline-trained methods, it obtains PIs with better quality and is more robust against concept drift.
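    The building block behind any quantile-pair PI is the standard pinball (quantile) loss; the sketch below shows how a possibly asymmetric probability proportion pair maps to an interval. The RL-based selection of the pair and the neural quantile models are the paper's contribution and are not reproduced here; the example pair is invented.

        import numpy as np

        def pinball_loss(y, q_pred, tau):
            """Standard quantile (pinball) loss at quantile level tau."""
            diff = y - q_pred
            return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

        def interval_from_pair(quantile_fn, x, tau_lo, tau_hi):
            """Form a PI from a probability proportion pair, e.g. the asymmetric
            pair (0.03, 0.93) for a nominal 90% interval on a skewed load."""
            return quantile_fn(x, tau_lo), quantile_fn(x, tau_hi)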
    Customizing ML Predictions for Online Algorithms. (arXiv:2205.08715v1 [cs.LG])
    A popular line of recent research incorporates ML advice in the design of online algorithms to improve their performance in typical instances. These papers treat the ML algorithm as a black-box, and redesign online algorithms to take advantage of ML predictions. In this paper, we ask the complementary question: can we redesign ML algorithms to provide better predictions for online algorithms? We explore this question in the context of the classic rent-or-buy problem, and show that incorporating optimization benchmarks in ML loss functions leads to significantly better performance, while maintaining a worst-case adversarial result when the advice is completely wrong. We support this finding both through theoretical bounds and numerical simulations.
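    For orientation, the classic rent-or-buy (ski rental) setting the paper targets charges 1 per day to rent and b to buy; a well-known way to consume a prediction of the number of ski days is the consistency-robustness strategy of Purohit et al. (2018), sketched below. This is the downstream online algorithm, not this paper's loss-function contribution.

        import math

        def ski_rental_with_prediction(b, predicted_days, lam, actual_days):
            """Rent at 1/day or buy at cost b, given a predicted season length.
            lam in (0, 1] trades consistency (trust the prediction) against
            robustness (hedge in case the prediction is wrong)."""
            if predicted_days >= b:
                buy_day = math.ceil(lam * b)   # prediction says buying pays off
            else:
                buy_day = math.ceil(b / lam)   # prediction says keep renting
            if actual_days >= buy_day:
                return (buy_day - 1) + b       # rented until buy_day, then bought
            return actual_days                 # season ended while still renting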
    No More Pesky Hyperparameters: Offline Hyperparameter Tuning for RL. (arXiv:2205.08716v1 [cs.LG])
    The performance of reinforcement learning (RL) agents is sensitive to the choice of hyperparameters. In real-world settings like robotics or industrial control systems, however, testing different hyperparameter configurations directly on the environment can be financially prohibitive, dangerous, or time consuming. We propose a new approach to tune hyperparameters from offline logs of data, to fully specify the hyperparameters for an RL agent that learns online in the real world. The approach is conceptually simple: we first learn a model of the environment from the offline data, which we call a calibration model, and then simulate learning in the calibration model to identify promising hyperparameters. We identify several criteria to make this strategy effective, and develop an approach that satisfies these criteria. We empirically investigate the method in a variety of settings to identify when it is effective and when it fails.  ( 2 min )
    Hyperparameter Optimization with Neural Network Pruning. (arXiv:2205.08695v1 [cs.CV])
    Since deep learning models are highly dependent on hyperparameters, hyperparameter optimization is essential in developing deep learning model-based applications, even if it takes a long time. As service development using deep learning models has gradually become competitive, many developers highly demand rapid hyperparameter optimization algorithms. To keep pace with this need, researchers are focusing on improving the speed of hyperparameter optimization algorithms. However, the huge time consumption of hyperparameter optimization due to the high computational cost of the deep learning model itself has not been dealt with in depth. To solve this problem, much as surrogate models are used in Bayesian optimization, it is necessary to consider a proxy model for the neural network (N_B) to be used for hyperparameter optimization. Inspired by the main goal of neural network pruning, i.e., high computational cost reduction and performance preservation, we presumed that the neural network (N_P) obtained through neural network pruning would be a good proxy model of N_B. In order to verify our idea, we performed extensive experiments using the CIFAR10, CIFAR100, and TinyImageNet datasets, three generally-used neural networks, and three representative hyperparameter optimization methods. Through these experiments, we verified that N_P can be a good proxy model of N_B for rapid hyperparameter optimization. The proposed hyperparameter optimization framework can reduce the amount of time required by up to 37%.  ( 2 min )
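    A minimal rendering of the proxy-model idea, assuming PyTorch: derive N_P from N_B by magnitude pruning, run the hyperparameter search on the cheaper N_P, and transfer the winning configuration back. The pruning method and ratio here are arbitrary choices for illustration, not the paper's prescription.

        import copy

        import torch.nn as nn
        import torch.nn.utils.prune as prune

        def make_proxy(model: nn.Module, amount: float = 0.5) -> nn.Module:
            """Build a pruned proxy N_P of the base network N_B via
            L1-unstructured pruning of every linear and conv layer."""
            proxy = copy.deepcopy(model)
            for module in proxy.modules():
                if isinstance(module, (nn.Linear, nn.Conv2d)):
                    prune.l1_unstructured(module, name="weight", amount=amount)
            return proxy

        # Hyperparameters are then searched by training make_proxy(model) under
        # each configuration, and only the best one trains the full model.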
    Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling. (arXiv:2205.08766v1 [cs.LG])
    Principled Bayesian deep learning (BDL) does not live up to its potential when we only focus on marginal predictive distributions (marginal predictives). Recent works have highlighted the importance of joint predictives for (Bayesian) sequential decision making from a theoretical and synthetic perspective. We provide additional practical arguments grounded in real-world applications for focusing on joint predictives: we discuss online Bayesian inference, which would allow us to make predictions while taking into account additional data without retraining, and we propose new challenging evaluation settings using active learning and active sampling. These settings are motivated by an examination of marginal and joint predictives, their respective cross-entropies, and their place in offline and online learning. They are more realistic than previously suggested ones, building on work by Wen et al. (2021) and Osband et al. (2022), and focus on evaluating the performance of approximate BNNs in an online supervised setting. Initial experiments, however, raise questions on the feasibility of these ideas in high-dimensional parameter spaces with current BDL inference techniques, and we suggest experiments that might help shed further light on the practicality of current research for these problems. Importantly, our work highlights previously unidentified gaps in current research and the need for better approximate joint predictives.  ( 2 min )
    Spatial-Temporal Interactive Dynamic Graph Convolution Network for Traffic Forecasting. (arXiv:2205.08689v1 [cs.LG])
    Accurate traffic forecasting is essential for smart cities to achieve traffic flow control, route planning, and detection. Although many spatial-temporal methods have been proposed, these methods are deficient in capturing the spatial-temporal dependence of traffic data synchronously. In addition, most of the methods ignore the dynamically changing correlations between road network nodes that arise as traffic data changes. To address the above challenges, we propose a neural network-based Spatial-Temporal Interactive Dynamic Graph Convolutional Network (STIDGCN) for traffic forecasting in this paper. In STIDGCN, we propose an interactive dynamic graph convolution structure, which first divides the sequences at intervals and captures the spatial-temporal dependence of the traffic data simultaneously through an interactive learning strategy for effective long-term prediction. We also propose a novel dynamic graph convolution module consisting of a graph generator and fusion graph convolution. This module uses the input traffic data and a pre-defined graph structure to generate a graph structure, which is fused with a defined adaptive adjacency matrix, thereby filling in the pre-defined graph structure and simulating the generation of dynamic associations between nodes in the road network. Extensive experiments on four real-world traffic flow datasets demonstrate that STIDGCN outperforms the state-of-the-art baselines.  ( 2 min )
    Deep-learned orthogonal basis patterns for fast, noise-robust single-pixel imaging. (arXiv:2205.08736v1 [eess.IV])
    Single-pixel imaging (SPI) is a novel, unconventional method that goes beyond the notion of traditional cameras but can be computationally expensive and slow for real-time applications. Deep learning has been proposed as an alternative approach for solving the SPI reconstruction problem, but a detailed analysis of its performance and generated basis patterns when used for SPI is limited. We present a modified deep convolutional autoencoder network (DCAN) for SPI on 64x64 pixel images with up to 6.25% compression ratio and apply binary and orthogonality regularizers during training. Training a DCAN with these regularizers allows it to learn multiple measurement bases that have combinations of binary or non-binary, and orthogonal or non-orthogonal patterns. We compare the reconstruction quality, orthogonality of the patterns, and robustness to noise of the resulting DCAN models to traditional SPI reconstruction algorithms (such as Total Variation minimization and Fourier Transform). Our DCAN models can be trained to be robust to noise while still having fast enough reconstruction times (~3 ms per frame) to be viable for real-time imaging.  ( 2 min )
    Revisiting PINNs: Generative Adversarial Physics-informed Neural Networks and Point-weighting Method. (arXiv:2205.08754v1 [cs.LG])
    Physics-informed neural networks (PINNs) provide a deep learning framework for numerically solving partial differential equations (PDEs), and have been widely used in a variety of PDE problems. However, there still remain some challenges in the application of PINNs: 1) the mechanism of PINNs is unsuitable for (at least cannot directly exploit) a small set of (usually very few) extra informative samples to refine the networks; and 2) the efficiency of training PINNs often becomes low for some complicated PDEs. In this paper, we propose the generative adversarial physics-informed neural network (GA-PINN), which integrates the generative adversarial (GA) mechanism with the structure of PINNs, to improve the performance of PINNs by exploiting only a small number of exact solutions to the PDEs. Inspired by the weighting strategy of the AdaBoost method, we then introduce a point-weighting (PW) method to improve the training efficiency of PINNs, where the weight of each sample point is adaptively updated at each training iteration. The numerical experiments show that GA-PINNs outperform PINNs on many well-known PDEs and that the PW method also improves the training efficiency of PINNs and GA-PINNs.  ( 2 min )
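    The point-weighting idea admits a compact sketch: weight each collocation point's PDE residual in the loss, then up-weight points with large residuals between iterations, AdaBoost-style. The toy 1-D Poisson residual and the update rule below are our illustration (assuming PyTorch), not the paper's exact scheme.

        import torch

        def weighted_pinn_loss(model, x, weights):
            """Per-point-weighted residual loss for a toy problem u''(x) = sin(x).
            x and weights both have shape (N, 1)."""
            x = x.clone().requires_grad_(True)
            u = model(x)
            du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
            d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
            residual = d2u - torch.sin(x)          # hypothetical source term
            return (weights * residual.pow(2)).mean(), residual

        def update_weights(weights, residual, eta=0.5):
            """AdaBoost-inspired reweighting: emphasize badly-fit points,
            keeping the mean weight equal to one."""
            w = weights * torch.exp(eta * residual.abs().detach())
            return w / w.sum() * w.numel()

        # Usage sketch:
        # net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
        #                           torch.nn.Linear(32, 1))
        # x = torch.linspace(0.0, 1.0, 100).unsqueeze(-1)
        # weights = torch.ones_like(x)
        # loss, residual = weighted_pinn_loss(net, x, weights)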
    CARNet: A Dynamic Autoencoder for Learning Latent Dynamics in Autonomous Driving Tasks. (arXiv:2205.08712v1 [cs.LG])
    Autonomous driving has received a lot of attention in the automotive industry and is often seen as the future of transportation. Passenger vehicles equipped with a wide array of sensors (e.g., cameras, front-facing radars, LiDARs, and IMUs) capable of continuous perception of the environment are becoming increasingly prevalent. These sensors provide a stream of high-dimensional, temporally correlated data that is essential for reliable autonomous driving. An autonomous driving system should effectively use the information collected from the various sensors in order to form an abstract description of the world and maintain situational awareness. Deep learning models, such as autoencoders, can be used for that purpose, as they can learn compact latent representations from a stream of incoming data. However, most autoencoder models process the data independently, without assuming any temporal interdependencies. Thus, there is a need for deep learning models that explicitly consider the temporal dependence of the data in their architecture. This work proposes CARNet, a Combined dynAmic autoencodeR NETwork architecture that utilizes an autoencoder combined with a recurrent neural network to learn the current latent representation and, in addition, also predict future latent representations in the context of autonomous driving. We demonstrate the efficacy of the proposed model in both imitation and reinforcement learning settings using both simulated and real datasets. Our results show that the proposed model outperforms the baseline state-of-the-art model, while having significantly fewer trainable parameters.  ( 2 min )
    QAPPA: Quantization-Aware Power, Performance, and Area Modeling of DNN Accelerators. (arXiv:2205.08648v1 [cs.AR])
    As the machine learning and systems community strives to achieve higher energy-efficiency through custom DNN accelerators and model compression techniques, there is a need for a design space exploration framework that incorporates quantization-aware processing elements into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QAPPA, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can facilitate future research on design space exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, device bandwidth, number of total processing elements in the design, and DNN workloads. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, our proposed lightweight processing elements achieve up to 4.9x more performance per area and energy improvement when compared to an INT16-based implementation.  ( 2 min )
    Evaluation of Transfer Learning for Polish with a Text-to-Text Model. (arXiv:2205.08808v1 [cs.CL])
    We introduce a new benchmark for assessing the quality of text-to-text models for Polish. The benchmark consists of diverse tasks and datasets: the KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering. In particular, since summarization and question answering lack benchmark datasets for the Polish language, we describe their construction and make them publicly available. Additionally, we present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective. Unsupervised denoising pre-training is performed efficiently by initializing the model weights with a multi-lingual T5 (mT5) counterpart. We evaluate the performance of plT5, mT5, Polish BART (plBART), and Polish GPT-2 (papuGaPT2). plT5 scores best on all of these tasks except summarization, where plBART is best. In general (except for summarization), the larger the model, the better the results. The encoder-decoder architectures prove to be better than the decoder-only equivalent.  ( 2 min )
    Probability trees and the value of a single intervention. (arXiv:2205.08779v1 [cs.LG])
    The most fundamental problem in statistical causality is determining causal relationships from limited data. Probability trees, which combine prior causal structures with Bayesian updates, have been suggested as a possible solution. In this work, we quantify the information gain from a single intervention and show that both the anticipated information gain, prior to making an intervention, and the expected gain from an intervention have simple expressions. This results in an active-learning method that simply selects the intervention with the highest anticipated gain, which we illustrate through several examples. Our work demonstrates how probability trees, and Bayesian estimation of their parameters, offer a simple yet viable approach to fast causal induction.  ( 2 min )
    Markov Chain Monte Carlo for Continuous-Time Switching Dynamical Systems. (arXiv:2205.08803v1 [cs.LG])
    Switching dynamical systems are an expressive model class for the analysis of time-series data. Since, as in many fields within the natural and engineering sciences, the systems under study typically evolve continuously in time, it is natural to consider continuous-time model formulations consisting of switching stochastic differential equations governed by an underlying Markov jump process. Inference in these types of models is, however, notoriously difficult, and tractable computational schemes are rare. In this work, we propose a novel inference algorithm utilizing a Markov Chain Monte Carlo approach. The presented Gibbs sampler allows one to efficiently obtain samples from the exact continuous-time posterior processes. Our framework naturally enables Bayesian parameter estimation, and we also include an estimate for the diffusion covariance, which is oftentimes assumed fixed in stochastic differential equation models. We evaluate our framework under the modeling assumption and compare it against an existing variational inference approach.  ( 2 min )
    Accurate Fairness: Improving Individual Fairness without Trading Accuracy. (arXiv:2205.08704v1 [cs.LG])
    Accuracy and fairness are both crucial aspects of trustworthy machine learning. However, in practice, enhancing one aspect may inevitably sacrifice the other. We propose in this paper a new fairness criterion, accurate fairness, to assess whether an individual is treated both accurately and fairly regardless of protected attributes. We further propose new fairness metrics, fair-precision, fair-recall and fair-F1 score, to evaluate the reliability of a machine learning model from the perspective of accurate fairness. Thus, the side effects of enhancing just one of the two aspects, i.e., true bias and false fairness, can be effectively identified with our criterion. We then present a fair Siamese approach for accurate fairness training. To the best of our knowledge, this is the first time that a Siamese approach has been adapted for bias mitigation. Case studies with typical fairness benchmarks demonstrate that our fair Siamese approach can, on average, deliver 17.4% higher individual fairness, an 11.5% higher fair-F1 score, and 4.7% higher accuracy than state-of-the-art bias mitigation techniques. Finally, our approach is applied to mitigate possible service discrimination on a real Ctrip dataset, fairly serving on average 97.9% of customers with different consumption habits who pay the same prices for the same rooms (20.7% more than the original models).  ( 2 min )
    The Solvability of Interpretability Evaluation Metrics. (arXiv:2205.08696v1 [cs.LG])
    Feature attribution methods are popular for explaining neural network predictions, and they are often evaluated on metrics such as comprehensiveness and sufficiency, which are motivated by the principle that more important features -- as judged by the explanation -- should have larger impacts on model prediction. In this paper, we highlight an intriguing property of these metrics: their solvability. Concretely, we can define the problem of optimizing an explanation for a metric and solve it using beam search. This brings up the obvious question: given such solvability, why do we still develop other explainers and then evaluate them on the metric? We present a series of investigations showing that this beam search explainer is generally comparable or favorable to current choices such as LIME and SHAP, suggest rethinking the goals of model interpretability, and identify several directions towards better evaluations of new method proposals.  ( 2 min )
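    The "solvability" claim is concrete: one can search directly for the explanation that maximizes a metric. A generic beam-search sketch over feature subsets is below; score_fn stands for any solvable metric such as comprehensiveness (the drop in model confidence when the selected features are removed) and is assumed, not defined by us.

        import heapq

        def beam_search_explanation(score_fn, n_features, k, beam_width=10):
            """Beam search for a size-k feature subset maximizing score_fn,
            where score_fn(frozenset_of_indices) returns the metric value."""
            beam = [(0.0, frozenset())]
            for _ in range(k):
                candidates = {}
                for _, subset in beam:
                    for f in range(n_features):
                        if f not in subset:
                            grown = subset | {f}
                            if grown not in candidates:
                                candidates[grown] = score_fn(grown)
                top = heapq.nlargest(beam_width, candidates.items(),
                                     key=lambda kv: kv[1])
                beam = [(score, subset) for subset, score in top]
            return max(beam, key=lambda pair: pair[0])  # (best score, subset)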
    A Regression Approach to Learning-Augmented Online Algorithms. (arXiv:2205.08717v1 [cs.LG])
    The emerging field of learning-augmented online algorithms uses ML techniques to predict future input parameters and thereby improve the performance of online algorithms. Since these parameters are, in general, real-valued functions, a natural approach is to use regression techniques to make these predictions. We introduce this approach in this paper, and explore it in the context of a general online search framework that captures classic problems like (generalized) ski rental, bin packing, minimum makespan scheduling, etc. We show nearly tight bounds on the sample complexity of this regression problem, and extend our results to the agnostic setting. From a technical standpoint, we show that the key is to incorporate online optimization benchmarks in the design of the loss function for the regression problem, thereby diverging from the use of off-the-shelf regression tools with standard bounds on statistical error.  ( 2 min )
    Property Unlearning: A Defense Strategy Against Property Inference Attacks. (arXiv:2205.08821v1 [cs.CR])
    During the training of machine learning models, they may store or "learn" more information about the training data than what is actually needed for the prediction or classification task. This is exploited by property inference attacks which aim at extracting statistical properties from the training data of a given model without having access to the training data itself. These properties may include the quality of pictures to identify the camera model, the age distribution to reveal the target audience of a product, or the included host types to refine a malware attack in computer networks. This attack is especially accurate when the attacker has access to all model parameters, i.e., in a white-box scenario. By defending against such attacks, model owners are able to ensure that their training data, associated properties, and thus their intellectual property stays private, even if they deliberately share their models, e.g., to train collaboratively, or if models are leaked. In this paper, we introduce property unlearning, an effective defense mechanism against white-box property inference attacks, independent of the training data type, model task, or number of properties. Property unlearning mitigates property inference attacks by systematically changing the trained weights and biases of a target model such that an adversary cannot extract chosen properties. We empirically evaluate property unlearning on three different data sets, including tabular and image data, and two types of artificial neural networks. Our results show that property unlearning is both efficient and reliable to protect machine learning models against property inference attacks, with a good privacy-utility trade-off. Furthermore, our approach indicates that this mechanism is also effective to unlearn multiple properties.  ( 2 min )
    TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision. (arXiv:2205.08731v1 [cs.LG])
    Nowadays, deep neural networks outperform humans in many tasks. However, if the input distribution drifts away from the one used in training, their performance drops significantly. Recently published research has shown that adapting the model parameters to the test sample can mitigate this performance degradation. In this paper, we therefore propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples. Using the provided prototypes of SwAV and our derived test-time loss, we align the representation of unseen test samples with the self-supervised learned prototypes. We show the success of our method on the common benchmark dataset CIFAR10-C.  ( 2 min )
    It Isn't Sh!tposting, It's My CAT Posting. (arXiv:2205.08710v1 [cs.CV])
    In this paper, we describe a novel architecture which can generate hilarious captions for a given input image. The architecture is split into two halves, i.e. image captioning and hilarious text conversion. The architecture starts with a pre-trained CNN model, VGG16 in this implementation, and applies attention LSTM on it to generate a normal caption. These normal captions are then fed forward to our hilarious text conversion transformer, which converts the text into something hilarious while maintaining the context of the input image. The architecture can also be split into two halves, and the seq2seq transformer alone can be used to generate a hilarious caption from an input sentence. This paper aims to help the everyday user be lazier and more hilarious at the same time by generating captions using CATNet.  ( 2 min )
    Exact Gaussian Processes for Massive Datasets via Non-Stationary Sparsity-Discovering Kernels. (arXiv:2205.09070v1 [stat.ML])
    A Gaussian Process (GP) is a prominent mathematical framework for stochastic function approximation in science and engineering applications. This success is largely attributed to the GP's analytical tractability, robustness, non-parametric structure, and natural inclusion of uncertainty quantification. Unfortunately, the use of exact GPs is prohibitively expensive for large datasets due to their unfavorable numerical complexity of $O(N^3)$ in computation and $O(N^2)$ in storage. All existing methods addressing this issue utilize some form of approximation -- usually considering subsets of the full dataset or finding representative pseudo-points that render the covariance matrix well-structured and sparse. These approximate methods can lead to inaccuracies in function approximations and often limit the user's flexibility in designing expressive kernels. Instead of inducing sparsity via data-point geometry and structure, we propose to take advantage of naturally-occurring sparsity by allowing the kernel to discover -- instead of induce -- sparse structure. The premise of this paper is that GPs, in their most native form, are often naturally sparse, but commonly-used kernels do not allow us to exploit this sparsity. The core concept of exact, and at the same time sparse GPs relies on kernel definitions that provide enough flexibility to learn and encode not only non-zero but also zero covariances. This principle of ultra-flexible, compactly-supported, and non-stationary kernels, combined with HPC and constrained optimization, lets us scale exact GPs well beyond 5 million data points.
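    The mechanism that makes an exact-but-sparse GP possible is a kernel that can genuinely output zero covariance. A stationary, compactly supported (Wendland-type) kernel already illustrates the effect, as sketched below; the paper's kernels are additionally non-stationary and learned, which this sketch does not capture.

        import numpy as np
        from scipy import sparse
        from scipy.spatial import cKDTree

        def compact_kernel_matrix(X, radius=0.3):
            """Sparse covariance from k(r) = (1 - r/radius)^4 (4 r/radius + 1)
            for r < radius and 0 otherwise: only near neighbours interact."""
            tree = cKDTree(X)
            pairs = tree.sparse_distance_matrix(tree, radius).tocoo()
            mask = pairs.data > 0  # keep strictly positive pair distances
            r = pairs.data[mask] / radius
            vals = (1.0 - r) ** 4 * (4.0 * r + 1.0)
            K = sparse.coo_matrix((vals, (pairs.row[mask], pairs.col[mask])),
                                  shape=(len(X), len(X)))
            # Zero self-distances are not stored by the sparse container, so
            # add the unit diagonal (k(0) = 1) back explicitly.
            return (K + sparse.identity(len(X))).tocsr()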
    Bagged Polynomial Regression and Neural Networks. (arXiv:2205.08609v1 [stat.ML])
    Series and polynomial regression are able to approximate the same function classes as neural networks. However, these methods are rarely used in practice, although they offer more interpretability than neural networks. In this paper, we show that a potential reason for this is the slow convergence rate of polynomial regression estimators and propose the use of bagged polynomial regression (BPR) as an attractive alternative to neural networks. Theoretically, we derive new finite sample and asymptotic $L^2$ convergence rates for series estimators. We show that the rates can be improved in smooth settings by splitting the feature space and generating polynomial features separately for each partition. Empirically, we show that our proposed estimator, the BPR, can perform as well as more complex models with more parameters. Our estimator also performs close to state-of-the-art prediction methods in the benchmark MNIST handwritten digit dataset.  ( 2 min )
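    Although the paper derives its own estimator and convergence rates, the basic object is easy to instantiate with off-the-shelf tools; a minimal sketch (our composition, not the authors' code) assuming scikit-learn >= 1.2:

        from sklearn.ensemble import BaggingRegressor
        from sklearn.linear_model import LinearRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures

        # Bagged polynomial regression: bootstrap-aggregate many polynomial
        # least-squares fits (degree and ensemble size are arbitrary here).
        base = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
        bpr = BaggingRegressor(estimator=base, n_estimators=50, max_samples=0.8)
        # bpr.fit(X_train, y_train); y_hat = bpr.predict(X_test)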
    Need is All You Need: Homeostatic Neural Networks Adapt to Concept Shift. (arXiv:2205.08645v1 [cs.LG])
    In living organisms, homeostasis is the natural regulation of internal states aimed at maintaining conditions compatible with life. Typical artificial systems are not equipped with comparable regulatory features. Here, we introduce an artificial neural network that incorporates homeostatic features. Its own computing substrate is placed in a needful and vulnerable relation to the very objects over which it computes. For example, artificial neurons performing classification of MNIST digits or Fashion-MNIST articles of clothing may receive excitatory or inhibitory effects, which alter their own learning rate as a direct result of perceiving and classifying the digits. In this scenario, accurate recognition is desirable to the agent itself because it guides decisions to regulate its vulnerable internal states and functionality. Counterintuitively, the addition of vulnerability to a learner does not necessarily impair its performance. On the contrary, self-regulation in response to vulnerability confers benefits under certain conditions. We show that homeostatic design confers increased adaptability under concept shift, in which the relationships between labels and data change over time, and that the greatest advantages are obtained under the highest rates of shift. This necessitates the rapid un-learning of past associations and the re-learning of new ones. We also demonstrate the superior abilities of homeostatic learners in environments with dynamically changing rates of concept shift. Our homeostatic design exposes the artificial neural network's thinking machinery to the consequences of its own "thoughts", illustrating the advantage of putting one's own "skin in the game" to improve fluid intelligence.  ( 2 min )
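    A toy sketch of the homeostatic idea in plain NumPy, with an assumed perceptron learner and a hand-rolled concept shift; the paper's MNIST setup and neuron-level excitation/inhibition are simplified here to a single plastic learning rate.

```python
# Homeostatic toy: classification outcomes feed back into the learning rate.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(2)
lr = 0.1                                    # plastic, self-regulated state
accs = []
for step in range(4000):
    x = rng.normal(size=2)
    label = 1.0 if x[0] + x[1] > 0 else -1.0
    if step >= 2000:                        # concept shift: labels flip
        label = -label
    correct = (w @ x) * label > 0
    accs.append(correct)
    # homeostasis: being wrong excites plasticity, being right inhibits it
    lr = float(np.clip(lr * (0.99 if correct else 1.05), 0.01, 1.0))
    if not correct:
        w += lr * label * x                 # perceptron-style update
print(np.mean(accs[-500:]))                 # recovery after the shift
```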
    Label-Efficient Self-Supervised Federated Learning for Tackling Data Heterogeneity in Medical Imaging. (arXiv:2205.08576v1 [cs.CV])
    The curation of large-scale medical datasets from multiple institutions, necessary for training deep learning models, is challenged by the difficulty of sharing patient data while preserving privacy. Federated learning (FL), a paradigm that enables privacy-protected collaborative learning among different institutions, is a promising solution to this challenge. However, FL generally suffers from performance deterioration due to heterogeneous data distributions across institutions and the lack of quality labeled data. In this paper, we present a robust and label-efficient self-supervised FL framework for medical image analysis. Specifically, we introduce a novel distributed self-supervised pre-training paradigm into the existing FL pipeline (i.e., pre-training the models directly on the decentralized target task datasets). Built upon the recent success of Vision Transformers, we employ masked image encoding tasks for self-supervised pre-training, to facilitate more effective knowledge transfer to downstream federated models. Extensive empirical results on simulated and real-world medical imaging federated datasets show that self-supervised pre-training largely benefits the robustness of federated models against various degrees of data heterogeneity. Notably, under severe data heterogeneity, our method, without relying on any additional pre-training data, achieves an improvement of 5.06%, 1.53% and 4.58% in test accuracy on retinal, dermatology and chest X-ray classification compared with the supervised baseline with ImageNet pre-training. Moreover, we show that our self-supervised FL algorithm generalizes well to out-of-distribution data and learns federated models more effectively in limited label scenarios, surpassing the supervised baseline by 10.36% and the semi-supervised FL method by 8.3% in test accuracy.  ( 2 min )
    Variational Quantum Compressed Sensing for Joint User and Channel State Acquisition in Grant-Free Device Access Systems. (arXiv:2205.08603v1 [eess.SP])
    This paper introduces a new quantum computing framework integrated with a two-step compressed sensing technique, applied to a joint channel estimation and user identification problem. We propose a variational quantum circuit (VQC) design as a new denoising solution. For a practical grant-free communications system having correlated device activities, variational quantum parameters for Pauli rotation gates in the proposed VQC system are optimized to facilitate the non-linear estimation. Numerical results show that the VQC method can outperform modern compressed sensing techniques using an element-wise denoiser.  ( 2 min )
    Learning Quantum Entanglement Distillation with Noisy Classical Communications. (arXiv:2205.08561v1 [quant-ph])
    Quantum networking relies on the management and exploitation of entanglement. Practical sources of entangled qubits are imperfect, producing mixed quantum states with reduced fidelity with respect to ideal Bell pairs. Therefore, an important primitive for quantum networking is entanglement distillation, whose goal is to enhance the fidelity of entangled qubits through local operations and classical communication (LOCC). Existing distillation protocols assume the availability of ideal, noiseless, communication channels. In this paper, we study the case in which communication takes place over noisy binary symmetric channels. We propose to implement local processing through parameterized quantum circuits (PQCs) that are optimized to maximize the average fidelity, while accounting for communication errors. The introduced approach, Noise Aware-LOCCNet (NA-LOCCNet), is shown to have significant advantages over existing protocols designed for noiseless communications.  ( 2 min )
    Hierarchical Distribution-Aware Testing of Deep Learning. (arXiv:2205.08589v1 [cs.SE])
    With its growing use in safety/security-critical applications, Deep Learning (DL) has raised increasing concerns regarding its dependability. In particular, DL has a notorious problem of lacking robustness. Despite recent efforts made in detecting Adversarial Examples (AEs) via state-of-the-art attacking and testing methods, they are normally input distribution agnostic and/or disregard the perceptual quality of AEs. Consequently, the detected AEs are either irrelevant inputs in the application context or so unnatural/unrealistic that they can be easily noticed by humans. This may lead to a limited effect on improving the DL model's dependability, as the testing budget is likely to be wasted on detecting AEs that are encountered very rarely in its real-life operations. In this paper, we propose a new robustness testing approach for detecting AEs that considers both the input distribution and the perceptual quality of inputs. The two considerations are encoded by a novel hierarchical mechanism. First, at the feature level, the input data distribution is extracted and approximated by data compression techniques and probability density estimators. Such quantified feature-level distributions, together with indicators that are highly correlated with local robustness, are considered in selecting test seeds. Given a test seed, we then develop a two-step genetic algorithm for local test case generation at the pixel level, in which two fitness functions work alternately to control the quality of detected AEs. Finally, extensive experiments confirm that our holistic approach, which considers hierarchical distributions at the feature and pixel levels, is superior to state-of-the-art methods that either disregard any input distribution or only consider a single (non-hierarchical) distribution, in terms of not only the quality of detected AEs but also improving the overall robustness of the DL model under testing.  ( 2 min )
    Frank Wolfe Meets Metric Entropy. (arXiv:2205.08634v1 [stat.ML])
    The Frank-Wolfe algorithm has seen a resurgence in popularity due to its ability to efficiently solve constrained optimization problems in machine learning and high-dimensional statistics. As such, there is much interest in establishing when the algorithm may possess a "linear" $O(\log(1/\epsilon))$ dimension-free iteration complexity comparable to projected gradient descent. In this paper, we provide a general technique for establishing domain specific and easy-to-estimate lower bounds for Frank-Wolfe and its variants using the metric entropy of the domain. Most notably, we show that a dimension-free linear upper bound must fail not only in the worst case, but in the \emph{average case}: for a Gaussian or spherical random polytope in $\mathbb{R}^d$ with $\mathrm{poly}(d)$ vertices, Frank-Wolfe requires up to $\tilde\Omega(d)$ iterations to achieve a $O(1/d)$ error bound, with high probability. We also establish this phenomenon for the nuclear norm ball. The link with metric entropy also has interesting positive implications for conditional gradient algorithms in statistics, such as gradient boosting and matching pursuit. In particular, we show that it is possible to extract fast-decaying upper bounds on the excess risk directly from an analysis of the underlying optimization procedure.  ( 2 min )
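    For reference, the vanilla Frank-Wolfe iteration whose iteration complexity the paper studies, here on the probability simplex with an assumed quadratic objective:

```python
# Frank-Wolfe on the simplex for f(x) = 0.5 * ||Ax - b||^2 (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
x = np.ones(10) / 10                        # start at the barycenter
for t in range(200):
    grad = A.T @ (A @ x - b)
    s = np.zeros(10)
    s[np.argmin(grad)] = 1.0                # LMO: best simplex vertex
    x += 2.0 / (t + 2) * (s - x)            # standard open-loop step size
print(0.5 * np.sum((A @ x - b) ** 2))
```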
    Quantum Transfer Learning for Wi-Fi Sensing. (arXiv:2205.08590v1 [cs.LG])
    Beyond data communications, commercial-off-the-shelf Wi-Fi devices can be used to monitor human activities, track device locomotion, and sense the ambient environment. In particular, spatial beam attributes that are inherently available in the 60-GHz IEEE 802.11ad/ay standards have been shown to be effective in terms of overhead and channel measurement granularity for these indoor sensing tasks. In this paper, we investigate transfer learning to mitigate domain shift in human monitoring tasks when Wi-Fi settings and environments change over time. As a proof-of-concept study, we consider quantum neural networks (QNN) as well as classical deep neural networks (DNN) for the future quantum-ready society. The effectiveness of both DNN and QNN is validated by an in-house experiment for human pose recognition, achieving greater than 90% accuracy with a limited data size.  ( 2 min )
    Classification as Direction Recovery: Improved Guarantees via Scale Invariance. (arXiv:2205.08633v1 [stat.ML])
    Modern algorithms for binary classification rely on an intermediate regression problem for computational tractability. In this paper, we establish a geometric distinction between classification and regression that allows risk in these two settings to be more precisely related. In particular, we note that classification risk depends only on the direction of the regressor, and we take advantage of this scale invariance to improve existing guarantees for how classification risk is bounded by the risk in the intermediate regression problem. Building on these guarantees, our analysis makes it possible to compare algorithms more accurately against each other and suggests viewing classification as unique from regression rather than a byproduct of it. While regression aims to converge toward the conditional expectation function in location, we propose that classification should instead aim to recover its direction.  ( 2 min )
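    The scale-invariance observation is easy to verify numerically: rescaling a regressor by any positive constant leaves the induced classifier's 0-1 risk unchanged (toy data assumed).

```python
# Numeric check: 0-1 risk depends only on the regressor's direction.
import numpy as np

rng = np.random.default_rng(0)
X, w = rng.normal(size=(1000, 5)), rng.normal(size=5)
y = np.sign(X @ w + 0.5 * rng.normal(size=1000))
f = X @ (w + 0.3 * rng.normal(size=5))      # an imperfect regressor
for c in (0.01, 1.0, 100.0):
    print(c, np.mean(np.sign(c * f) != y))  # identical risk for every c > 0
```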
    Strategizing against Learners in Bayesian Games. (arXiv:2205.08562v1 [cs.LG])
    We study repeated two-player games where one of the players, the learner, employs a no-regret learning strategy, while the other, the optimizer, is a rational utility maximizer. We consider general Bayesian games, where the payoffs of both the optimizer and the learner could depend on the type, which is drawn from a publicly known distribution, but revealed privately to the learner. We address the following questions: (a) what is the bare minimum that the optimizer can guarantee to obtain regardless of the no-regret learning algorithm employed by the learner? (b) are there learning algorithms that cap the optimizer payoff at this minimum? (c) can these algorithms be implemented efficiently? While building this theory of optimizer-learner interactions, we define a new combinatorial notion of regret called polytope swap regret, that could be of independent interest in other settings.  ( 2 min )
    All-Photonic Artificial Neural Network Processor Via Non-linear Optics. (arXiv:2205.08608v1 [physics.optics])
    Optics and photonics have recently captured interest as a platform to accelerate linear matrix processing, which has been deemed a bottleneck in traditional digital electronic architectures. In this paper, we propose an all-photonic artificial neural network processor wherein information is encoded in the amplitudes of frequency modes that act as neurons. The weights among connected layers are encoded in the amplitudes of controlled frequency modes that act as pumps. Interaction among these modes for information processing is enabled by non-linear optical processes. Both the matrix multiplication and element-wise activation functions are performed through coherent processes, enabling the direct representation of negative and complex numbers without the use of detectors or digital electronics. Via numerical simulations, we show that our design achieves a performance commensurate with present-day state-of-the-art computational networks on image-classification benchmarks. Our architecture is unique in providing a completely unitary, reversible mode of computation. Additionally, the computational speed increases with the power of the pumps to arbitrarily high rates, as long as the circuitry can sustain the higher optical power.  ( 2 min )
    Multibit Tries Packet Classification with Deep Reinforcement Learning. (arXiv:2205.08606v1 [cs.NI])
    High-performance packet classification is a key component to support scalable network applications like firewalls, intrusion detection, and differentiated services. With ever-increasing line rates in core networks, it becomes a great challenge to design a scalable and high-performance packet classification solution using hand-tuned heuristics. In this paper, we present a scalable learning-based packet classification engine and its performance evaluation. By exploiting the sparsity of the ruleset, our algorithm uses a few effective bits (EBs) to extract a large number of candidate rules with just a few memory accesses. These effective bits are learned with deep reinforcement learning and are used to create a bitmap that filters out the majority of rules which do not need to be fully matched, improving online system performance. Moreover, our EB learning-based selection method is independent of the ruleset and can be applied to varying rulesets. Our multibit-tries classification engine reduces lookup time by 55% in both the worst and average case and reduces the memory footprint, compared to a traditional decision tree without EBs.  ( 2 min )
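    A small sketch of EB-based candidate filtering; the bit positions are fixed by hand here (the paper learns them with deep reinforcement learning), and the rule format is a simplified (value, care-mask) pair rather than a production 5-tuple.

```python
# Effective-bit filtering sketch: a few bit positions bucket candidate rules.
from collections import defaultdict
from itertools import product

EBS = [3, 7, 12]                    # assumed bit positions; learned via RL in the paper

def eb_key(x):
    return tuple((x >> b) & 1 for b in EBS)

def build_index(rules):
    # rules: list of (value, care_mask); a 0 bit in care_mask is a wildcard
    index = defaultdict(list)
    for rid, (val, mask) in enumerate(rules):
        free = [b for b in EBS if not (mask >> b) & 1]
        for bits in product([0, 1], repeat=len(free)):  # expand wildcarded EBs
            v = val
            for b, bit in zip(free, bits):
                v = (v & ~(1 << b)) | (bit << b)
            index[eb_key(v)].append(rid)
    return index

def candidates(index, packet_key):
    return index.get(eb_key(packet_key), [])   # only these need a full match

rules = [(0b1010_0000_1000, 0xFFF), (0b0000_0000_0000, 0x00F)]
idx = build_index(rules)
print(candidates(idx, 0b1010_0000_1000))       # -> [0]
```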
    A graph representation of molecular ensembles for polymer property prediction. (arXiv:2205.08619v1 [cs.LG])
    Synthetic polymers are versatile and widely used materials. Similar to small organic molecules, a large chemical space of such materials is hypothetically accessible. Computational property prediction and virtual screening can accelerate polymer design by prioritizing candidates expected to have favorable properties. However, in contrast to organic molecules, polymers are often not well-defined single structures but an ensemble of similar molecules, which poses unique challenges to traditional chemical representations and machine learning approaches. Here, we introduce a graph representation of molecular ensembles and an associated graph neural network architecture that is tailored to polymer property prediction. We demonstrate that this approach captures critical features of polymeric materials, like chain architecture, monomer stoichiometry, and degree of polymerization, and achieves superior accuracy to off-the-shelf cheminformatics methodologies. While doing so, we built a dataset of simulated electron affinity and ionization potential values for >40k polymers with varying monomer composition, stoichiometry, and chain architecture, which may be used in the development of other tailored machine learning approaches. The dataset and machine learning models presented in this work pave the path toward new classes of algorithms for polymer informatics and, more broadly, introduce a framework for the modeling of molecular ensembles.  ( 2 min )
    Generic and Trend-aware Curriculum Learning for Relation Extraction in Graph Neural Networks. (arXiv:2205.08625v1 [cs.CL])
    We present a generic and trend-aware curriculum learning approach for graph neural networks. It extends existing approaches by incorporating sample-level loss trends to better discriminate easier from harder samples and schedule them for training. The model effectively integrates textual and structural information for relation extraction in text graphs. Experimental results show that the model provides robust estimations of sample difficulty and shows sizable improvement over the state-of-the-art approaches across several datasets.  ( 2 min )
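    A schematic of loss-trend-based scheduling in plain Python; the trend statistic, window, and threshold are assumptions meant only to illustrate ranking samples by how their losses evolve.

```python
# Trend-aware difficulty: rank samples by the recent slope of their losses.
import numpy as np

history = {}                         # sample id -> recorded losses per epoch

def record(sample_id, loss):
    history.setdefault(sample_id, []).append(loss)

def difficulty(sample_id, window=3):
    h = history.get(sample_id, [])
    if len(h) < 2:
        return float("inf")          # unseen samples count as hard
    h = h[-window:]
    return np.polyfit(range(len(h)), h, 1)[0]   # loss slope; falling => easier

def schedule(sample_ids, keep_fraction):
    ranked = sorted(sample_ids, key=difficulty)  # easiest first (curriculum)
    return ranked[: max(1, int(keep_fraction * len(ranked)))]

for epoch_losses in ([0.9, 0.8, 0.3], [0.8, 0.9, 0.2]):
    for sid, loss in enumerate(epoch_losses):
        record(sid, loss)
print(schedule([0, 1, 2], keep_fraction=0.67))   # -> the two improving samples
```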
    Learning to Learn Quantum Turbo Detection. (arXiv:2205.08611v1 [eess.SP])
    This paper investigates a turbo receiver employing a variational quantum circuit (VQC). The VQC is configured with an ansatz of the quantum approximate optimization algorithm (QAOA). We propose a 'learning to learn' (L2L) framework to optimize the turbo VQC decoder such that high fidelity soft-decision output is generated. Besides demonstrating the proposed algorithm's computational complexity, we show that the L2L VQC turbo decoder can achieve an excellent performance close to the optimal maximum-likelihood performance in a multiple-input multiple-output system.  ( 2 min )
    Deep Neural Network Classifier for Multi-dimensional Functional Data. (arXiv:2205.08592v1 [stat.ML])
    We propose a new approach, called functional deep neural network (FDNN), for classifying multi-dimensional functional data. Specifically, a deep neural network is trained on the principal components of the training data, which are then used to predict the class label of a future data function. Unlike popular functional discriminant analysis approaches, which rely on a Gaussian assumption, the proposed FDNN approach applies to general non-Gaussian multi-dimensional functional data. Moreover, when the log density ratio possesses a locally connected functional modular structure, we show that FDNN achieves minimax optimality. The superiority of our approach is demonstrated through both simulated and real-world datasets.  ( 2 min )
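    A minimal sketch of the FDNN recipe, assuming discretized curves on a common grid, ordinary PCA as a stand-in for functional principal components, and an sklearn MLP as the deep classifier; the toy data are invented.

```python
# FDNN-style sketch: PCA scores of discretized curves feed a deep classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)                       # common evaluation grid
X0 = np.sin(2 * np.pi * t) + 0.3 * rng.normal(size=(200, 100))
X1 = np.sin(2 * np.pi * t + 0.5) + 0.3 * rng.normal(size=(200, 100))
X = np.vstack([X0, X1])
y = np.r_[np.zeros(200), np.ones(200)]

fdnn = make_pipeline(PCA(n_components=10),
                     MLPClassifier(hidden_layer_sizes=(64, 64),
                                   max_iter=500, random_state=0))
fdnn.fit(X, y)
print(fdnn.score(X, y))
```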
    CV4Code: Sourcecode Understanding via Visual Code Representations. (arXiv:2205.08585v1 [cs.SE])
    We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that facilitates fast generation of sourcecode images and eliminates redundancy in the encoding that would arise from an RGB pixel representation. Furthermore, as sourcecode is treated as images, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to any particular programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code which is not possible from methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by learning Convolutional and Transformer networks to predict the functional task, i.e. the problem it solves, of the source code directly from its two-dimensional representation, and using an embedding from its latent space to derive a similarity score of two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time we show the benefits of treating sourcecode understanding as a form of image processing task.  ( 2 min )
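    The codepoint-image idea is compact enough to sketch directly; the canvas size and the clipping rule for non-ASCII characters below are assumptions, not the paper's exact preprocessing.

```python
# ASCII-codepoint "image" of a snippet: one row per line, pixel = codepoint.
import numpy as np

def code_to_image(src: str, height: int = 32, width: int = 80) -> np.ndarray:
    img = np.zeros((height, width), dtype=np.uint8)      # 0 = padding
    for r, line in enumerate(src.splitlines()[:height]):
        for c, ch in enumerate(line[:width]):
            img[r, c] = min(ord(ch), 127)                # clip non-ASCII
    return img

snippet = "def add(a, b):\n    return a + b"
print(code_to_image(snippet).shape)                      # (32, 80), CNN-ready
```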
    OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval. (arXiv:2205.08605v1 [cs.CL])
    Aligning parallel sentences in multilingual corpora is essential to curating data for downstream applications such as Machine Translation. In this work, we present OneAligner, an alignment model specially designed for sentence retrieval tasks. This model is able to train on only one language pair and transfer, in a cross-lingual fashion, to low-resource language pairs with negligible degradation in performance. When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result on the Tatoeba dataset, outperforming an equally-sized previous model by 8.0 points in accuracy while using less than 0.6% of their parallel data. When finetuned on a single rich-resource language pair, be it English-centered or not, our model is able to match the performance of the ones finetuned on all language pairs under the same data budget with less than 2.0 points decrease in accuracy. Furthermore, with the same setup, scaling up the number of rich-resource language pairs monotonically improves the performance, reaching a minimum of 0.4 points discrepancy in accuracy, making it less mandatory to collect any low-resource parallel data. Finally, we conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size, up to a certain size threshold, rather than on what language pairs are used for training or evaluation.  ( 2 min )
    The Power of Reuse: A Multi-Scale Transformer Model for Structural Dynamic Segmentation in Symbolic Music Generation. (arXiv:2205.08579v1 [cs.SD])
    Symbolic Music Generation relies on the contextual representation capabilities of the generative model, where the most prevalent approach is the Transformer-based model. Moreover, learning long-term context is related to the dynamic segmentation of musical structures, i.e., intro, verse and chorus, which is currently overlooked by the research community. In this paper, we propose a multi-scale Transformer, which uses a coarse decoder and fine decoders to model the contexts at the global and section levels, respectively. Concretely, we designed a Fragment Scope Localization layer to segment the music into sections, which are later used to pre-train the fine decoders. After that, we designed a Music Style Normalization layer to transfer the style information from the original sections to the generated sections to achieve consistency in music style. The generated sections are combined in the aggregation layer and fine-tuned by the coarse decoder. Our model is evaluated on two open MIDI datasets, and experiments show that our model outperforms the best contemporary symbolic music generative models. More excitingly, visual evaluation shows that our model is superior in melody reuse, resulting in more realistic music.  ( 2 min )
    Universal characteristics of deep neural network loss surfaces from random matrix theory. (arXiv:2205.08601v1 [math-ph])
    This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local statistics to derive practical implications for deep neural networks based on a realistic model of their Hessians. In particular we derive universal aspects of outliers in the spectra of deep neural networks and demonstrate the important role of random matrix local laws in popular pre-conditioning gradient descent algorithms. We also present insights into deep neural network loss surfaces from quite general arguments based on tools from statistical physics and random matrix theory.  ( 2 min )
  • Open

    A Unified Linear Speedup Analysis of Stochastic FedAvg and Nesterov Accelerated FedAvg. (arXiv:2007.05690v3 [cs.LG] UPDATED)
    Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-i.i.d. data across the network, low device participation, high communication costs, and the mandate that data remain private bring challenges in understanding the convergence of FL algorithms, particularly with regard to how convergence scales with the number of participating devices. In this paper, we focus on Federated Averaging (FedAvg)--arguably the most popular and effective FL algorithm class in use today--and provide a unified and comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, a systematic study of how FedAvg's convergence scales with the number of participating devices in the fully heterogeneous FL setting is lacking--a crucial issue whose answer would shed light on the performance of FedAvg in large FL systems in practice. We fill this gap by providing a unified analysis that establishes convergence guarantees for FedAvg under strongly convex smooth, convex smooth problems, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates and communication efficiencies. While there have been linear speedup results from distributed optimization that assume full participation, ours are the first to establish linear speedup for FedAvg under both statistical and system heterogeneity. For strongly convex and convex problems, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm, which are the first linear speedup guarantees for momentum variants of FedAvg in convex settings. Empirical studies of the algorithms in various settings have supported our theoretical results.
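    For orientation, a minimal FedAvg round on a toy least-squares problem; the local step counts, learning rate, and heterogeneous client data are assumptions for illustration only.

```python
# One FedAvg round (sketch): local SGD on each client, size-weighted average.
import numpy as np

def local_sgd(w, X, y, lr=0.02, steps=10):
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)   # local least-squares gradient
    return w

def fedavg_round(w, clients):
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    locals_ = np.stack([local_sgd(w, X, y) for X, y in clients])
    return (sizes[:, None] * locals_).sum(0) / sizes.sum()

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
clients = []
for _ in range(10):                                # heterogeneous local data
    X = rng.normal(loc=0.5 * rng.normal(), size=(int(rng.integers(20, 80)), 5))
    clients.append((X, X @ w_true))
w = np.zeros(5)
for _ in range(50):                                # communication rounds
    w = fedavg_round(w, clients)
print(np.linalg.norm(w - w_true))                  # approaches zero
```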
    Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation. (arXiv:2205.09029v1 [stat.ML])
    Continual learning - learning new tasks in sequence while maintaining performance on old tasks - remains particularly challenging for artificial neural networks. Surprisingly, the amount of forgetting does not increase with the dissimilarity between the learned tasks, but appears to be worst in an intermediate similarity regime. In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name Maslow's hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.
    Dependent Latent Class Models. (arXiv:2205.08677v1 [stat.ML])
    Latent Class Models (LCMs) are used to cluster multivariate categorical data (e.g. group participants based on survey responses). Traditional LCMs assume a property called conditional independence. This assumption can be restrictive, leading to model misspecification and overparameterization. To combat this problem, we developed a novel Bayesian model called a Dependent Latent Class Model (DLCM), which permits conditional dependence. We verify identifiability of DLCMs. We also demonstrate the effectiveness of DLCMs in both simulations and real-world applications. Compared to traditional LCMs, DLCMs are effective in applications with time series, overlapping items, and structural zeroes.
    Bayesian Discrete Conditional Transformation Models. (arXiv:2205.08594v1 [stat.ME])
    We propose a novel Bayesian model framework for discrete ordinal and count data based on conditional transformations of the responses. The conditional transformation function is estimated from the data in conjunction with an a priori chosen reference distribution. For count responses, the resulting transformation model is novel in the sense that it is a Bayesian fully parametric yet distribution-free approach that can additionally account for excess zeros with additive transformation function specifications. For ordinal categorical responses, our cumulative link transformation model allows the inclusion of linear and nonlinear covariate effects that can additionally be made category-specific, resulting in (non-)proportional odds or hazards models and more, depending on the choice of the reference distribution. Inference is conducted by a generic modular Markov chain Monte Carlo algorithm where multivariate Gaussian priors enforce specific properties such as smoothness on the functional effects. To illustrate the versatility of Bayesian discrete conditional transformation models, we present applications to counts of patent citations in the presence of excess zeros and to forest health categories in a discrete partial proportional odds model.
    Conformalized Online Learning: Online Calibration Without a Holdout Set. (arXiv:2205.09095v1 [cs.LG])
    We develop a framework for constructing uncertainty sets with a valid coverage guarantee in an online setting, in which the underlying data distribution can drastically -- and even adversarially -- shift over time. The technique we propose is highly flexible as it can be integrated with any online learning algorithm, requiring minimal implementation effort and computational cost. A key advantage of our method over existing alternatives -- which also build on conformal inference -- is that we do not need to split the data into training and holdout calibration sets. This allows us to fit the predictive model in a fully online manner, utilizing the most recent observation for constructing calibrated uncertainty sets. Consequently, and in contrast with existing techniques, (i) the sets we build can quickly adapt to new changes in the distribution; and (ii) our procedure does not require refitting the model at each time step. Using synthetic and real-world benchmark data sets, we demonstrate the validity of our theory and the improved performance of our proposal over existing techniques. To demonstrate the greater flexibility of the proposed method, we show how to construct valid intervals for a multiple-output regression problem that previous sequential calibration methods cannot handle due to impractical computational and memory requirements.
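    A minimal sketch of the online-calibration mechanics in the spirit of adaptive conformal methods (the paper's exact procedure differs): the working miscoverage level is nudged after every observation, with no holdout split.

```python
# Online interval calibration without a holdout set (schematic).
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma, target = 0.1, 0.02, 0.1
residuals, covered = [], []
for t in range(3000):
    y = rng.normal() * (2.0 if t > 1500 else 1.0)        # distribution shift
    if len(residuals) > 20:
        q = np.quantile(residuals, 1 - np.clip(alpha, 0.001, 0.999))
    else:
        q = 10.0                                          # burn-in: wide interval
    hit = abs(y) <= q                                     # point prediction is 0
    covered.append(hit)
    alpha += gamma * (target - (0.0 if hit else 1.0))     # widen after misses
    residuals.append(abs(y))                              # calibrate fully online
print(np.mean(covered[-1000:]))                           # ~0.9 coverage
```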
    Dynamic Predictions of Postoperative Complications from Explainable, Uncertainty-Aware, and Multi-Task Deep Neural Networks. (arXiv:2004.12551v2 [cs.LG] UPDATED)
    Accurate prediction of postoperative complications can inform shared decisions regarding prognosis, preoperative risk-reduction, and postoperative resource use. We hypothesized that multi-task deep learning models would outperform random forest models in predicting postoperative complications, and that integrating high-resolution intraoperative physiological time series would result in more granular and personalized health representations that would improve prognostication compared to preoperative predictions. In a longitudinal cohort study of 56,242 patients undergoing 67,481 inpatient surgical procedures at a university medical center, we compared deep learning models with random forests for predicting nine common postoperative complications using preoperative, intraoperative, and perioperative patient data. Our study indicated several significant results across experimental settings that suggest the utility of deep learning for capturing more precise representations of patient health for augmented surgical decision support. Multi-task learning improved efficiency by reducing computational resources without compromising predictive performance. Integrated gradients interpretability mechanisms identified potentially modifiable risk factors for each complication. Monte Carlo dropout methods provided a quantitative measure of prediction uncertainty that has the potential to enhance clinical trust. Multi-task learning, interpretability mechanisms, and uncertainty metrics demonstrated potential to facilitate effective clinical implementation.
    A simple yet effective baseline for non-attributed graph classification. (arXiv:1811.03508v3 [cs.LG] UPDATED)
    Graphs are complex objects that do not lend themselves easily to typical learning tasks. Recently, a range of approaches based on graph kernels or graph neural networks have been developed for graph classification and for representation learning on graphs in general. As the developed methodologies become more sophisticated, it is important to understand which components of the increasingly complex methods are necessary or most effective. As a first step, we develop a simple yet meaningful graph representation, and explore its effectiveness in graph classification. We test our baseline representation for the graph classification task on a range of graph datasets. Interestingly, this simple representation achieves similar performance as the state-of-the-art graph kernels and graph neural networks for non-attributed graph classification. Its performance on classifying attributed graphs is slightly weaker as it does not incorporate attributes. However, given its simplicity and efficiency, we believe that it still serves as an effective baseline for attributed graph classification. Our graph representation is efficient (linear-time) to compute. We also provide a simple connection with the graph neural networks. Note that these observations are only for the task of graph classification while existing methods are often designed for a broader scope including node embedding and link prediction. The results are also likely biased due to the limited amount of benchmark datasets available. Nevertheless, the good performance of our simple baseline calls for the development of new, more comprehensive benchmark datasets so as to better evaluate and analyze different graph learning methods. Furthermore, given the computational efficiency of our graph summary, we believe that it is a good candidate as a baseline method for future graph classification (or even other graph learning) studies.
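    One concrete instance of such a simple, linear-time summary, assuming a normalized degree histogram (the paper's exact representation may differ) and using networkx for illustration:

```python
# Degree-histogram baseline for non-attributed graph classification (sketch).
import numpy as np
import networkx as nx

def degree_histogram(G, bins=20, max_deg=50):
    degs = [d for _, d in G.degree()]
    hist, _ = np.histogram(degs, bins=bins, range=(0, max_deg))
    return hist / max(1, len(degs))          # fixed-length, size-normalized

G = nx.erdos_renyi_graph(100, 0.05, seed=0)
x = degree_histogram(G)                      # feed into any standard classifier
print(x.round(2))
```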
    Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias. (arXiv:2112.06868v2 [cs.LG] UPDATED)
    Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance which is correctly supported on the ground truth manifold. They gave partial support for that conjecture by showing that some optima of the VAE loss do satisfy this property, but did not analyze the training dynamics. In this paper, we show that for linear encoders/decoders, the conjecture is true: VAE training does recover a generator with support equal to the ground truth manifold, and does so due to an implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that VAE training frequently learns a higher-dimensional manifold which is a superset of the ground truth manifold.
    Doubly Robust Collaborative Targeted Learning for Debiased Recommendations. (arXiv:2203.10258v2 [cs.IR] UPDATED)
    In recommender systems, the collected data always contains various biases and leads to the challenge of accurate predictions. To address selection bias and confounding bias, the doubly robust (DR) method and its variants show superior performance due to the double robustness property and smaller bias under inaccurate propensity and error imputation models. However, we theoretically show that the variance of the error imputation-based (EIB) method is much smaller than that of DR, although EIB may suffer from a much larger bias. In this paper, we propose a doubly robust targeted learning method that effectively combines the small-bias property of DR and the small-variance property of EIB, by leveraging the targeted maximum likelihood estimation technique. Theoretical analysis shows that the proposed targeted learning is effective in reducing the variance of DR while maintaining double robustness. To further reduce the bias and variance during the training process, we propose a novel collaborative targeted learning approach that decomposes imputed errors into parametric and nonparametric parts and updates them collaboratively, resulting in more accurate predictions. Both theoretical analysis and experiments demonstrate the superiority of the proposed methods compared with existing debiasing methods.
    Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling. (arXiv:2205.08766v1 [cs.LG])
    Principled Bayesian deep learning (BDL) does not live up to its potential when we only focus on marginal predictive distributions (marginal predictives). Recent works have highlighted the importance of joint predictives for (Bayesian) sequential decision making from a theoretical and synthetic perspective. We provide additional practical arguments grounded in real-world applications for focusing on joint predictives: we discuss online Bayesian inference, which would allow us to make predictions while taking into account additional data without retraining, and we propose new challenging evaluation settings using active learning and active sampling. These settings are motivated by an examination of marginal and joint predictives, their respective cross-entropies, and their place in offline and online learning. They are more realistic than previously suggested ones, building on work by Wen et al. (2021) and Osband et al. (2022), and focus on evaluating the performance of approximate BNNs in an online supervised setting. Initial experiments, however, raise questions on the feasibility of these ideas in high-dimensional parameter spaces with current BDL inference techniques, and we suggest experiments that might help shed further light on the practicality of current research for these problems. Importantly, our work highlights previously unidentified gaps in current research and the need for better approximate joint predictives.
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v1 [cs.LG])
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
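    The quantity in question is easy to compute exactly for a shallow univariate ReLU network, since the slope between consecutive breakpoints is a closed-form sum over active units; random weights and tolerances below are assumptions.

```python
# Count effective linear regions of x -> sum_i v_i * relu(w_i * x + b_i).
import numpy as np

rng = np.random.default_rng(0)
n = 50
w, b, v = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)

def slope(x):                                    # derivative between breakpoints
    return v @ (w * ((w * x + b) > 0))

breaks = np.sort(-b / w)                         # candidate kinks, one per unit
mids = np.concatenate(([breaks[0] - 1.0],
                       (breaks[:-1] + breaks[1:]) / 2,
                       [breaks[-1] + 1.0]))
slopes = np.array([slope(m) for m in mids])
regions = 1 + int(np.sum(np.abs(np.diff(slopes)) > 1e-9))
print(regions)                                   # at most n + 1 = 51
```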
    A Central Limit Theorem, Loss Aversion and Multi-Armed Bandits. (arXiv:2106.05472v2 [math.PR] UPDATED)
    This paper studies a multi-armed bandit problem where the decision-maker is loss averse, in particular she is risk averse in the domain of gains and risk loving in the domain of losses. The focus is on large horizons. Consequences of loss aversion for asymptotic (large horizon) properties are derived in a number of analytical results. The analysis is based on a new central limit theorem for a set of measures under which conditional variances can vary in a largely unstructured history-dependent way subject only to the restriction that they lie in a fixed interval.
    GPU-accelerated partially linear multiuser detection for 5G and beyond URLLC systems. (arXiv:2201.05024v3 [eess.SP] UPDATED)
    In this feasibility study, we have implemented a recently proposed partially linear multiuser detection algorithm in reproducing kernel Hilbert spaces (RKHSs) on a GPU-accelerated platform. Partially linear multiuser detection, which combines the robustness of linear detection with the power of nonlinear methods, has been proposed for a massive connectivity scenario with non-orthogonal multiple access (NOMA). This is a promising approach, but detecting payloads within a received orthogonal frequency division multiplexing (OFDM) radio frame requires the execution of a large number of inner product operations, which are the main computational burden of the algorithm. Although inner-product operations consist of simple kernel evaluations, their vast number poses a challenge in ultra-low latency (ULL) applications, because the time needed for computing the inner products might exceed the sub-millisecond latency requirement. To address this problem, this study demonstrates the acceleration of the inner-product operations through massive parallelization. The result is a GPU-accelerated real-time OFDM receiver that enables sub-millisecond latency detection to meet the requirements of 5th generation (5G) and beyond ultra-reliable and low latency communications (URLLC) systems. Moreover, the parallelization and acceleration techniques explored and demonstrated in this study can be extended to many other signal processing algorithms in Hilbert spaces, such as those based on projection onto convex sets (POCS) and adaptive projected subgradient method (APSM) algorithms. Experimental results and comparisons with the state of the art confirm the effectiveness of our techniques.
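    The computational pattern being accelerated is, at its core, a large batch of kernel (inner-product) evaluations, which maps naturally onto one fused GPU call; a PyTorch sketch with assumed sizes and a Gaussian kernel:

```python
# All pairwise Gaussian-kernel evaluations in one GPU-parallel call (sketch).
import torch

dev = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(4096, 64, device=dev)        # received samples
D = torch.randn(512, 64, device=dev)         # dictionary / expansion points
K = torch.exp(-torch.cdist(X, D) ** 2 / 2.0) # 4096 x 512 kernel values at once
out = K @ torch.randn(512, device=dev)       # kernel expansion evaluated in bulk
print(out.shape)
```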
    New Lower Bounds for Private Estimation and a Generalized Fingerprinting Lemma. (arXiv:2205.08532v2 [cs.DS] UPDATED)
    We prove new lower bounds for statistical estimation tasks under the constraint of $(\varepsilon, \delta)$-differential privacy. First, we provide tight lower bounds for private covariance estimation of Gaussian distributions. We show that estimating the covariance matrix in Frobenius norm requires $\Omega(d^2)$ samples, and in spectral norm requires $\Omega(d^{3/2})$ samples, both matching upper bounds up to logarithmic factors. We prove these bounds via our main technical contribution, a broad generalization of the fingerprinting method to exponential families. Additionally, using the private Assouad method of Acharya, Sun, and Zhang, we show a tight $\Omega(d/(\alpha^2 \varepsilon))$ lower bound for estimating the mean of a distribution with bounded covariance to $\alpha$-error in $\ell_2$-distance. Prior known lower bounds for all these problems were either polynomially weaker or held under the stricter condition of $(\varepsilon,0)$-differential privacy.
    Fair and Green Hyperparameter Optimization via Multi-objective and Multiple Information Source Bayesian Optimization. (arXiv:2205.08835v1 [cs.LG])
    There is a consensus that focusing only on accuracy in searching for optimal machine learning models amplifies biases contained in the data, leading to unfair predictions and decision supports. Recently, multi-objective hyperparameter optimization has been proposed to search for machine learning models which offer equally Pareto-efficient trade-offs between accuracy and fairness. Although these approaches proved to be more versatile than fairness-aware machine learning algorithms -- which optimize accuracy constrained to some threshold on fairness -- they could drastically increase the energy consumption in the case of large datasets. In this paper we propose FanG-HPO, a Fair and Green Hyperparameter Optimization (HPO) approach based on both multi-objective and multiple information source Bayesian optimization. FanG-HPO uses subsets of the large dataset (aka information sources) to obtain cheap approximations of both accuracy and fairness, and multi-objective Bayesian Optimization to efficiently identify Pareto-efficient machine learning models. Experiments consider two benchmark (fairness) datasets and two machine learning algorithms (XGBoost and Multi-Layer Perceptron), and provide an assessment of FanG-HPO against both fairness-aware machine learning algorithms and hyperparameter optimization via a multi-objective single-source optimization algorithm in BoTorch, a state-of-the-art platform for Bayesian Optimization.
    Meta-Learning Sparse Compression Networks. (arXiv:2205.08957v1 [stat.ML])
    Recent work in Deep Learning has re-imagined the representation of data as functions mapping from a coordinate space to an underlying continuous signal. When such functions are approximated by neural networks this introduces a compelling alternative to the more common multi-dimensional array representation. Recent work on such Implicit Neural Representations (INRs) has shown that - following careful architecture search - INRs can outperform established compression methods such as JPEG (e.g. Dupont et al., 2021). In this paper, we propose crucial steps towards making such ideas scalable: Firstly, we employ state-of-the-art network sparsification techniques to drastically improve compression. Secondly, we introduce the first method allowing for sparsification to be employed in the inner loop of commonly used meta-learning algorithms, drastically improving both compression and the computational cost of learning INRs. The generality of this formalism allows us to present results on diverse data modalities such as images, manifolds, signed distance functions, 3D shapes and scenes, several of which establish new state-of-the-art results.
    Ranking of Communities in Multiplex Spatiotemporal Models of Brain Dynamics. (arXiv:2203.09281v2 [q-bio.NC] UPDATED)
    As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occur spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models we term Hidden Markov Graph Models (HMGMs). This interpretation allows for dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions.
    On the Efficiency of Entropic Regularized Algorithms for Optimal Transport. (arXiv:1906.01437v9 [cs.DS] UPDATED)
    We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as \textit{Greenkhorn}, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and help clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates as observed by~\citet{Altschuler-2017-Near}. Second, we propose a new algorithm, which we refer to as \textit{APDAMD} and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm~\citep{Dvurechensky-2018-Computational} with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$ in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a \textit{deterministic} accelerated variant of Sinkhorn via appeal to estimated sequence and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ and APDAGD and accelerated alternating minimization (AAM)~\citep{Guminov-2021-Combination} in terms of $n$. Finally, we conduct the experiments on synthetic and real data and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.
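    For orientation, the baseline Sinkhorn iteration these complexity results refer to (Greenkhorn instead updates only a greedily chosen row or column each step); uniform marginals and a random cost matrix are assumed:

```python
# Sinkhorn iterations for entropic OT between two n-atom measures (sketch).
import numpy as np

rng = np.random.default_rng(0)
n, eps = 50, 0.1
r = np.full(n, 1 / n)                  # source marginal
c = np.full(n, 1 / n)                  # target marginal
C = rng.uniform(size=(n, n))           # cost matrix
K = np.exp(-C / eps)
u = np.ones(n)
for _ in range(500):                   # alternating row/column scalings
    v = c / (K.T @ u)
    u = r / (K @ v)
P = u[:, None] * K * v[None, :]        # approximate transport plan
print(P.sum(), (P * C).sum())          # mass ~1, regularized OT cost
```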
    The Kernelized Taylor Diagram. (arXiv:2205.08864v1 [stat.ML])
    This paper presents the kernelized Taylor diagram, a graphical framework for visualizing similarities between data populations. The kernelized Taylor diagram builds on the widely used Taylor diagram, which is used to visualize similarities between populations. However, the Taylor diagram has several limitations, such as not capturing non-linear relationships and sensitivity to outliers. To address such limitations, we propose the kernelized Taylor diagram. Our proposed kernelized Taylor diagram is capable of visualizing similarities between populations with minimal assumptions about the data distributions. The kernelized Taylor diagram relates the maximum mean discrepancy and the kernel mean embedding in a single diagram, a construction that, to the best of our knowledge, has not been devised prior to this work. We believe that the kernelized Taylor diagram can be a valuable tool in data visualization.
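    The diagram's ingredients reduce to kernel mean embeddings and the MMD between two samples; a biased V-statistic estimate with an RBF kernel (bandwidth assumed) looks like this:

```python
# Biased MMD^2 estimate between two populations with an RBF kernel (sketch).
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=0.5):
    return rbf(X, X, gamma).mean() + rbf(Y, Y, gamma).mean() \
        - 2 * rbf(X, Y, gamma).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = rng.normal(loc=0.5, size=(200, 2))
print(mmd2(X, X[::-1]), mmd2(X, Y))    # near zero vs. clearly positive
```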
    Distribution-free Prediction Sets Adaptive to Unknown Covariate Shift. (arXiv:2203.06126v3 [stat.ME] UPDATED)
    Predicting sets of outcomes -- instead of unique outcomes -- is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift -- a prevalent issue in practice -- poses a serious challenge and has yet to be fully solved. In this paper, we propose a novel flexible distribution-free method, PredSet-1Step, to construct prediction sets that can efficiently adapt to unknown covariate shift. We formally show that our method is \textit{asymptotically probably approximately correct}, having well-calibrated coverage error with high confidence for large samples. We illustrate that it achieves nominal coverage in a number of experiments and a data set concerning HIV risk prediction in a South African cohort study. Our theory hinges on a new bound for the convergence rate of the coverage of Wald confidence intervals based on general asymptotically linear estimators. This is a technical tool of independent interest.
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v2 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). Then, we approximate the resultant GP by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). ICK is flexible and can be used to include prior information in neural networks in many applications. We demonstrate the strength of our framework by showing its superior performance and flexibility on both synthetic and real-world data sets. The code is available at: https://anonymous.4open.science/r/ICK_NNGP-17C5/.
    Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement. (arXiv:1810.09103v3 [cs.LG] UPDATED)
Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross-entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, which uses CCEM for the actor update, performs better than Soft AC and is much less sensitive to entropy regularization.
    A label efficient two-sample test. (arXiv:2111.08861v3 [cs.LG] UPDATED)
    Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed \emph{bimodal query} is used to query labels of samples from both classes, and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at \url{https://github.com/wayne0908/Label-Efficient-Two-Sample}.
    Model-based Clustering with Missing Not At Random Data. (arXiv:2112.10425v2 [stat.ML] UPDATED)
Traditional ways for handling missing values are not designed for the clustering purpose and they rarely apply to the general case, though frequent in practice, of Missing Not At Random (MNAR) values. This paper proposes to embed MNAR data directly within model-based clustering algorithms. We introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism. Eight different MNAR models are proposed, which may depend on the underlying (unknown) classes and/or the values of the missing variables themselves. We prove the identifiability of the parameters of both the data distribution and the mechanism, whatever the type of data and the mechanism, and propose an EM or Stochastic EM algorithm to estimate them. The code is available on \url{https://github.com/AudeSportisse/Clustering-MNAR}. We also prove that MNAR models for which the missingness depends on the class membership have the nice property that the statistical inference can be carried out on the data matrix concatenated with the mask by considering a MAR mechanism instead. Finally, we perform empirical evaluations for the proposed sub-models on synthetic data and we illustrate the relevance of our method on a medical register, the TraumaBase® dataset.
    Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks. (arXiv:2112.08866v3 [stat.ME] UPDATED)
    Recent advances in probabilistic deep learning enable amortized Bayesian inference in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations represent reality somewhat inaccurately? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of SNPE-C (APT) and the BayesFlow framework under these misspecifications. We propose an augmented optimization objective which imposes a probabilistic structure on the learned latent data summary space and utilize maximum mean discrepancy (MMD) to detect potentially catastrophic misspecifications during inference undermining the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase when the distance between the latent summary distributions of the true data-generating process and the training simulations grows. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized simulation-based Bayesian inference.
    FiLM: Frequency improved Legendre Memory Model for Long-term Time Series Forecasting. (arXiv:2205.08897v1 [cs.LG])
Recent studies have shown the promising performance of deep learning models (e.g., RNN and Transformer) for long-term time series forecasting. These studies mostly focus on designing deep models to effectively combine historical information for long-term forecasting. However, the question of how to effectively represent historical information for long-term forecasting has not received enough attention, limiting our capacity to exploit powerful deep learning models. The main challenge in time series representation is how to handle the dilemma between accurately preserving historical information and reducing the impact of noisy signals in the past. To this end, we design a \textbf{F}requency \textbf{i}mproved \textbf{L}egendre \textbf{M}emory model, or {\bf FiLM} for short: it introduces Legendre Polynomial projections to preserve historical information accurately and Fourier projections plus low-rank approximation to remove noisy signals. Our empirical studies show that the proposed FiLM improves the accuracy of state-of-the-art models by a significant margin (\textbf{19.2\%}, \textbf{22.6\%}) in multivariate and univariate long-term forecasting, respectively. In addition, dimensionality reduction introduced by low-rank approximation leads to a dramatic improvement in computational efficiency. We also demonstrate that the representation module developed in this work can be used as a general plug-in to improve the performance of most deep learning modules for long-term forecasting. Code will be released soon.
    SoQal: Selective Oracle Questioning for Consistency Based Active Learning of Cardiac Signals. (arXiv:2004.09557v3 [cs.LG] UPDATED)
Clinical settings are often characterized by abundant unlabelled data and limited labelled data. This is typically driven by the high burden placed on oracles (e.g., physicians) to provide annotations. One way to mitigate this burden is via active learning (AL), which involves the (a) acquisition and (b) annotation of informative unlabelled instances. Whereas previous work addresses either one of these elements independently, we propose an AL framework that addresses both. For acquisition, we propose Bayesian Active Learning by Consistency (BALC), a sub-framework which perturbs both instances and network parameters and quantifies changes in the network output probability distribution. For annotation, we propose SoQal, a sub-framework that dynamically determines whether, for each acquired unlabelled instance, to request a label from an oracle or to pseudo-label it instead. We show that BALC can outperform state-of-the-art acquisition functions such as BALD, and SoQal outperforms baseline methods even in the presence of a noisy oracle.
    Sharp asymptotics on the compression of two-layer neural networks. (arXiv:2205.08199v2 [cs.IT] UPDATED)
In this paper, we study the compression of a target two-layer neural network with N nodes into a compressed network with M < N nodes. More precisely, we consider the setting in which the weights of the target network are i.i.d. sub-Gaussian, and we minimize the population L2 loss between the outputs of the target and of the compressed network, under the assumption of Gaussian inputs. By using tools from high-dimensional probability, we show that this non-convex problem can be simplified when the target network is sufficiently over-parameterized, and provide the error rate of this approximation as a function of the input dimension and N. For a ReLU activation function, we conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF), while the scaling of the weights and the orientation of the ETF depend on the parameters of the target network. Numerical evidence is provided to support this conjecture.
    Bagged Polynomial Regression and Neural Networks. (arXiv:2205.08609v1 [stat.ML])
    Series and polynomial regression are able to approximate the same function classes as neural networks. However, these methods are rarely used in practice, although they offer more interpretability than neural networks. In this paper, we show that a potential reason for this is the slow convergence rate of polynomial regression estimators and propose the use of bagged polynomial regression (BPR) as an attractive alternative to neural networks. Theoretically, we derive new finite sample and asymptotic $L^2$ convergence rates for series estimators. We show that the rates can be improved in smooth settings by splitting the feature space and generating polynomial features separately for each partition. Empirically, we show that our proposed estimator, the BPR, can perform as well as more complex models with more parameters. Our estimator also performs close to state-of-the-art prediction methods in the benchmark MNIST handwritten digit dataset.
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v2 [cs.IT] UPDATED)
Unlike the classical linear model, nonlinear generative models have been addressed sparsely in the literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of Bayesian inference algorithms and specify the decoupled setting for a given nonlinear model. The replica solution depicts that strictly nonlinear models establish an all-or-nothing phase transition: There exists a critical load at which the optimal Bayesian inference changes from being perfect to uncorrelated learning. This finding leads to the design of a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. This interesting result implies that strictly nonlinear generative models are perfectly secured without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.

  • Open

    [D] Adversarial testing for Fairness and Biases
Is it worthwhile exposing ML models to the public for adversarial testing of whether they meet fairness conditions, given that there are so many edge cases to test for when deploying an ML model? (Similar to the concept of bug bounties in cybersecurity.) submitted by /u/blitzkreig3 [link] [comments]
    [P] Keras Launches a Computer Vision Extension Package
Keras has launched a computer vision extension package. Links: - https://keras.io/keras_cv/ - https://github.com/keras-team/keras-cv/ submitted by /u/puppet_pals [link] [comments]
    [N] Introducing Accelerated PyTorch Training on Mac
    https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/ submitted by /u/eigenlaplace [link] [comments]  ( 1 min )
[Discussion] Are there any better Topic Modelling algorithms/models than LDA?
As the title suggests, could someone point me to some new Topic Modelling algorithms that have come up recently and are in some way better? My use case is modelling tweets, if that helps! TIA EDIT: The approach is location-based, meaning I will be increasing the document length by merging nearby tweets. submitted by /u/mrnerdy59 [link] [comments]  ( 1 min )
[R] How is F1-score a better metric for unbalanced data sets?
Alright, so bear with me; I have an unbalanced data set. Let's say I have a +1 class accounting for 80% of my data set and a -1 class representing the remaining 20%. I fit some kind of model and report both accuracy and F1-score. F1-score is supposed to counteract the skewed data distribution, but since F1 is computed on the +1 class, which is the majority class, it's still going to be biased, right? To me, the answer would be taking the average F1-score over both classes, which sklearn calls the "macro" F1 score (I'm aware there are 2 very different takes on how "macro" metrics are computed), but people on the internet seem to get very angry when we use macro F1 for binary classification, although it sounds like a very reasonable choice to me. What do you guys think? Note: I'm doing a thesis in quantitative finance, and a lot of papers use accuracy to artificially boost their models' predictive power; I'm trying to show why this shouldn't be done and why macro F1 would yield much more realistic results. submitted by /u/delta9_ [link] [comments]  ( 2 min )
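A quick way to see the effect being argued about, sketched with made-up labels (the class skew matches the post; everything else is illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Skewed toy labels: 80% of samples are +1, 20% are -1.
rng = np.random.default_rng(0)
y_true = rng.choice([1, -1], size=1000, p=[0.8, 0.2])
# A useless "classifier" that always predicts the majority class.
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))             # ~0.80, looks fine
print(f1_score(y_true, y_pred, pos_label=1))      # ~0.89, still flattering
print(f1_score(y_true, y_pred, average='macro'))  # ~0.44, exposes the model
```

The always-majority baseline looks strong under accuracy and under F1 computed on the +1 class, but collapses under macro F1, which averages the per-class scores.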
    [D] Those at FAANG, do you use AutoML?
Do you use AutoML? I see Microsoft, Google, Amazon, etc. all have their own AutoML SDKs/services. submitted by /u/stevofolife [link] [comments]  ( 1 min )
    [D] KDD Notifications Thread
I don't think the data mining conference usually gets much discussion here. So anyway, hoping to hear that KDD might love me submitted by /u/EdwardRaff [link] [comments]
    [N] Flower Summit 2022
On May 31, 2022, the Flower Community will come together for the second Flower Summit 2022. Join experts in the field of federated learning and find out how Flower accelerates the development of systems in both research and production scenarios. All speakers and the corresponding time schedule are final now. You can expect speakers from Intel, Google/MLCommons, Brave, University of Cambridge, AI Sweden, and many more. Block your calendar and register now: https://flower.dev/conf/flower-summit-2022/ submitted by /u/burnai [link] [comments]  ( 1 min )
    [D] MNIST equivalent dataset for RNN/LSTM/Transformer?
Hi, I'm developing course material and looking for a nice introductory task/dataset for RNNs/LSTMs/Transformers. I can use recurrent networks for MNIST too, but I'm looking for a more "classic" example such as sequence prediction or classification. Is there any? Preferably a relatively simple, clean, and well-known dataset like MNIST. Thank you very much. submitted by /u/euske [link] [comments]  ( 2 min )
    [R] Handcrafted localized phase features for human action recognition
This paper https://paperswithcode.com/paper/handcrafted-localized-phase-features-for claims to achieve 98% top-1 accuracy on Kinetics-400 and 96.35% on Kinetics-700. From their description, they compute phase correlation on large patches between consecutive frames and then use that in a kNN classifier. I didn't find any extra info in the paper about the method, and frankly I find it hard to believe this beats all of the recent state-of-the-art methods. What do you think? Maybe a (possibly unintended) foul in the evaluation method? submitted by /u/AmirRosenfeld [link] [comments]  ( 1 min )
    [R] Learning the Dynamics of Physical Systems from Sparse Observations with Finite Element Networks
    submitted by /u/martenlienen [link] [comments]  ( 1 min )
    [D] Paper for Object Detection
Recently, my team and I created a dataset with annotated labels for object detection. We plan to make a paper out of it. However, the object detection part is quite trivial, by which I mean we try 3-4 approaches (SSD, Faster R-CNN, etc.) with some optimization (e.g., same parameters on all approaches, then fine-tune the best performer?). Is it good material for a paper? I think if we present the findings as a dataset plus benchmark experiments it would be sufficient. Should we try more complex pipelines? submitted by /u/giakou4 [link] [comments]  ( 2 min )
[D] Any recommended do's/don'ts for the rebuttal phase?
Rebuttal is the stage where preliminary reviews have been released to the authors. Now is the time for authors to address the concerns of reviewers. Do you have any essential do's/don'ts guidelines which you follow? They may be about writing responses which disagree with a reviewer, or perhaps about presenting results on additional experiments which were requested. submitted by /u/PaganPasta [link] [comments]  ( 1 min )
    [N] Apple Executive Who Left Over Return-to-Office Policy Joins Google AI Unit: Ian Goodfellow, a former director of machine learning at Apple, is joining DeepMind.
    According to an article published in Bloomberg, An Apple Inc. executive who left over the company’s stringent return-to-office policy is joining Alphabet Inc.’s DeepMind unit, according to people with knowledge of the matter. Ian Goodfellow, who oversaw machine learning and artificial intelligence at Apple, left the iPhone maker in recent weeks, citing the lack of flexibility in its work policies. The company had been planning to require corporate employees to work from the office on Mondays, Tuesdays and Thursdays, starting this month. That deadline was put on hold Tuesday, though. https://www.bloomberg.com/news/articles/2022-05-17/ian-goodfellow-former-apple-director-of-machine-learning-to-join-deepmind submitted by /u/hardmaru [link] [comments]  ( 2 min )
[D] Are there any tracking approaches using DeepSORT + a detector that isn't YOLO?
Hey, I've been wondering: all repositories I could find on the internet are YOLO detectors combined with DeepSORT. I couldn't find others; what I'm most interested in would be EfficientDet or CenterNet combined with DeepSORT. Are there any repositories you guys know of? Over the last few days I discovered my fascination for the TensorFlow 2 Object Detection API, and I was wondering if I could use my weights from there combined with DeepSORT. Best regards submitted by /u/HolidayLobster1355 [link] [comments]  ( 1 min )
  • Open

    Google AI explains how Assistant answers your follow-up questions
    submitted by /u/Zirius_Sadfaces [link] [comments]
    The dream of destruction (made with starryai)
    submitted by /u/akhlys98 [link] [comments]
What are your opinions on Vital Intelligence, the body vital scanning technology?
    submitted by /u/BossBossZR [link] [comments]
    AI Dream 48 - Space Odyssey HAL 9000
    submitted by /u/LordPewPew777 [link] [comments]
    Researchers use artificial intelligence to help autonomous vehicles avoid idling at red lights.
    submitted by /u/qptbook [link] [comments]
    Glass painting decor.
    submitted by /u/cookingandcraft [link] [comments]
This article showcases how to annotate semi-structured texts, whether PDFs or scanned images, using UBIAI's annotation tool
    submitted by /u/UBIAI [link] [comments]
    Yikes, the subbot has gone a little racist
    submitted by /u/orgeezuz [link] [comments]
    Using Deep Learning for Programming Language Translation
    submitted by /u/VikasOjha666 [link] [comments]
    Google has found a new use case for large AI language models: Job interviews.
    submitted by /u/much_successes [link] [comments]
    Purple Sunset (made with starryai)
    submitted by /u/Losthel [link] [comments]
    Your AI Weekly Digest (newsletter)! I just shared a new iteration covering BlobGAN
    submitted by /u/OnlyProggingForFun [link] [comments]
    Twitter Art Club
    submitted by /u/VIRUS-AOTOXIN [link] [comments]
    AIRS in the AIR: Modular Self-reconfigurable Robot (Session 2) Speakers: Michael Rubenstein and Tin Lun Lam
    submitted by /u/nousetest [link] [comments]  ( 1 min )
    Exploring the potential of MIDI sequence generation using Machine Learning - Survey
I am a student currently studying Software Development at MCAST. For my degree thesis, I am exploring the potential of MIDI sequence generation using Machine Learning techniques. To evaluate the implemented algorithm, I created a questionnaire which asks respondents to rate 10 different samples. No personally identifiable information will be collected in this questionnaire. I would greatly appreciate it if you could spare around 5 minutes to take part in this questionnaire. https://www.survio.com/survey/d/Y1W7D8P1X3J7F8U7Y submitted by /u/drinu98 [link] [comments]  ( 1 min )
9 Best Artificial Intelligence books for beginners to experts to read in 2022
    submitted by /u/maneesh123456 [link] [comments]
  • Open

    Logging in Python
    Logging is a way to store information about your script and track events that occur. When writing any complex script in Python, logging is essential for debugging software as you develop it. Without logging, finding the source of a problem in your code may be extremely time consuming. After completing this tutorial, you will know: […] The post Logging in Python appeared first on Machine Learning Mastery.  ( 22 min )
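As a minimal taste of the standard-library logging module (the format string and messages below are illustrative, not taken from the tutorial):

```python
import logging

# Configure the root logger once, near the top of your script.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)

logger.debug("Loaded 42 rows from the input file")  # detail for debugging
logger.info("Processing started")                   # normal progress events
logger.warning("Column 'x' has missing values")     # worth investigating
logger.error("Could not parse row 7")               # a recoverable failure
```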
  • Open

    DALL·E 2 Research Preview Update
    Early users have created over 3 million images to date and helped us improve our safety processes. We're excited to begin adding up to 1,000 new users from our waitlist each week.  ( 1 min )
  • Open

How to concatenate feature vectors of different lengths?
I have a very basic question. In an architecture like the one in the attached image, how do you concretely concatenate the feature vector coming from the conv nets and the attention WITH the feature vector coming from the scalar observations (direction, position, etc.)? One of them might be something like a 1x10000 vector, while the other two might be way shorter, say 1x4 for the direction and 1x2 for the position. https://arxiv.org/abs/2104.07750 submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
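One common answer, sketched below in PyTorch with sizes matching the post (a 10000-d flattened visual feature, 4-d direction, 2-d position; the embedding sizes are arbitrary), is to flatten everything to (batch, features) and concatenate along the feature dimension:

```python
import torch
import torch.nn as nn

batch = 8
conv_feat = torch.randn(batch, 10000)  # flattened conv/attention features
direction = torch.randn(batch, 4)
position = torch.randn(batch, 2)

# Option 1: plain concatenation along the feature dimension.
fused = torch.cat([conv_feat, direction, position], dim=1)   # (8, 10006)

# Option 2 (common in practice): embed the short vectors first so the
# long visual feature does not dominate purely by dimensionality.
embed_dir = nn.Linear(4, 64)
embed_pos = nn.Linear(2, 64)
fused2 = torch.cat([conv_feat, embed_dir(direction), embed_pos(position)], dim=1)
print(fused.shape, fused2.shape)  # (8, 10006) and (8, 10128)
```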
    Generative Trajectory Modelling : a "complete shift" in the Reinforcement Learning paradigm.
    submitted by /u/moschles [link] [comments]  ( 1 min )
Installing & Using MuJoCo 2.1.5 with OpenAI Gym
Hi all, I finally got my environment set up with MuJoCo, and now I would like to use it through OpenAI Gym to train some agents. I was reading that before DeepMind took it over, the installation process was very annoying. Is it still that way, or is there an easier solution meanwhile? Following the instructions on https://github.com/openai/mujoco-py, just with the 2.1.5 version, I get the following error when I try to import mujoco_py. Also, generally asked: is there a better alternative to OpenAI Gym for working with the newer MuJoCo versions? I'd be very thankful for any tips :) submitted by /u/disdisinform [link] [comments]  ( 2 min )
    10 Best Deep Reinforcement Learning Courses
    submitted by /u/MlTut [link] [comments]
Replacing model-based predictive control (MPC) with RL: does anyone have any resources that I can look into for such an activity? For now I'm just thinking about using an active MPC for "training".
    submitted by /u/veezion123 [link] [comments]  ( 1 min )
    Double DQN algorithms converge on only one action.
I have taken some reference implementations of the DDQN algorithm and am trying to create an agent which can trade in the forex market. Unfortunately, from the 2nd trial onwards (after training the DDQN for the first time), the probability distribution of the actions converges on only one action, and the loss and the reward fluctuate. Dataset: 13k; batch size: 64; update_rl: 6; learning rate: 0.001; gamma: 0.99; reward: -1 to 1 (depending on profit and loss). submitted by /u/laxuu [link] [comments]  ( 1 min )
    How to Leverage Reinforcement Learning • Phil Winder & Rebecca Nugent
    submitted by /u/goto-con [link] [comments]
    Are there any text-based game environments for RL agents to train on similar to Textworld or Jericho?
    submitted by /u/cz1xrnvz [link] [comments]  ( 1 min )
  • Open

    Use Amazon Lex to capture street addresses
    Amazon Lex provides automatic speech recognition (ASR) and natural language understanding (NLU) technologies to transcribe user input, identify the nature of their request, and efficiently manage conversations. Lex lets you create sophisticated conversations, streamline your user experience to improve customer satisfaction (CSAT) scores, and increase containment in your contact centers. Natural, effective customer interactions require […]  ( 10 min )
  • Open

    Part 2: Is Data Mesh Fool’s Gold? Not if You Avoid the Traps
    Wow, my blog “Is Data Mesh Fool’s Gold? Creating a Business-centric Data Strategy” created quite a stir.  And that was my intention. I actually believe that the Data Mesh is an important data management and governance framework (yes, the Data Mesh is more of a framework than a technology) for helping organizations deliver a business-driven… Read More »Part 2: Is Data Mesh Fool’s Gold? Not if You Avoid the Traps The post Part 2: Is Data Mesh Fool’s Gold? Not if You Avoid the Traps appeared first on Data Science Central.  ( 5 min )
  • Open

    Vector-Quantized Image Modeling with Improved VQGAN
    Posted by Jiahui Yu, Senior Research Scientist, and Jing Yu Koh, Research Software Engineer, Google Research In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora. This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results through pre-quantizing images into discrete integer …  ( 7 min )
  • Open

    How to Make Oracle AIs Safe
    The Counterfactual Oracle AI Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 4 min )
  • Open

    Living better with algorithms
    Graduate student Sarah Cen explores the interplay between humans and artificial intelligence systems, to help build accountability and trust.  ( 7 min )

  • Open

    [N] Gradio Blocks + Hugging Face event is starting this week. A hackathon type event from May 17th to May 31st with prizes in which we will create interactive web demos for state-of-the-art machine learning models
We are happy to invite you to the Gradio Blocks Party - a community event in which we will create interactive demos for state-of-the-art machine learning models. Demos are powerful because they allow anyone — not just ML engineers — to try out models in the browser, give feedback on predictions, and identify trustworthy models. The event will take place from May 17th to 31st. We will be organizing this event on Hugging Face: https://huggingface.co/Gradio-Blocks and the Hugging Face Discord channel. Prizes will be given at the end of the event; see the Prizes section. We will be building demos using the new Gradio Blocks API. Blocks allows you to build web-based demos in a flexible way using the Gradio library. Gradio is a popular choice for building demos for machine learning models, as it allows you to create web-based UIs all in Python. For example, a UI for DALL-E Mini has been built using Gradio Blocks. submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 1 min )
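For readers who have not seen it, a minimal Blocks demo looks roughly like the sketch below (the greet function and layout are illustrative, not part of the event announcement):

```python
import gradio as gr

def greet(name):
    return f"Hello, {name}!"

with gr.Blocks() as demo:
    inp = gr.Textbox(label="Name")
    out = gr.Textbox(label="Greeting")
    btn = gr.Button("Greet")
    btn.click(fn=greet, inputs=inp, outputs=out)  # wire the event to the function

demo.launch()
```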
    Are there any research groups that try to find quantitative laws of deep learning? [D]
I’m a graduate student in a deep learning application area. How model development works is that you have a recipe of tricks and guidelines you play around with and cross-validate different versions of until you find one that works well. Most research trying to improve accuracy on proxy datasets seems a waste of time to me, and I’m more interested in fundamental patterns in learning. There is this wealth of experiments the community is producing, but the theoretical insight is nothing more than a bunch of rules of thumb. I feel like it’s time to split apart experimental and theoretical deep learning, like experimental and theoretical physics co-exist, and try to figure out quantitative laws of deep learning. I get that deep learning is a non-convex optimisation problem and that we can’t prove much, but what we can do is try to figure laws out. Newton’s law of gravity was also not proven from some other axioms. While deep learning does not seem fully amenable to mathematical analysis, it is to physical analysis. I’m asking for some theories, not theorems. Anyone know of any work on this or work in progress? I’m not talking about physics-informed deep learning, DL applied to physics, or physical theory used to improve DL, just to be clear. Work I would find interesting would try to, from the wealth of experimental DL data, uncover fundamental, quantitative relationships between architecture, optimisation procedure, dataset and performance. submitted by /u/lemlo100 [link] [comments]  ( 2 min )
[D] Practical correlation metric for a large number of vectors
I am dealing with a time series consisting of input flow sampled every 5 minutes over 441 days. My aim is to find any possible correlation among data coming from: the same day of the week; the same moment in time (e.g., 14:35, 14:40, etc.). I proceeded to sample according to weekdays and hours, then computed the 63x63 correlation matrix for each of the weekdays and a 441x441 one for each hour, which in the second case is pretty impractical. I feel like the dataset is too broad to categorise it with a simple yes-or-no question, yet that's what I've been tasked with. So my question is whether I can try autocorrelation instead, and whether it would yield the p and q parameters of an ARIMA model, or would you suggest another, more succinct approach that may give a broader picture of the data? submitted by /u/Ambitious-Donut1321 [link] [comments]  ( 3 min )
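One succinct alternative to the 441x441 matrix is a single autocorrelation pass over the raw series: with 5-minute sampling there are 288 samples per day and 2016 per week, so a peak at lag 288 indicates a time-of-day pattern and a peak at lag 2016 a weekday pattern. A sketch on stand-in data (the date range and random values are placeholders for the real inflow series):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

# Hypothetical series: inflow sampled every 5 minutes over 441 days.
idx = pd.date_range("2021-01-01", periods=441 * 288, freq="5min")
flow = pd.Series(np.random.randn(len(idx)), index=idx)

r = acf(flow, nlags=2016)
print("lag 1 day:", r[288], "lag 1 week:", r[2016])
```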
    [D] What are the best information retrieval methods today?
    Given the advent of powerful language models using transformers, what IR systems are most effective today? Would love to read recent papers and blogs about the same! submitted by /u/evilBotman [link] [comments]  ( 1 min )
    [D] Is representative training data always a good thing?
More like this at: https://www.evilaicartoons.com/ submitted by /u/HAIL-9000 [link] [comments]
    [D] Could you technically use continuous numeric labels instead of one-hot labels for multiclass classification?
This would work almost like an embedding for label classes: every class corresponds to a vector of continuous numeric values. You could identify the correct class based on the nearest neighbouring label vector for a given inference. The only problem I see is how you would initialize the label embedding before training. One advantage I see is that you could add classes without retraining the network. submitted by /u/mrwafflezzz [link] [comments]  ( 1 min )
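A minimal sketch of the idea in PyTorch (fixed random unit vectors as the label embedding and MSE as the loss; both are just one possible choice, and all sizes are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, emb_dim, feat_dim = 10, 16, 32

# One fixed random unit vector per class: the "label embedding".
label_emb = F.normalize(torch.randn(num_classes, emb_dim), dim=1)

model = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(128, feat_dim)
y = torch.randint(0, num_classes, (128,))

# Regress onto the class vector instead of cross-entropy on one-hots.
loss = F.mse_loss(model(x), label_emb[y])
opt.zero_grad()
loss.backward()
opt.step()

# Inference: nearest label vector by cosine similarity.
sims = F.normalize(model(x), dim=1) @ label_emb.T  # (128, num_classes)
y_hat = sims.argmax(dim=1)
```

Adding a class then just means appending a new vector to label_emb, though the network has never been trained to map anything near it.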
    [D] Which software packages are used for evaluation of automatic summarization besides ROUGE?
    Is there any easily findable software besides ROUGE for this purpose? submitted by /u/Goldback_Gorilla [link] [comments]
    [D] Understanding audio sampling function for a speech synthesis WGAN
Hello, I've recently made a post on this subreddit asking for clarifications on a certain paper describing an implementation of WGAN-GP for speech synthesis from silent videos. Those answers were all really helpful in better understanding the learning process; however, more questions have cropped up as I began digging deeper. I'm currently attempting to train a hybrid model between the architectures described in these two papers, with a generator and objective function from the former and the critic and PASE blocks from the latter. After not seeing any signs of convergence after 12 hours of training (~20 epochs on 4 speakers from the GRID corpus, roughly 8000 seconds of data), I read further and found some info regarding the data sampling in both articles. The first one states: The audio clips …  ( 2 min )
    [D] Estimation of marginal likelihood - what's the SOTA
Hi! I face a problem in which I should estimate the marginal likelihood. My differentiable model is of moderate dimensionality (circa 50), and I do have a decent sample from the posterior. The model is expensive to calculate/propagate, and it's implemented in an AD framework. I checked some of the works in the field; many of them seem to be developed for cosmological data. I am aware of nested sampling, the subdomain methods of M. D. Weinberg, and Delaunay triangulation approaches. Still, as I am completely fresh to the field, I don't know what to expect w.r.t. the accuracy of the methods. What is your experience? Do you know of any decent comparison studies or other techniques worth exploring? submitted by /u/msusik [link] [comments]  ( 1 min )
    [D] How to choose dimensions for latent space ?
Hi, I want to do clustering on bioinformatics data. I have a matrix of 129 samples and 3000 features (OTUs). As a preprocessing step, I was thinking of reducing the dimensions of this matrix to see if I can achieve better clustering. I thought of using autoencoders to get a latent representation of the data and then applying, probably, k-means, and seeing the results. How can I choose the latent space dimensions? I have read some papers where they do the same, but they don't explain how they choose the dimensions. submitted by /u/grisp98 [link] [comments]  ( 5 min )
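A common, if unglamorous, way to pick the bottleneck size is to scan a few candidates and look for the elbow in reconstruction error. A rough sketch, assuming a plain fully connected autoencoder and random stand-in data of the same shape:

```python
import torch
import torch.nn as nn

X = torch.randn(129, 3000)  # stand-in for the 129 x 3000 OTU matrix

def train_ae(latent_dim, epochs=200):
    enc = nn.Sequential(nn.Linear(3000, 256), nn.ReLU(), nn.Linear(256, latent_dim))
    dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 3000))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(X)), X)
        loss.backward()
        opt.step()
    return loss.item()

# Look for the smallest dimension after which the error stops improving much.
for d in [2, 4, 8, 16, 32, 64]:
    print(d, train_ae(d))
```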
  • Open

    Ensemble Reinforcement Learning
    Could you point me to some papers that use ensembling techniques to improve efficiency of RL algorithms please? Is this an area that is actively researched? If so, what is the SOTA for it? submitted by /u/SirRantcelot [link] [comments]  ( 1 min )
How to teach an agent not to take invalid actions?
Hi, I'm working in an environment where there are 3 possible actions, but some of them are sometimes invalid/impossible, depending on the state. I am giving the key piece of information about the current state (binary 0/1) to the agent in the observation space (among other info), in hopes that it would figure out that when state=0 it cannot take action 1, but when state=1 it cannot take action 3. To do this, I have tried: (1) giving a huge penalty upon using an invalid move and ending the episode; (2) giving a huge penalty upon using an invalid move, not changing the state, and allowing the episode to continue. Neither of these seemed to have any effect, as the agent continued making invalid moves. Any advice or insight would be greatly appreciated! I'm pretty new to RL. Thank you! submitted by /u/VladimirB-98 [link] [comments]  ( 2 min )
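One approach the post does not mention, and that often works better than reward penalties, is action masking: set the logits of invalid actions to -inf before sampling, so the agent can never take them in the first place. A minimal sketch (the logits and mask values are hypothetical):

```python
import torch

def sample_valid_action(logits, valid_mask):
    # logits:     (num_actions,) raw policy outputs
    # valid_mask: (num_actions,) bool, True where the action is legal now
    masked = logits.masked_fill(~valid_mask, float('-inf'))
    probs = torch.softmax(masked, dim=-1)  # invalid actions get probability 0
    return torch.multinomial(probs, 1).item()

logits = torch.tensor([0.2, 1.5, -0.3])
valid = torch.tensor([False, True, True])  # e.g. action 0 is illegal in this state
action = sample_valid_action(logits, valid)
```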
    Simulating random RGB images and observation space for RL model
Hi, I'm a bit confused about the usage of RGB images in RL. I'm trying to create random tensors (meant to imitate a visual observation and an observation space) to feed into my model and see whether it works correctly. What I don't understand is: concretely, how would I randomly initialize a tensor for the observation space (for the init method) and one for the observation (for the forward method)? Would obs = torch.randn(4, 64, 64, 3) be a realistic observation? And how would I initialize the obs space? Second question: as far as I understood, the first dimension in PyTorch is the batch size. What does it represent exactly? submitted by /u/No_Possibility_7588 [link] [comments]  ( 2 min )
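A sketch of one reasonable setup, assuming 64x64 RGB frames and the classic gym API (the exact shapes are illustrative). Pixel observations are usually uint8 values in [0, 255], so uniform random integers make a more realistic fake observation than torch.randn, which produces unbounded floats; and the first dimension is simply the number of observations processed in one forward pass:

```python
import numpy as np
import torch
from gym import spaces

# __init__: observation space for 64x64 RGB frames with pixel values 0-255.
obs_space = spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8)

# forward: a fake batch of 4 observations.
obs = torch.randint(0, 256, (4, 64, 64, 3), dtype=torch.uint8)
assert obs_space.contains(obs[0].numpy())

# Networks typically want floats in [0, 1], channels first.
x = obs.float().div(255.0).permute(0, 3, 1, 2)  # (4, 3, 64, 64)
```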
    It is ok to roast me, but I need help about my project.
I know that Connect 4 is solved, but I am trying to make value-based RL agents play Connect 4 with MARL. I trained both agents many times, and the policy always converges to a very poor situation: they just commit to rushing four-in-a-row in one column, not trying to block the opponent, and the agent who goes first always wins after training for a long time. Here is my code. Please give me suggestions to improve; I know that MCTS is a good fit for this case, but I haven't learned it yet, as I'm currently focusing on value-based methods. Also, are eligibility traces dead? Note: I tried an ANN (flattening the state), and it converged to a stupid strategy faster; now I am using a CNN (and also get a stupid policy). submitted by /u/Professional_Card176 [link] [comments]  ( 2 min )
Observation vector comprising only the previous action and reward: Isn't that a multi-armed bandits problem?
Hello redditors of RL, I am doing joint research on RL and wireless comms, and I am observing a trend in a lot of the problem formulations people use there: sometimes, the observation vector of the "MDP" is defined as simply containing the past action and reward (usually without any additional information). Given that all algorithms collect experience tuples of (s, a, r, s'), would you agree with the following statements? (1) Assuming a discrete action space, if s_t contains only [a_{t-1}, r_{t-1}], isn't that the same as having no observations, since you already have this information in your experience tuple? (2) Taking it a step further, isn't that a multi-armed bandits scenario? I.e., assuming the stochastic process that generates the rewards is stationary, the optimal "policy" essentially always selects one action. This is not an MDP (or rather, it is only "trivially" an MDP), won't you agree? (3) Even if s_t includes other information, isn't the incorporation of [a_{t-1}, r_{t-1}] simply unnecessary? (4) Assuming a continuous action space, couldn't this problem be treated similarly to the (discrete) multi-armed bandits problem, as long as you adopt a parametric model for learning the distributions of the rewards conditioned on the actions? submitted by /u/SomeParanoidAndroid [link] [comments]  ( 2 min )
    Sustainability applications
    Is there any recent paper or work done applying RL in the field of sustainability and sustainable development? submitted by /u/blitzkreig3 [link] [comments]
  • Open

    Contextual Rephrasing in Google Assistant
    Posted by Aurelien Boffy, Senior Staff Software Engineer, and Roberto Pieraccini, Engineering Director, Google Assistant When people converse with one another, context and references play a critical role in driving their conversation more efficiently. For instance, if one asks the question “Who wrote Romeo and Juliet?” and, after receiving an answer, asks “Where was he born?”, it is clear that ‘he’ is referring to William Shakespeare without the need to explicitly mention him. Or if someone mentions “python” in a sentence, one can use the context from the conversation to determine whether they are referring to a type of snake or a computer language. If a virtual assistant cannot robustly handle context and references, users would be required to adapt to the limitation of the technology by…  ( 7 min )
  • Open

    Can artificial intelligence overcome the challenges of the health care system?
    MIT and Mass General Brigham researchers and physicians connect in person to bring AI into mainstream health care.  ( 6 min )
    On the road to cleaner, greener, and faster driving
    Researchers use artificial intelligence to help autonomous vehicles avoid idling at red lights.  ( 7 min )
  • Open

How much does one day of intensive OpenAI use cost?
    submitted by /u/huberpaul [link] [comments]  ( 1 min )
    Microsoft AI Team Introduces “Federated Learning Utilities and Tools for Experimentation” (FLUTE): A High-Performance Open-Source Platform For Federated Learning Research And Offline Simulations
    Distributed Training (DT), which focuses on scaling the model training process via model or data parallelism, has gotten much interest because of an increase in training datasets. On the other hand, DT makes some assumptions, particularly in terms of communication and network parameters. Furthermore, new data management restrictions are arising due to the growing requirement for personal data protection, making data more inaccessible due to storage behind firewalls or on users’ devices without the option of being shared for centralized training. Federated Learning is a decentralized machine learning approach that emphasizes collaborative training and data privacy for users. The central concept underlying federated learning is that these machine learning models are highly versatile when it comes to training sophisticated models over large amounts of data without having to share that data with a centralized body. Despite its popularity as a research topic, it is challenging to deploy since it differs significantly from typical machine learning pipelines. Local data variety, end-node hardware diversity, privacy concerns, and optimization limits are challenges in federated learning. Furthermore, federated learning applications frequently need to extend the learning process to millions of clients to imitate a real-world environment. These difficulties highlight the necessity for a simulation platform that allows researchers and developers to conduct proof-of-concept implementations and verify performance before creating and deploying their machine learning models. Quick Read Paper: https://arxiv.org/pdf/2203.13789.pdf Github: https://github.com/microsoft/msrflute submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
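To make the core idea concrete, here is a minimal sketch of the FedAvg-style weight averaging that underlies most federated learning systems; this is a generic illustration, not FLUTE's actual API:

```python
import copy
import torch
import torch.nn as nn

def fed_avg(client_states, client_sizes):
    # Average client weights, weighted by local dataset size (FedAvg).
    total = float(sum(client_sizes))
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return avg

# Two "clients" train local copies of the same model on private data;
# only the weights travel to the server, never the data.
clients = [nn.Linear(4, 2) for _ in range(2)]
new_state = fed_avg([c.state_dict() for c in clients], client_sizes=[100, 300])
server = nn.Linear(4, 2)
server.load_state_dict(new_state)
```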
    Overview of XGBoost and Gradient Boosting
    submitted by /u/aidev2040 [link] [comments]
    "I used AI to generate a music video from song lyrics" by [DoodleChos]
    submitted by /u/VegiHarry [link] [comments]  ( 1 min )
    I've finally launched MagicNote - a new platform where you can write better in your second language with GPT-3
Hello guys, I'm happy to announce I've finally launched MagicNote - a new platform where you can write better on demand, for your work or business, in your second language using GPT-3. https://magicnote.ai It was a wild ride getting here, but I finally made it after 10-11 months of hard work. The majority of my time was spent validating the idea, not the product. I've done my best to make my AI writing assistant as affordable as possible – it's one of the cheapest on the market. For me, this is more about making a difference in people’s lives than making money. To be honest, I developed it primarily for myself and am still using it. Communication with colleagues, clients, potential customers, or employees matters, especially if it is in a language foreign to you. MagicNote helps me improve the quality of my communication with such people while writing on demand. I just thought it could be beneficial to thousands of other people like me. Feel free to join the platform if you are one of them. Looking forward to your feedback. submitted by /u/data-gig [link] [comments]  ( 1 min )
    This video is part of my entry to the Future of Life Institute's Worldbuild AI project to envision a positive future with AGI in 2045. I'd love feedback!
    submitted by /u/Turil [link] [comments]  ( 1 min )
    [English Speakers] [Feedback] [Survey] [Populism] [AI] I would greatly appreciate your feedback on my survey
    Hello! For my Bachelor's thesis, I am examining the effects of populism on people's attitudes. Before I send out my survey for data collection I would like to ask for some of your feedback with the hopes of improving my questions and stimulus material. I would greatly appreciate it if you could take 5-7 minutes to fill out this survey and then let me know what you liked and disliked about the study. Thank you for your help! https://uvacommscience.eu.qualtrics.com/jfe/form/SV_3w9GzcvxQ7fWRzo submitted by /u/Psychological_Face69 [link] [comments]  ( 1 min )
    Javascript vs Python for variety of uses (AI, App Development, etc…)
I currently know Java basics well (I have completed AP CS A), but the language doesn't completely suit my needs. I am currently in the summer prior to starting my bachelor's in Computer Science. Over the next few months, I would like to begin learning artificial intelligence, and I would like to first learn the syntax of the language I'll use. Along with AI, I'd like to have the freedom of programming apps on both iOS and Android, and also maybe get into web development (maybe make basic websites where I can apply my knowledge in AI, create a personal portfolio, etc.). I would like the language that I will mainly use to be easily applicable in things like app and web development along with AI, so that I can apply the things I learn in AI through personal projects. This is really important for me, because as I was learning Java for school, I was extremely limited. For example, Java app development kept me stuck with Android, and I couldn't approach AI with Java. I'm leaning more towards JavaScript, as it's a language I've never tried, but I am not completely sure what choice will be the most suitable. submitted by /u/obvslynot [link] [comments]  ( 2 min )
    26 Images Created by Dall-E 2
    submitted by /u/kbf_ [link] [comments]
    Nvidia Canvas is still really cool for anyone waiting for DALL E 2
    submitted by /u/BeginningRealistic49 [link] [comments]
    The shell plugin I wrote writes your git commands
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
  • Open

    Overview of XGBoost and Gradient Boosting
    submitted by /u/aidev2040 [link] [comments]
    Machine Learning with Harsh
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
  • Open

    Mission Made Possible: Real-Time Rendering Helps Studio Create Cinematic Battle Between Characters From ‘Diablo Immortal’
    Real-time rendering is helping one studio take virtual production to impossible heights. In their latest project, the creators at Los Angeles-based company Impossible Objects were tasked with depicting an epic battle between characters from the upcoming video game, Diablo Immortal. But the showdown had to take place on the surface of a Google Pixel phone, Read article > The post Mission Made Possible: Real-Time Rendering Helps Studio Create Cinematic Battle Between Characters From ‘Diablo Immortal’ appeared first on NVIDIA Blog.  ( 3 min )
    AI on the Ball: Startup Shoots Computer Vision to the Soccer Pitch
    Eyal Ben-Ari just took his first shot on a goal of bringing professional-class analytics to amateur soccer players. The CEO of startup Track160, in Tel Aviv, has seen his company’s AI-powered sports analytics software tested and used in the big leagues. Now he’s turning his attention to underserved amateurs in the clubs and community teams Read article > The post AI on the Ball: Startup Shoots Computer Vision to the Soccer Pitch appeared first on NVIDIA Blog.  ( 3 min )
    Concept Artist Pablo Muñoz Gómez Enlivens Fantasy Creatures ‘In the NVIDIA Studio’
    Concept artist Pablo Muñoz Gómez dives In the NVIDIA Studio this week, showcasing artwork that depicts a fantastical myth. Gómez, a creator based in Australia, is equally passionate about helping digital artists, teaching 3D classes and running the Zbrush guides website with his creative specialties: concept and character artistry. The post Concept Artist Pablo Muñoz Gómez Enlivens Fantasy Creatures ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Customize pronunciation using lexicons in Amazon Polly
    Amazon Polly is a text-to-speech service that uses advanced deep learning technologies to synthesize natural-sounding human speech. It is used in a variety of use cases, such as contact center systems, delivering conversational user experiences with human-like voices for automated real-time status check, automated account and billing inquiries, and by news agencies like The Washington […]  ( 7 min )
  • Open

    Chemical element abbreviation patterns
    I’ve wondered occasionally about the patterns in how chemical elements are abbreviated. If you don’t know the abbreviation for an element, is there a simple algorithm that would let you narrow the range of possibilities or improve your odds at guessing? Here’s a survey of how the elements are abbreviated. Latin and German The elements […] Chemical element abbreviation patterns first appeared on John D. Cook.  ( 3 min )
  • Open

    How to Use an AI Story Generator to Write Your Stories
    PS: This entire article was written by an AI story generator: Jasper AI.  ( 4 min )
  • Open

    AI Harms are Societal, Not Just Individual
    Not just Individual, but Societal Harms Case Study: Privacy and surveillance Case Study: Disinformation and erosion of trust Individual Harms, Individual Solutions Parallels with Environmental Harms Directions Forward Not just Individual, but Societal Harms When the USA government switched to facial identification service ID.me for unemployment benefits, the software failed to recognize Bill Baine’s face. While the app said that he could have a virtual appointment to be verified instead, he was unable to get through. The screen had a wait time of 2 hours and 47 minutes that never updated, even over the course of weeks. He tried calling various offices, his daughter drove in from out of town to spend a day helping him, and yet he was never able to get a useful human answer on what he was su…  ( 7 min )

  • Open

    [D] How to nominate NeurIPS 2022 reviewer?
I think last year they had a link to nominate/self-nominate reviewers, but I couldn't find any this year. Do any of you know how to do it? submitted by /u/Deep_forgetting [link] [comments]
    [D] Anyone still using Stochastic Depth?
I recently studied ways to improve the training time of big neural networks, especially ResNets. On my way, I could not help but notice the big claims you can find in the paper Deep Networks with Stochastic Depth. To summarize informally, their contribution consists of a new hyperparameter for ResBlocks, which is used to skip the inner part of the residual connection with a given probability (they use 0.5 in their experiments). Quoting from the paper: "Let b ∈ {0, 1} denote a Bernoulli random variable [...] If b = 1, eq. (2) reduces to the original ResNet update and this ResBlock remains unchanged. If b = 0, the ResBlock reduces to the identity function." They not only claim a big improvement in training time: "Following the calculations above, approximately 25% of training time …"  ( 2 min )
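For reference, the mechanism quoted above fits in a few lines of PyTorch. The sketch below is one straightforward reading of the paper (inner path kept with probability 0.5 during training; at test time the residual branch is scaled by its survival probability, as the paper prescribes); the names are my own:

```python
import torch
import torch.nn as nn

class StochasticDepthResBlock(nn.Module):
    # Residual block whose inner path is dropped with probability 1 - p_survive.
    def __init__(self, inner, p_survive=0.5):
        super().__init__()
        self.inner = inner  # e.g. the usual conv-bn-relu-conv-bn stack
        self.p_survive = p_survive

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p_survive:
                return x + self.inner(x)  # b = 1: the original ResNet update
            return x                      # b = 0: identity, inner path not computed
        # Test time: scale the residual branch to keep its expected value.
        return x + self.p_survive * self.inner(x)

block = StochasticDepthResBlock(nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8)))
y = block(torch.randn(2, 8))
```

Note that when a block is dropped during training, its inner path is never computed, which is where the training-time savings come from.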
    [R] Survey on Misuse of NLP Research
    We at CopeNLU and the Digital Democracies Institute are currently running an online survey on the potential harms and misuses of Natural Language Processing technologies and research. We, therefore, ask researchers in the field of natural language processing to fill out the following survey to give us an insight into their concerns. We would really appreciate it if you could take a few minutes to fill out the survey. The survey takes about 20 minutes to complete and is available here: copenlu.limesurvey.net/987789 submitted by /u/frimelle [link] [comments]  ( 1 min )
    [P] A quick tip on DataFrame.apply
I have been using pandas for years in many projects, but I still feel pain when applying a function to multiple columns. So I recently developed a tool for our project, making life easier for data scientists like me. The pandas.DataFrame we use every day:

```python
import pandas as pd
import towhee

# create a dataframe
df = pd.DataFrame({'a': range(5)})
#    a
# 0  0
# 1  1
# 2  2
# 3  3
# 4  4
```

By wrapping the pandas.DataFrame with Towhee, we get runas_op as an alternative to DataFrame.apply:

```python
# a data collection is a wrapper for a dataframe
dc = towhee.from_df(df)
dc.runas_op['a', 'b'](func=lambda x: x+1)
#    a  b
# 0  0  1
# 1  1  2
# 2  2  3
# 3  3  4
# 4  4  5
```

runas_op['a', 'b'](func=...) tells towhee to use column a as input and column b as output. For functions with multiple inputs, we can use a tuple to specify the input columns:

```python
dc.runas_op[('a', 'b'), 'c'](func=lambda x, y: x + y)
#    a  b  c
# 0  0  1  1
# 1  1  2  3
# 2  2  3  5
# 3  3  4  7
# 4  4  5  9
```

We can also use a tuple for multiple outputs:

```python
# apply a multiple-output function
dc.runas_op['c', ('d', 'e')](func=lambda x: (x+1, x-1))
#    a  b  c  d   e
# 0  0  1  1  2   0
# 1  1  2  3  4   2
# 2  2  3  5  6   4
# 3  3  4  7  8   6
# 4  4  5  9  10  8
```

Towhee provides a method-chaining style API, making the code easy to follow:

```python
df = pd.DataFrame({'a': range(5)})
dc = towhee.from_df(df) \
    .runas_op['a', 'b'](func=lambda x: x+1) \
    .runas_op[('a', 'b'), 'c'](func=lambda x, y: x + y) \
    .runas_op['c', ('d', 'e')](func=lambda x: (x+1, x-1))
```

To convert the data back to a pandas.DataFrame:

```python
new_df = dc.df
type(new_df)
# pandas.core.frame.DataFrame
```

The project's homepage is https://github.com/towhee-io/towhee, and you can find more about towhee by going through the documents. Would appreciate some feedback and contribution 🙂 submitted by /u/ok-reiase [link] [comments]  ( 2 min )
    [D] Colab GPU Assignment Algorithm
Does anyone have a sense of how Colab determines what GPU you get? I have a Pro+ subscription, and for a period of a month or two I was regularly getting A100s. As a result I wrote some software that requires 40GB of memory. Now I can't seem to ever get an A100. Why can't Colab give any information on this issue? Additionally, I created a new Colab account, hoping to get the 'new user' bump. After 24 hours on an A100, the new account can't even get a V100. submitted by /u/anon135797531 [link] [comments]  ( 1 min )
    [D] ML Datasets with instance interactions
Hello, I'm looking for datasets that contain the interaction label of 2 instances, in order to test a DL model. Any type of dataset with rows of the form (instance_1, instance_2, interaction_label) would be a great fit! I was thinking that I could also "handcraft" such datasets from a classification dataset, where I set the interaction of 2 instances from the same class to 1, and the interaction of 2 instances from different classes to 0. However, I'd like to discover dedicated datasets with interactions instead. If you are aware of such datasets, could you please point me to them? Thank you in advance! submitted by /u/Yuu_Aky [link] [comments]  ( 1 min )
    [N] TorchRL: PyTorch pre-release RL library is here!
The PyTorch ecosystem team has open-sourced TorchRL, the RL-dedicated PyTorch library. It's still WIP and hasn't been officially released yet, but it's already good enough to be used in common research settings, including online/offline, on-policy/off-policy, meta-RL and such. It is quite efficient for a series of tasks: for model ensembling and meta-RL it leverages functorch's capabilities, and some functions are highly optimized to run efficiently on CUDA (e.g., TD(lambda) returns). Examples currently include SAC, DDPG, PPO, REDQ and DQN. Let us know what you think of it; issues and PRs are welcome! Buy it, use it, break it, fix it... Docs and tutorials to come soon! submitted by /u/AdCool8270 [link] [comments]  ( 1 min )
[D] Why do top speech/audio conferences like ICASSP and Interspeech have very high acceptance rates like 46%-48%?
I have heard from my fellow Ph.D. students and post-docs that the lower the acceptance rate of a conference, the higher its reputation. I see that this is true for NeurIPS, ICLR, ICML, ACL, etc. But ICASSP and Interspeech are considered top conferences for speech/audio applications. So why are their acceptance rates so high? submitted by /u/Far_Conversation_445 [link] [comments]  ( 3 min )
[D] Using U-Net for 1D segmentation from 2D images?
Hi all! I am working on building a scanner to digitize film strips using a rectangular area sensor. This means I lack the full context of the film during capture, but I still need to align the images properly to the sensor so that resolution can be maximized with automatic capture and film alignment.

The issue at hand is to segment the captured images into "image on the film" and "gap between images" areas. I have tried traditional CV methods, but issues arise with poorly exposed films: images can become very clear, pretty much as clear as the gaps between them, with only some sparse image elements visible. This is why I believe the full context of the 2D image is necessary for segmentation.

Since the captured images always have the same orientation, and gaps between images are always vertical, I believe a 1D output would be sufficient, i.e. "iiiigggiiiiiii" where "g" is a gap column and "i" is an image column. The segmented areas should *always* span the full height of the image and have vertical borders if the output of the segmentation is a 2D image, hence why I believe a 1D output is enough.

Is this something that can be done with a U-Net (see the sketch below)? I am very much a beginner with this, but seeing how U-Nets can output single-channel 2D images from 3-channel 2D color inputs, I was wondering if this could be done. Thanks a lot in advance! submitted by /u/iAmTheAlchemist [link] [comments]  ( 2 min )
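One way to get a per-column (1D) prediction from a 2D input is to pool out the height dimension after the convolutional features; a minimal sketch (not a full U-Net, and the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class ColumnSegmenter(nn.Module):
    """Predicts one image/gap logit per pixel column of a 2D input."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv1d(32, 1, kernel_size=1)

    def forward(self, x):               # x: (B, 3, H, W)
        f = self.features(x)            # (B, 32, H, W)
        f = f.mean(dim=2)               # pool out height -> (B, 32, W)
        return self.head(f).squeeze(1)  # (B, W) per-column logits

logits = ColumnSegmenter()(torch.randn(1, 3, 64, 256))  # -> (1, 256)
```

Training against per-column image/gap labels with BCEWithLogitsLoss would match the "iiigggiii" formulation described in the post.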
    [D] Is it possible to submit a pure math paper to NeurIPS?
    I have some theoretical results on a popular algorithm in ML (score-based generative models). But I am from a maths background. All I do is state these results and prove them. The proofs are not extremely deep or involved, but they are rigorous (so maybe a bit pedantic) and would require the reviewer to actually know some stochastic analysis. I do some small numerical experiments on toy data sets in 2 dimensions to illustrate the results. Are there any tips on how I could maximize my chances of such a paper getting published in NeurIPS? Or are my chances very low? submitted by /u/future_gcp_poweruser [link] [comments]  ( 3 min )
    [D] Using Inception and FID scores in training?
Is it possible to use the Inception and FID scores in the training of a deep image generation model, i.e., to directly optimize the scores through the loss function, even though this is "cheating"? If so, has anyone / any paper done it? Thanks for any pointer. submitted by /u/thanrl [link] [comments]  ( 1 min )
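For reference, FID is computed from the Gaussian statistics of Inception features of real and generated samples, roughly as below (a sketch of the standard formula; using it directly as a training loss would additionally require a differentiable matrix square root):

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Frechet Inception Distance between two (N, D) feature arrays."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * covmean))
```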
    [News] New Google tech - Geospatial API uses computer vision and machine learning to turn 15 years of street view imagery into a 3d canvas for augmented reality developers
    submitted by /u/imaginfinity [link] [comments]  ( 4 min )
  • Open

Batch size in Spinning Up's PPO?
Proximal Policy Optimization — Spinning Up documentation (openai.com) I'm trying to understand the code in Spinning Up's implementation of PPO. It seems that when training the policy and value function networks, they use a batch size of 4000 instead of minibatches. Does anyone know why? Wouldn't minibatches behave better (see the sketch below)? submitted by /u/Traditional-Brother9 [link] [comments]  ( 1 min )
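For contrast with the full-batch update, a minibatched PPO policy update usually looks something like this (a minimal sketch with assumed names for the buffers and loss; not Spinning Up's actual code):

```python
import torch

def ppo_update(policy_loss_fn, optimizer, data, epochs=10, minibatch=64):
    """data: dict of equal-length tensors (e.g. obs, act, adv, logp_old)."""
    n = next(iter(data.values())).shape[0]
    for _ in range(epochs):
        idx = torch.randperm(n)                       # reshuffle each epoch
        for start in range(0, n, minibatch):
            mb = idx[start:start + minibatch]
            batch = {k: v[mb] for k, v in data.items()}
            optimizer.zero_grad()
            loss = policy_loss_fn(batch)              # clipped PPO surrogate
            loss.backward()
            optimizer.step()
```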
    Entity Gym: A new entity based API for reinforcement learning environments
    submitted by /u/Programmierer [link] [comments]  ( 1 min )
    Can n step learning deal with sparse reward?
    submitted by /u/Professional_Card176 [link] [comments]
    "Emergent bartering behaviour in multi-agent reinforcement learning", Johanson et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    How do you choose the size of your observation space when you create a custom environment?
Perhaps a naive question, but this is the first time I've created a custom env: how do you choose the size of, for example, the RGB image you're passing as an observation to the agent? It could be a 24x24x3 image, or it could be a 200x200x3 one, so based on what principle should I choose it? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
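For what it's worth, many pixel-based RL baselines downsample observations to 84x84, the convention inherited from the DQN Atari preprocessing; declaring such a space in a custom gym env looks roughly like this (a minimal sketch):

```python
import numpy as np
import gym
from gym import spaces

class PixelEnv(gym.Env):
    """Skeleton showing only the space declarations; step() omitted."""
    def __init__(self, side=84):
        super().__init__()
        # 84x84 follows the common Atari preprocessing convention; larger
        # images cost compute and memory, smaller ones may drop detail the
        # task needs, so start from what the task visually requires.
        self.observation_space = spaces.Box(
            low=0, high=255, shape=(side, side, 3), dtype=np.uint8)
        self.action_space = spaces.Discrete(4)

    def reset(self):
        return np.zeros(self.observation_space.shape, dtype=np.uint8)
```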
    PPO - Log Std as trainable parameter?
In many implementations of the PPO algorithm I see the std of the policy distribution implemented as a learnable parameter, for example here: https://intellabs.github.io/coach/components/agents/policy_optimization/ppo.html. However, this parameter does not seem to change much during training, independently of its start value. I would expect it to converge to some common value no matter how I set it initially; otherwise it is just dependent on its initial value, which I find very hard to tune. How do you implement the std of the policy distribution, and how do you set its initial value? I would be glad for any recommendations. submitted by /u/flxh13 [link] [comments]  ( 1 min )
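The pattern referred to above is typically a state-independent log-std stored as a trainable parameter next to the mean network; a minimal sketch (the init value of -0.5 is just one common choice, not a recommendation):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, init_log_std=-0.5):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        # state-independent log-std, learned jointly with the mean network
        self.log_std = nn.Parameter(torch.full((act_dim,), init_log_std))

    def forward(self, obs):
        return Normal(self.mu_net(obs), self.log_std.exp())
```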
    n-step TD vs λ-return? (performance)
    submitted by /u/Professional_Card176 [link] [comments]
    What's the point of the actor in actor critic
Since the actor seems to be dictated by the critic, I'm not quite sure what the point of the actor is. Of course the actor acts out the action, but could you not just let the critic decide the action, for example by taking the highest Q value, or by converting the Q values to probabilities if you want a stochastic approach? submitted by /u/Jobdriaan [link] [comments]  ( 2 min )
    Training a RL model to fit a curve
I'm attempting to train an RL model to fit a curve and would appreciate your input to improve the performance. TL;DR: If anyone has an open-source environment that does something similar to curve fitting, I would appreciate having a look at it! My setup is as follows (see the sketch below): We have a true curve y_true, evaluated at regular intervals on some domain x. We also have a method to generate a simulated curve y_sim on the same domain x. The method for simulating the curve depends on a set of parameters. The action space is a box of the same dimension as the number of parameters of the curve simulation method. The parameters are then updated with each action as: params += 0.01*action (the actions range between -1 and 1, so 0.01 ensures the actions don't modify the parameters too drastically …  ( 3 min )
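I'm not aware of a canonical open-source curve-fitting env, but the setup described above could be skeletonized like this (a minimal sketch; simulate(), the observation, and the reward are placeholder assumptions, not the poster's actual choices):

```python
import numpy as np
import gym
from gym import spaces

class CurveFitEnv(gym.Env):
    def __init__(self, x, y_true, simulate, n_params):
        super().__init__()
        self.x, self.y_true, self.simulate = x, y_true, simulate
        self.action_space = spaces.Box(-1.0, 1.0, shape=(n_params,),
                                       dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=(len(x),), dtype=np.float32)
        self.params = np.zeros(n_params)

    def step(self, action):
        self.params += 0.01 * action                   # small parameter nudges
        y_sim = self.simulate(self.x, self.params)
        reward = -np.mean((y_sim - self.y_true) ** 2)  # placeholder reward
        obs = (y_sim - self.y_true).astype(np.float32)
        return obs, reward, False, {}                  # termination omitted

    def reset(self):
        self.params = np.zeros_like(self.params)
        y_sim = self.simulate(self.x, self.params)
        return (y_sim - self.y_true).astype(np.float32)
```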
    Are the unsupervised RL experiments carried out correctly?
Hi. I recently studied unsupervised RL, which pre-trains agents with task-agnostic rewards and then fine-tunes them on downstream tasks. But everything I saw on URLB (the unsupervised RL benchmark) was worse than training from scratch. In my case, I didn't check all configurations because of limited resources, so I wonder about your experience. I couldn't get a clear answer from the repository owner or the authors.

walker run and stand, state-based, action repeat 1, pretrain 2M, batch size 1024, finetune batch size 256: scratch >> apt-icm
walker run and stand, pixel-based, action repeat 2, pretrain 2M, batch size 1024, finetune batch size 256: scratch > apt-icm

Thank you for reading. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
  • Open

    Lessons From Deploying Deep Learning To Production
    submitted by /u/regalalgorithm [link] [comments]
    Last Week in AI: Enzyme developed with AI to decompose plastic, Hugging Face reaches $2 billion valuation, new ambitious EU AI Act, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    Breakthrough Google Deepmind Gato General AI Does 600+ Tasks | AI Robot Arm To Disarm Bombs
    submitted by /u/SlightSituation [link] [comments]
    AI Evolution: The Historic Timeline of AI Milestones
    Most folks think Artificial Intelligence (AI) is a novel notion, although it's been around for a long time. We went back in history and curated a list of all key artificial intelligence breakthroughs that have enabled us to live our current lives. Read more submitted by /u/ridamughal110 [link] [comments]
    AI Accelerators - Hardware For Artificial Intelligence
    CPUs were not as powerful and efficient a few decades ago when it came to running large computations for machine learning. Hardware manufacturers have worked hard to create a processing unit capable of performing any AI operation. Read more submitted by /u/ridamughal110 [link] [comments]
    Inworld AI's Ilya Gelfenbeyn and Kylan Gibbs talked about the tech behind their new developer platform for building AI-driven virtual characters and shared how to integrate their solution into games.
    submitted by /u/80lv [link] [comments]  ( 1 min )
    Can large language models be democratized?
    submitted by /u/bendee983 [link] [comments]
    Is it possible to program AI to detect cringe?
    submitted by /u/SimonCZ2 [link] [comments]
    A short story written by the AI - " Henry The Mighty Time Traveler "
    submitted by /u/Alive_Ad_2882 [link] [comments]
Is Gato Really the Future of AI? (Paper Motivation)
    submitted by /u/bartturner [link] [comments]
    AI poetry in rhyme proving difficult
Hi, has anyone managed to get an AI to produce poetry in rhyme? I have tried to teach GPT-3 and Jurassic-1 to rhyme in ABAB style, after giving them around 50 example verses. Here is my prompt: "Human: You are AI, a brilliant AI poet. You write deep, beautiful and meaningful poetry in ABAB style. That means that in each 4 line verse, the last word in line 1 rhymes with the last word in line 3, and the last word in line 2 rhymes with the last word in line 4. Here are some examples of ABAB poems. Study how they are structured. Afterwards I will ask you to write some ABAB poems like these:" Then I gave all the examples. Then I ended with: "Human: That was all the examples. Please write some nice poetry in ABAB style like those poems. Each verse has to be 4 lines. AI: " I have tried various tweaks to the prompt, but the models really struggle to comprehend rhyme. I have seen perhaps two lines rhyming here and there, but it might be chance. I am using the largest models. Has anyone had any luck getting the models to rhyme? submitted by /u/Dinosaur-Owl [link] [comments]  ( 1 min )
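For anyone who wants to reproduce this, the setup above maps onto the completions API of the time roughly as follows (a minimal sketch; the engine name and sampling parameters are my assumptions, not the poster's settings, and the example verses are elided):

```python
import openai

openai.api_key = "sk-..."  # your API key

prompt = (
    "Human: You are AI, a brilliant AI poet. You write deep, beautiful and "
    "meaningful poetry in ABAB style. [...] Here are some examples of ABAB "
    "poems. Study how they are structured.\n"
    # ~50 example verses would go here
    "Human: That was all the examples. Please write some nice poetry in "
    "ABAB style like those poems. Each verse has to be 4 lines.\n"
    "AI: "
)

response = openai.Completion.create(
    engine="text-davinci-002",  # assumed; the poster only says 'largest models'
    prompt=prompt,
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].text)
```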
    AI Implementation roadmap
    Take a look at AI implementation roadmap with real case examples https://www.toolbox.com/tech/artificial-intelligence/guest-article/ai-implementation-what-does-it-take-to-adopt-artificial-intelligence-in-business/#SocialMediaInterests submitted by /u/lklimusheuskaja [link] [comments]
Craft Album.
    submitted by /u/cookingandcraft [link] [comments]
    How to Leverage Conversational Chatbots to 10x Your E-commerce Sales?
    submitted by /u/mihircontra20 [link] [comments]
    Introduction to OpenCV and Image Processing with Python
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Google maps immersive view - uses AI and computer vision to fuse billions of images with real-time traffic and weather, creating a 3d simulation of the world that shows you the vibe of a place
    submitted by /u/imaginfinity [link] [comments]  ( 2 min )
    Cambridge AI Researchers Propose ‘MAGIC’: A Training-Free Framework That Plugs Visual Controls Into The Generation Of A Language Model
The release of Generative Pretrained Transformer (GPT-2) drew huge attention to generative language models (LMs), which are pre-trained on massive amounts of unstructured text and have produced strong results on a variety of NLP applications. LMs can generate text continuously using a textual prompt's next-token-prediction decoding approach. Models such as CLIP and ALIGN, pre-trained image-text joint embedding approaches, have revived multimodal representation learning of text and images. Accordingly, it is challenging to integrate the benefits of pre-trained LMs and image-text embedding models to generate visually grounded text. The traditional approaches are generally limited by the object detectors trained with a fixed set of labels. Currently, the ZeroCap approach is ut…  ( 2 min )
Are There Any Good Entirely Free Text-to-Image AI Generators that have APIs?
    I have used wombo and it works really well, but there is no API for it. Are there any other Free Text-to-Image generators that have an API? submitted by /u/xbftw [link] [comments]
  • Open

    BREAKTHROUGH Google Deepmind Gato General AI Does 600+ Tasks With One Transformer Neural Network
    submitted by /u/getrich_or_diemining [link] [comments]
    Social Media networks as aggregate neural networks to explain political polarization and radicalization?
With the recent tragedies in the USA, I find myself wondering whether the advancing science of neural networks, machine learning, deep learning, etc. could be used to model or explain how social networks function as a sort of aggregate neural network: one that "learns" and is "trained" like a classical neural network, but that can also gain new nodes, and whose pathways are "trained" in a way that literally breeds extremist views and polarization. I searched on JSTOR, etc., but was wondering if anyone in the community knew of someone, or of research, going on in this vein? submitted by /u/1nvent [link] [comments]  ( 1 min )
    Suggestion Regarding ML(Product Selector/Recommender) task
Hello all, I am a student working on a B2B project called "Digital Product Selector based on questions". For example, if you go to a health care company's website and want to find out which of their products suits you best, you would answer a set of questions on the website and, based on your answers, it would recommend a product or products of that specific company. We have a static, rule-based algorithm that works fine when there are few products and few questions to answer for a specific company. However, for a company with a huge product list and more than 20 sets of questions, the algorithm takes a significant amount of time to produce recommendations, since it is rule-based and bogged down in calculations. Now, I want to replace the rule-based/static algorithm with a machine learning algorithm that I can train on my data. Please answer the following: 1) Can you recommend any pre-trained neural networks that I could use for this use case? 2) Which problem statement does this use case belong to? 3) We have recently started to work with AWS; if there are any AWS services available to address this use case, please recommend them. submitted by /u/mubashir_ali93 [link] [comments]  ( 1 min )
    Breaking into the black box of artificial intelligence
    submitted by /u/nickb [link] [comments]
  • Open

    Mental secure hash function
A few years ago I wrote about Manuel Blum's proposed method for mentally computing a secure hash function. He proposed using this method as a password manager, using the hash of a web site's name as the password for the site. I first wrote about Blum's method on the Heidelberg Laureate Forum blog, then wrote […] Mental secure hash function first appeared on John D. Cook.  ( 5 min )
  • Open

    Personalize your machine translation results by using fuzzy matching with Amazon Translate
    A person’s vernacular is part of the characteristics that make them unique. There are often countless different ways to express one specific idea. When a firm communicates with their customers, it’s critical that the message is delivered in a way that best represents the information they’re trying to convey. This becomes even more important when […]  ( 8 min )
  • Open

    FLUTE: A scalable federated learning simulation platform
    Federated learning has become a major area of machine learning (ML) research in recent years due to its versatility in training complex models over massive amounts of data without the need to share that data with a centralized entity. However, despite this flexibility and the amount of research already conducted, it’s difficult to implement due […] The post FLUTE: A scalable federated learning simulation platform appeared first on Microsoft Research.  ( 6 min )
  • Open

    A robust approach for deep neural networks in presence of label noise: relabelling and filtering instances during training. (arXiv:2109.03748v2 [cs.LG] UPDATED)
    Deep learning has outperformed other machine learning algorithms in a variety of tasks, and as a result, it is widely used. However, like other machine learning algorithms, deep learning, and convolutional neural networks (CNNs) in particular, perform worse when the data sets present label noise. Therefore, it is important to develop algorithms that help the training of deep networks and their generalization to noise-free test sets. In this paper, we propose a robust training strategy against label noise, called RAFNI, that can be used with any CNN. This algorithm filters and relabels instances of the training set based on the predictions and their probabilities made by the backbone neural network during the training process. That way, this algorithm improves the generalization ability of the CNN on its own. RAFNI consists of three mechanisms: two mechanisms that filter instances and one mechanism that relabels instances. In addition, it does not suppose that the noise rate is known nor does it need to be estimated. We evaluated our algorithm using different data sets of several sizes and characteristics. We also compared it with state-of-the-art models using the CIFAR10 and CIFAR100 benchmarks under different types and rates of label noise and found that RAFNI achieves better results in most cases.
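The filter-and-relabel idea can be sketched as follows (a minimal illustration with made-up fixed thresholds; the paper's actual mechanisms are its own and do not require a known noise rate):

```python
import numpy as np

def filter_and_relabel(probs, labels, relabel_t=0.95, keep_t=0.2):
    """probs: (N, C) softmax outputs from the backbone; labels: (N,) ints."""
    preds = probs.argmax(1)
    conf = probs.max(1)
    # relabel instances the network confidently disagrees with
    relabel = (conf > relabel_t) & (preds != labels)
    labels = np.where(relabel, preds, labels)
    # filter instances whose (possibly new) label has very low probability
    keep = probs[np.arange(len(labels)), labels] > keep_t
    return labels[keep], keep
```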
    Automatic Monitoring of Fruit Ripening Rooms by UHF RFID Sensor Network and Machine Learning. (arXiv:2204.12415v2 [eess.SY] UPDATED)
    Accelerated ripening through the exposure of fruits to controlled environmental conditions and gases is nowadays one of the most assessed food technologies, especially for climacteric and exotic products. However, a fine granularity control of the process and consequently of the quality of the goods is still missing, so the management of the ripening rooms is mainly based on qualitative estimations only. Following the modern paradigms of Industry 4.0, this contribution proposes a non-destructive RFID-based system for the automatic evaluation of the live ripening of avocados. The system, coupled with a properly trained automatic classification algorithm based on Support Vector Machines (SVMs), can discriminate the stage of ripening with an accuracy greater than 85%.
    Intrinsically Motivated Self-supervised Learning in Reinforcement Learning. (arXiv:2106.13970v2 [cs.LG] UPDATED)
    In vision-based reinforcement learning (RL) tasks, it is prevalent to assign auxiliary tasks with a surrogate self-supervised loss so as to obtain more semantic representations and improve sample efficiency. However, abundant information in self-supervised auxiliary tasks has been disregarded, since the representation learning part and the decision-making part are separated. To sufficiently utilize information in auxiliary tasks, we present a simple yet effective idea to employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). We formally show that the self-supervised loss can be decomposed as exploration for novel states and robustness improvement from nuisance elimination. IM-SSR can be effortlessly plugged into any reinforcement learning with self-supervised auxiliary objectives with nearly no additional cost. Combined with IM-SSR, the previous underlying algorithms achieve salient improvements on both sample efficiency and generalization in various vision-based robotics tasks from the DeepMind Control Suite, especially when the reward signal is sparse.
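The core recipe is compact: treat the auxiliary self-supervised loss itself as an intrinsic reward added to the extrinsic one; a minimal sketch (beta is a hypothetical scaling coefficient, not from the paper):

```python
import torch

def im_ssr_reward(env_reward: torch.Tensor,
                  ssl_loss: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Augment the extrinsic reward with the self-supervised auxiliary loss.

    A high auxiliary loss suggests a novel or poorly-modelled state, so it
    is treated as an exploration bonus.
    """
    return env_reward + beta * ssl_loss.detach()
```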
    PerfectDou: Dominating DouDizhu with Perfect Information Distillation. (arXiv:2203.16406v4 [cs.AI] UPDATED)
    As a challenging multi-player card game, DouDizhu has recently drawn much attention for analyzing competition and collaboration in imperfect-information games. In this paper, we propose PerfectDou, a state-of-the-art DouDizhu AI system that dominates the game, in an actor-critic framework with a proposed technique named perfect information distillation. In detail, we adopt a perfect-training-imperfect-execution framework that allows the agents to utilize the global information to guide the training of the policies as if it is a perfect information game and the trained policies can be used to play the imperfect information game during the actual gameplay. To this end, we characterize card and game features for DouDizhu to represent the perfect and imperfect information. To train our system, we adopt proximal policy optimization with generalized advantage estimation in a parallel training paradigm. In experiments we show how and why PerfectDou beats all existing AI programs, and achieves state-of-the-art performance.
    Accelerating Part-Scale Simulation in Liquid Metal Jet Additive Manufacturing via Operator Learning. (arXiv:2202.03665v1 [physics.flu-dyn] CROSS LISTED)
    Predicting part quality for additive manufacturing (AM) processes requires high-fidelity numerical simulation of partial differential equations (PDEs) governing process multiphysics on a scale of minimum manufacturable features. This makes part-scale predictions computationally demanding, especially when they require many small-scale simulations. We consider drop-on-demand liquid metal jetting (LMJ) as an illustrative example of such computational complexity. A model describing droplet coalescence for LMJ may include coupled incompressible fluid flow, heat transfer, and phase change equations. Numerically solving these equations becomes prohibitively expensive when simulating the build process for a full part consisting of thousands to millions of droplets. Reduced-order models (ROMs) based on neural networks (NN) or k-nearest neighbor (kNN) algorithms have been built to replace the original physics-based solver and are computationally tractable for part-level simulations. However, their quick inference capabilities often come at the expense of accuracy, robustness, and generalizability. We apply an operator learning (OL) approach to learn a mapping between initial and final states of the droplet coalescence process for enabling rapid and accurate part-scale build simulation. Preliminary results suggest that OL requires order-of-magnitude fewer data points than a kNN approach and is generalizable beyond the training set while achieving similar prediction error.
    Emotion Intensity and its Control for Emotional Voice Conversion. (arXiv:2201.03967v2 [cs.SD] UPDATED)
    Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.
    Uncertify: Attacks Against Neural Network Certification. (arXiv:2108.11299v3 [cs.LG] UPDATED)
    A key concept towards reliable, robust, and safe AI systems is the idea to implement fallback strategies when predictions of the AI cannot be trusted. Certifiers for neural networks have made great progress towards provable robustness guarantees against evasion attacks using adversarial examples. These methods guarantee for some predictions that a certain class of manipulations or attacks could not have changed the outcome. For the remaining predictions without guarantees, the method abstains from making a prediction and a fallback strategy needs to be invoked, which is typically more costly, less accurate, or even involves a human operator. While this is a key concept towards safe and secure AI, we show for the first time that this strategy comes with its own security risks, as such fallback strategies can be deliberately triggered by an adversary. In particular, we conduct the first systematic analysis of training-time attacks against certifiers in practical application pipelines, identifying new threat vectors that can be exploited to degrade the overall system. Using these insights, we design two backdoor attacks against network certifiers, which can drastically reduce certified robustness. For example, adding 1% poisoned data during training is sufficient to reduce certified robustness by up to 95 percentage points, effectively rendering the certifier useless. We analyze how such novel attacks can compromise the overall system's integrity or availability. Our extensive experiments across multiple datasets, model architectures, and certifiers demonstrate the wide applicability of these attacks. A first investigation into potential defenses shows that current approaches are insufficient to mitigate the issue, highlighting the need for new, more specific solutions.
    MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound. (arXiv:2201.02639v4 [cs.CV] UPDATED)
    As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
    Neurochaos Feature Transformation and Classification for Imbalanced Learning. (arXiv:2205.06742v1 [cs.NE])
    Learning from limited and imbalanced data is a challenging problem in the Artificial Intelligence community. Real-time scenarios demand decision-making from rare events wherein the data are typically imbalanced. These situations commonly arise in medical applications, cybersecurity, catastrophic predictions etc. This motivates development of learning algorithms capable of learning from imbalanced data. Human brain effortlessly learns from imbalanced data. Inspired by the chaotic neuronal firing in the human brain, a novel learning algorithm namely \emph{Neurochaos Learning} (NL) was recently proposed. NL is categorized in three blocks: Feature Transformation, Neurochaos Feature Extraction (CFX), and Classification. In this work, the efficacy of neurochaos feature transformation and extraction for classification in imbalanced learning is studied. We propose a unique combination of neurochaos based feature transformation and extraction with traditional ML algorithms. The explored datasets in this study revolve around medical diagnosis, banknote fraud detection, environmental applications and spoken-digit classification. In this study, experiments are performed in both high and low training sample regime. In the former, five out of nine datasets have shown a performance boost in terms of macro F1-score after using CFX features. The highest performance boost obtained is $\textbf{25.97\%}$ for {\it Statlog (Heart)} dataset using CFX+Decision Tree. In the low training sample regime (from just one to nine training samples per class), the highest performance boost of $\textbf{144.38\%}$ is obtained for {\it Haberman's Survival} dataset using CFX+Random Forest. NL offers enormous flexibility of combining CFX with any ML classifier to boost its performance, especially for learning tasks with limited and imbalanced data.
    Scaling the weight parameters in Markov logic networks and relational logistic regression models. (arXiv:2103.15140v2 [cs.AI] UPDATED)
    We consider Markov logic networks and relational logistic regression as two fundamental representation formalisms in statistical relational artificial intelligence that use weighted formulas in their specification. However, Markov logic networks are based on undirected graphs, while relational logistic regression is based on directed acyclic graphs. We show that when scaling the weight parameters with the domain size, the asymptotic behaviour of a relational logistic regression model is transparently controlled by the parameters, and we supply an algorithm to compute asymptotic probabilities. We also show using two examples that this is not true for Markov logic networks. We also discuss using several examples, mainly from the literature, how the application context can help the user to decide when such scaling is appropriate and when using the raw unscaled parameters might be preferable. We highlight random sampling as a particularly promising area of application for scaled models and expound possible avenues for further research.
    No Weighted-Regret Learning in Adversarial Bandits with Delays. (arXiv:2103.04550v2 [cs.LG] UPDATED)
    Consider a scenario where a player chooses an action in each round $t$ out of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have "no weighted-regret" in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.
    secml: A Python Library for Secure and Explainable Machine Learning. (arXiv:1912.10013v2 [cs.LG] UPDATED)
    We present \texttt{secml}, an open-source Python library for secure and explainable machine learning. It implements the most popular attacks against machine learning, including test-time evasion attacks to generate adversarial examples against deep neural networks and training-time poisoning attacks against support vector machines and many other algorithms. These attacks enable evaluating the security of learning algorithms and the corresponding defenses under both white-box and black-box threat models. To this end, \texttt{secml} provides built-in functions to compute security evaluation curves, showing how quickly classification performance decreases against increasing adversarial perturbations of the input data. \texttt{secml} also includes explainability methods to help understand why adversarial attacks succeed against a given model, by visualizing the most influential features and training prototypes contributing to each decision. It is distributed under the Apache License 2.0 and hosted at \url{https://github.com/pralab/secml}.
    On the validity of pre-trained transformers for natural language processing in the software engineering domain. (arXiv:2109.04738v2 [cs.SE] UPDATED)
Transformers are the current state-of-the-art of natural language processing in many domains and are gaining traction within software engineering research as well. Such models are pre-trained on large amounts of data, usually from the general domain. However, we only have a limited understanding regarding the validity of transformers within the software engineering domain, i.e., how good such models are at understanding words and sentences within a software engineering context and how this improves the state-of-the-art. Within this article, we shed light on this complex, but crucial issue. We compare BERT transformer models trained with software engineering data with transformers based on general domain data in multiple dimensions: their vocabulary, their ability to understand which words are missing, and their performance in classification tasks. Our results show that for tasks that require understanding of the software engineering context, pre-training with software engineering data is valuable, while general domain models are sufficient for general language understanding, also within the software engineering domain.
    Research on the correlation between text emotion mining and stock market based on deep learning. (arXiv:2205.06675v1 [q-fin.ST])
This paper discusses how to crawl data from financial forums such as stock bars and conduct sentiment analysis combined with a deep learning model. The paper uses the BERT model to train on a financial corpus and predict the Shenzhen stock index. Through a comparative study of the maximal information coefficient (MIC), it is found that the emotional features obtained by applying the BERT model to the financial corpus are reflected in the fluctuations of the stock market, which helps to effectively improve prediction accuracy. At the same time, this paper combines deep learning with financial texts to further explore the impact mechanism of investor sentiment on the stock market, which will help national regulatory authorities and policy departments formulate more reasonable policies and guidelines for maintaining the stability of the stock market.
    Artificial Intelligence-Assisted Optimization and Multiphase Analysis of Polygon PEM Fuel Cells. (arXiv:2205.06768v1 [cs.NE])
This article presents new PEM fuel cell models with hexagonal and pentagonal designs. After observing cell performance improvement in these models, we optimized them. Inlet pressure and temperature were used as input parameters, and consumption and output power were the target parameters of the multi-objective optimization algorithm. Then we used artificial intelligence techniques, including deep neural networks and polynomial regression, to model the data. Next, we employed the RSM (Response Surface Method) to derive the target functions. Furthermore, we applied the NSGA-II multi-objective genetic algorithm to optimize the targets. Compared to the base model (Cubic), the optimized Pentagonal and Hexagonal models increase the output current density by 21.819% and 39.931% on average, respectively.
    Principal-Agent Hypothesis Testing. (arXiv:2205.06812v1 [cs.GT])
    Consider the relationship between the FDA (the principal) and a pharmaceutical company (the agent). The pharmaceutical company wishes to sell a product to make a profit, and the FDA wishes to ensure that only efficacious drugs are released to the public. The efficacy of the drug is not known to the FDA, so the pharmaceutical company must run a costly trial to prove efficacy to the FDA. Critically, the statistical protocol used to establish efficacy affects the behavior of a strategic, self-interested pharmaceutical company; a lower standard of statistical evidence incentivizes the pharmaceutical company to run more trials for drugs that are less likely to be effective, since the drug may pass the trial by chance, resulting in large profits. The interaction between the statistical protocol and the incentives of the pharmaceutical company is crucial to understanding this system and designing protocols with high social utility. In this work, we discuss how the principal and agent can enter into a contract with payoffs based on statistical evidence. When there is stronger evidence for the quality of the product, the principal allows the agent to make a larger profit. We show how to design contracts that are robust to an agent's strategic actions, and derive the optimal contract in the presence of strategic behavior.
    Kronecker Decomposition for Knowledge Graph Embeddings. (arXiv:2205.06560v1 [cs.LG])
Knowledge graph embedding research has mainly focused on learning continuous representations of entities and relations tailored towards the link prediction problem. Recent results indicate an ever-increasing predictive ability of current approaches on benchmark datasets. However, this effectiveness often comes at the cost of over-parameterization and increased computational complexity. The former induces extensive hyperparameter optimization to mitigate malicious overfitting. The latter magnifies the importance of winning the hardware lottery. Here, we investigate a remedy for the first problem. We propose a technique based on Kronecker decomposition to reduce the number of parameters in a knowledge graph embedding model, while retaining its expressiveness. Through Kronecker decomposition, large embedding matrices are split into smaller embedding matrices during the training process. Hence, embeddings of knowledge graphs are not plainly retrieved but reconstructed on the fly. The decomposition ensures that elementwise interactions between three embedding vectors are extended with interactions within each embedding vector. This implicitly reduces redundancy in embedding vectors and encourages feature reuse. To quantify the impact of applying Kronecker decomposition on embedding matrices, we conduct a series of experiments on benchmark datasets. Our experiments suggest that applying Kronecker decomposition on embedding matrices leads to an improved parameter efficiency on all benchmark datasets. Moreover, empirical evidence suggests that reconstructed embeddings entail robustness against noise in the input knowledge graph. To foster reproducible research, we provide an open-source implementation of our approach, including training and evaluation scripts as well as pre-trained models in our knowledge graph embedding framework (https://github.com/dice-group/dice-embeddings).
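The parameter saving comes from storing two small factors and reconstructing the big embedding matrix on the fly as their Kronecker product; a minimal sketch (not the paper's implementation):

```python
import torch
import torch.nn as nn

class KroneckerEmbedding(nn.Module):
    """Represents an (n1*n2) x (d1*d2) embedding matrix as kron(A, B)."""
    def __init__(self, n1, d1, n2, d2):
        super().__init__()
        self.A = nn.Parameter(torch.randn(n1, d1) * 0.01)  # small factor
        self.B = nn.Parameter(torch.randn(n2, d2) * 0.01)  # small factor

    def forward(self, idx):
        # built on the fly; a real implementation would reconstruct only
        # the requested rows instead of the full matrix
        W = torch.kron(self.A, self.B)
        return W[idx]

# 10000 x 128 embeddings from (100 x 16) and (100 x 8) factors:
emb = KroneckerEmbedding(100, 16, 100, 8)
vecs = emb(torch.tensor([0, 42, 9999]))  # -> (3, 128)
```

Here 2,400 factor parameters stand in for the 1.28M parameters of the full 10,000 x 128 matrix.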
    Graph Attention Networks for Channel Estimation in RIS-assisted Satellite IoT Communications. (arXiv:2104.00735v3 [cs.NI] UPDATED)
    Direct-to-satellite (DtS) communication has gained importance recently to support globally connected Internet of things (IoT) networks. However, relatively long distances of densely deployed satellite networks around the Earth cause a high path loss. In addition, since high complexity operations such as beamforming, tracking and equalization have to be performed in IoT devices partially, both the hardware complexity and the need for high-capacity batteries of IoT devices increase. The reconfigurable intelligent surfaces (RISs) have the potential to increase the energy-efficiency and to perform complex signal processing over the transmission environment instead of IoT devices. But, RISs need the information of the cascaded channel in order to change the phase of the incident signal. This study evaluates the pilot signal as a graph and incorporates this information into the graph attention networks (GATs) to track the phase relation through pilot signaling. Proposed GAT based channel estimation method examines the performance of the DtS IoT networks for different RIS configurations to solve the challenging channel estimation problem. It is shown that the proposed GAT both demonstrates a higher performance with increased robustness under changing conditions and has lower computational complexity compared to conventional deep learning methods. Moreover, bit error rate performance is investigated for RIS designs with discrete and non-uniform phase shifts under channel estimation based on the proposed method. One of the findings in this study is that the channel models of the operating environment and the performance of the channel estimation method must be considered during RIS design to exploit performance improvement as far as possible.
    On the Importance of Architecture and Feature Selection in Differentially Private Machine Learning. (arXiv:2205.06720v1 [cs.CR])
    We study a pitfall in the typical workflow for differentially private machine learning. The use of differentially private learning algorithms in a "drop-in" fashion -- without accounting for the impact of differential privacy (DP) noise when choosing what feature engineering operations to use, what features to select, or what neural network architecture to use -- yields overly complex and poorly performing models. In other words, by anticipating the impact of DP noise, a simpler and more accurate alternative model could have been trained for the same privacy guarantee. We systematically study this phenomenon through theory and experiments. On the theory front, we provide an explanatory framework and prove that the phenomenon arises naturally from the addition of noise to satisfy differential privacy. On the experimental front, we demonstrate how the phenomenon manifests in practice using various datasets, types of models, tasks, and neural network architectures. We also analyze the factors that contribute to the problem and distill our experimental insights into concrete takeaways that practitioners can follow when training models with differential privacy. Finally, we propose privacy-aware algorithms for feature selection and neural network architecture search. We analyze their differential privacy properties and evaluate them empirically.
    Explaining by Removing: A Unified Framework for Model Explanation. (arXiv:2011.14878v2 [cs.LG] UPDATED)
    Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We describe a new unified class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature's influence. These methods vary in several respects, so we develop a framework that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 26 existing methods, including several of the most widely used approaches: SHAP, LIME, Meaningful Perturbations, and permutation tests. This newly understood class of explanation methods has rich connections that we examine using tools that have been largely overlooked by the explainability literature. To anchor removal-based explanations in cognitive psychology, we show that feature removal is a simple application of subtractive counterfactual reasoning. Ideas from cooperative game theory shed light on the relationships and trade-offs among different methods, and we derive conditions under which all removal-based explanations have information-theoretic interpretations. Through this analysis, we develop a unified framework that helps practitioners better understand model explanation tools, and that offers a strong theoretical foundation upon which future explainability research can build.
    Variational Hyper-Encoding Networks. (arXiv:2005.08482v2 [stat.ML] UPDATED)
We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters $\theta$ are drawn from a distribution $p(\theta)$ which is modeled by a hyper-level VAE. We propose a variational inference using Gaussian mixture models to implicitly encode the parameters $\theta$ into a low dimensional Gaussian distribution. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution $q(\theta)$. HyperVAE can encode the parameters $\theta$ in full in contrast to common hyper-network practices, which generate only the scale and bias vectors as target-network parameters. Thus HyperVAE preserves much more information about the model for each task in the latent space. We discuss HyperVAE using the minimum description length (MDL) principle and show that it helps HyperVAE to generalize. We evaluate HyperVAE in density estimation tasks, outlier detection and discovery of novel design classes, demonstrating its efficacy.
    Distributed Transmission Control for Wireless Networks using Multi-Agent Reinforcement Learning. (arXiv:2205.06800v1 [cs.LG])
We examine the problem of transmission control, i.e., when to transmit, in distributed wireless communications networks through the lens of multi-agent reinforcement learning. Most other works using reinforcement learning to control or schedule transmissions use some centralized control mechanism, whereas our approach is fully distributed. Each transmitter node is an independent reinforcement learning agent and does not have direct knowledge of the actions taken by other agents. We consider the case where only a subset of agents can successfully transmit at a time, so each agent must learn to act cooperatively with other agents. An agent may decide to transmit a certain number of steps into the future, but this decision is not communicated to the other agents, so it is the task of the individual agents to attempt to transmit at appropriate times. We achieve this collaborative behavior through studying the effects of different action spaces. We are agnostic to the physical layer, which makes our approach applicable to many types of networks. We submit that approaches similar to ours may be useful in other domains that use multi-agent reinforcement learning with independent agents.
    Goal-Guided Neural Cellular Automata: Learning to Control Self-Organising Systems. (arXiv:2205.06806v1 [cs.NE])
Inspired by cellular growth and self-organization, Neural Cellular Automata (NCAs) have been capable of "growing" artificial cells into images, 3D structures, and even functional machines. NCAs are flexible and robust computational systems but -- similarly to many other self-organizing systems -- inherently uncontrollable during and after their growth process. We present an approach to control this type of system, called Goal-Guided Neural Cellular Automata (GoalNCA), which leverages goal encodings to control cell behavior dynamically at every step of cellular growth. This approach enables the NCA to continually change behavior, and in some cases, generalize its behavior to unseen scenarios. We also demonstrate the robustness of the NCA with its ability to preserve task performance, even when only a portion of cells receive goal information.
    Provably Safe Reinforcement Learning: A Theoretical and Experimental Comparison. (arXiv:2205.06750v1 [cs.LG])
    Ensuring safety of reinforcement learning (RL) algorithms is crucial for many real-world tasks. However, vanilla RL does not guarantee safety for an agent. In recent years, several methods have been proposed to provide safety guarantees for RL. To the best of our knowledge, there is no comprehensive comparison of these provably safe RL methods. We therefore introduce a categorization for existing provably safe RL methods, and present the theoretical foundations for both continuous and discrete action spaces. Additionally, we evaluate provably safe RL on an inverted pendulum. In the experiments, it is shown that indeed only provably safe RL methods guarantee safety.
    Precise Change Point Detection using Spectral Drift Detection. (arXiv:2205.06507v1 [cs.LG])
    The notion of concept drift refers to the phenomenon that the data generating distribution changes over time; as a consequence machine learning models may become inaccurate and need adjustment. In this paper we consider the problem of detecting those change points in unsupervised learning. Many unsupervised approaches rely on the discrepancy between the sample distributions of two time windows. This procedure is noisy for small windows, hence prone to induce false positives and not able to deal with more than one drift event in a window. In this paper we rely on structural properties of drift induced signals, which use spectral properties of kernel embedding of distributions. Based thereon we derive a new unsupervised drift detection algorithm, investigate its mathematical properties, and demonstrate its usefulness in several experiments.
    Transformation-Interaction-Rational Representation for Symbolic Regression. (arXiv:2205.06807v1 [cs.NE])
    Symbolic Regression searches for a function form that approximates a dataset often using Genetic Programming. Since there is usually no restriction to what form the function can have, Genetic Programming may return a hard to understand model due to non-linear function chaining or long expressions. A novel representation called Interaction-Transformation was recently proposed to alleviate this problem. In this representation, the function form is restricted to an affine combination of terms generated as the application of a single univariate function to the interaction of selected variables. This representation obtained competing solutions on standard benchmarks. Despite the initial success, a broader set of benchmarking functions revealed the limitations of the constrained representation. In this paper we propose an extension to this representation, called Transformation-Interaction-Rational representation that defines a new function form as the rational of two Interaction-Transformation functions. Additionally, the target variable can also be transformed with an univariate function. The main goal is to improve the approximation power while still constraining the overall complexity of the expression. We tested this representation with a standard Genetic Programming with crossover and mutation. The results show a great improvement when compared to its predecessor and a state-of-the-art performance for a large benchmark.
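For orientation, the Interaction-Transformation form mentioned above constrains expressions to an affine combination of transformed variable interactions, and TIR wraps two such expressions in a rational; a sketch in notation (reconstructed from the abstract's description, so the exact form in the paper may differ):

$$\mathrm{IT}(x) = w_0 + \sum_{j=1}^{m} w_j \, g_j\!\left(\prod_{i=1}^{d} x_i^{k_{ij}}\right), \qquad \mathrm{TIR}: \; h(y) \approx \frac{\mathrm{IT}_1(x)}{1 + \mathrm{IT}_2(x)},$$

where each $g_j$ is a single univariate function, the $k_{ij}$ are interaction exponents, and $h$ is a univariate transformation of the target variable.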
    Interlock-Free Multi-Aspect Rationalization for Text Classification. (arXiv:2205.06756v1 [cs.CL])
    Explanation is important for text classification tasks. One prevalent type of explanation is rationales, which are text snippets of input text that suffice to yield the prediction and are meaningful to humans. A lot of research on rationalization has been based on the selective rationalization framework, which has recently been shown to be problematic due to the interlocking dynamics. In this paper, we show that we address the interlocking problem in the multi-aspect setting, where we aim to generate multiple rationales for multiple outputs. More specifically, we propose a multi-stage training method incorporating an additional self-supervised contrastive loss that helps to generate more semantically diverse rationales. Empirical results on the beer review dataset show that our method improves significantly the rationalization performance.
    Embodied-Symbolic Contrastive Graph Self-Supervised Learning for Molecular Graphs. (arXiv:2205.06783v1 [cs.LG])
    Dual embodied-symbolic concept representations are the foundation for deep learning and symbolic AI integration. We discuss the use of dual embodied-symbolic concept representations for molecular graph representation learning, specifically with exemplar-based contrastive self-supervised learning (SSL). The embodied representations are learned from molecular graphs, and the symbolic representations are learned from the corresponding Chemical knowledge graph (KG). We use the Chemical KG to enhance molecular graphs with symbolic (semantic) knowledge and generate their augmented molecular graphs. We treat a molecular graph and its semantically augmented molecular graph as exemplars of the same semantic class, and use the pairs as positive pairs in exemplar-based contrastive SSL.
    Self-Sampling for Neural Point Cloud Consolidation. (arXiv:2008.06471v3 [cs.GR] UPDATED)
    We introduce a novel technique for neural point cloud consolidation which learns from only the input point cloud. Unlike other point upsampling methods which analyze shapes via local patches, in this work, we learn from global subsets. We repeatedly self-sample the input point cloud with global subsets that are used to train a deep neural network. Specifically, we define source and target subsets according to the desired consolidation criteria (e.g., generating sharp points or points in sparse regions). The network learns a mapping from source to target subsets, and implicitly learns to consolidate the point cloud. During inference, the network is fed with random subsets of points from the input, which it displaces to synthesize a consolidated point set. We leverage the inductive bias of neural networks to eliminate noise and outliers, a notoriously difficult problem in point cloud consolidation. The shared weights of the network are optimized over the entire shape, learning non-local statistics and exploiting the recurrence of local-scale geometries. Specifically, the network encodes the distribution of the underlying shape surface within a fixed set of local kernels, which results in the best explanation of the underlying shape surface. We demonstrate the ability to consolidate point sets from a variety of shapes, while eliminating outliers and noise.
    Exploring the structure-property relations of thin-walled, 2D extruded lattices using neural networks. (arXiv:2205.06761v1 [cs.LG])
    This paper investigates the structure-property relations of thin-walled lattices under dynamic longitudinal compression, characterized by their cross-sections and heights. These relations elucidate the interactions of different geometric features of a design on mechanical response, including energy absorption. We proposed a combinatorial, key-based design system to generate different lattice designs and used the finite element method to simulate their response with the Johnson-Cook material model. Using an autoencoder, we encoded the cross-sectional images of the lattices into latent design feature vectors, which were supplied to the neural network model to generate predictions. The trained models can accurately predict lattice energy absorption curves in the key-based design system and can be extended to new designs outside of the system via transfer learning.
    Emergent Bartering Behaviour in Multi-Agent Reinforcement Learning. (arXiv:2205.06760v1 [cs.AI])
    Advances in artificial intelligence often stem from the development of new environments that abstract real-world situations into a form where research can be done conveniently. This paper contributes such an environment based on ideas inspired by elementary Microeconomics. Agents learn to produce resources in a spatially complex world, trade them with one another, and consume those that they prefer. We show that the emergent production, consumption, and pricing behaviors respond to environmental conditions in the directions predicted by supply and demand shifts in Microeconomics. We also demonstrate settings where the agents' emergent prices for goods vary over space, reflecting the local abundance of goods. After the price disparities emerge, some agents then discover a niche of transporting goods between regions with different prevailing prices -- a profitable strategy because they can buy goods where they are cheap and sell them where they are expensive. Finally, in a series of ablation experiments, we investigate how choices in the environmental rewards, bartering actions, agent architecture, and ability to consume tradable goods can either aid or inhibit the emergence of this economic behavior. This work is part of the environment development branch of a research program that aims to build human-like artificial general intelligence through multi-agent interactions in simulated societies. By exploring which environment features are needed for the basic phenomena of elementary microeconomics to emerge automatically from learning, we arrive at an environment that differs from those studied in prior multi-agent reinforcement learning work along several dimensions. For example, the model incorporates heterogeneous tastes and physical abilities, and agents negotiate with one another as a grounded form of communication.
    Imaging Conductivity from Current Density Magnitude using Neural Networks. (arXiv:2204.02441v3 [math.NA] UPDATED)
    Conductivity imaging represents one of the most important tasks in medical imaging. In this work we develop a neural network based reconstruction technique for imaging the conductivity from the magnitude of the internal current density. It is achieved by formulating the problem as a relaxed weighted least-gradient problem, and then approximating its minimizer by standard fully connected feedforward neural networks. We derive bounds on two components of the generalization error, i.e., approximation error and statistical error, explicitly in terms of properties of the neural networks (e.g., depth, total number of parameters, and the bound of the network parameters). We illustrate the performance and distinct features of the approach on several numerical experiments. Numerically, it is observed that the approach enjoys remarkable robustness with respect to the presence of data noise.
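    For reference, the weighted least-gradient formulation commonly used in this line of work (our hedged rendering; the paper's relaxed variant may differ in its details) recovers the voltage $u$ from the measured current density magnitude $a = |J|$ by solving
        $\min_{u:\, u|_{\partial\Omega} = f} \int_{\Omega} a\,|\nabla u|\,\mathrm{d}x$,
    after which the conductivity follows as $\sigma = a/|\nabla u|$.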
    Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes. (arXiv:2205.06621v1 [cs.CL])
    To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it comes to analysis of bias, previous work has focused predominantly on race. In our work, we further investigate bias in hate speech datasets along racial, gender and intersectional axes. We identify strong bias against African American English (AAE), masculine, and AAE+Masculine tweets, which are annotated as hateful and offensive disproportionately more often than tweets from other demographics. We provide evidence that BERT-based models propagate this bias and show that balancing the training data for these protected attributes can lead to models that are fairer with regard to gender, but not race.
    Learning Keypoints from Synthetic Data for Robotic Cloth Folding. (arXiv:2205.06714v1 [cs.RO])
    Robotic cloth manipulation is challenging due to its deformability, which makes determining its full state infeasible. However, for cloth folding, it suffices to know the position of a few semantic keypoints. Convolutional neural networks (CNN) can be used to detect these keypoints, but require large amounts of annotated data, which is expensive to collect. To overcome this, we propose to learn these keypoint detectors purely from synthetic data, enabling low-cost data collection. In this paper, we procedurally generate images of towels and use them to train a CNN. We evaluate the performance of this detector for folding towels on a unimanual robot setup and find that the grasp and fold success rates are 77% and 53%, respectively. We conclude that learning keypoint detectors from synthetic data for cloth folding and related tasks is a promising research direction, discuss some failures and relate them to future work. A video of the system, as well as the codebase, more details on the CNN architecture and the training setup can be found at https://github.com/tlpss/workshop-icra-2022-cloth-keypoints.git.
    Federated Learning Under Intermittent Client Availability and Time-Varying Communication Constraints. (arXiv:2205.06730v1 [cs.LG])
    Federated learning systems facilitate training of global models in settings where potentially heterogeneous data is distributed across a large number of clients. Such systems operate in settings with intermittent client availability and/or time-varying communication constraints. As a result, the global models trained by federated learning systems may be biased towards clients with higher availability. We propose F3AST, an unbiased algorithm that dynamically learns an availability-dependent client selection strategy which asymptotically minimizes the impact of client-sampling variance on the global model convergence, enhancing the performance of federated learning. The proposed algorithm is tested in a variety of settings for intermittently available clients under communication constraints, and its efficacy is demonstrated on synthetic data and realistic federated benchmarking experiments using the CIFAR100 and Shakespeare datasets. We show up to 186% and 8% accuracy improvements over FedAvg, and 8% and 7% over FedAdam on CIFAR100 and Shakespeare, respectively.
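    The following is an illustrative sketch (not the paper's F3AST algorithm) of the general idea behind unbiased, availability-dependent client selection: sample among the clients that are currently online, then de-bias the aggregate with inverse selection probabilities so that rarely available clients are not under-represented in the global model.

        # Illustrative availability-aware selection with inverse-probability
        # de-biasing; select_prob holds each client's long-run chance of being
        # available and chosen (assumed known or estimated elsewhere).
        import numpy as np

        def select_and_weight(available, select_prob, k):
            """available: indices of clients online this round."""
            chosen = np.random.choice(available, size=min(k, len(available)), replace=False)
            weights = 1.0 / select_prob[chosen]        # inverse-probability weights
            return chosen, weights / weights.sum()     # normalized aggregation weights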
    EyeDAS: Securing Perception of Autonomous Cars Against the Stereoblindness Syndrome. (arXiv:2205.06765v1 [cs.LG])
    The ability to detect whether an object is a 2D or 3D object is extremely important in autonomous driving, since a detection error can have life-threatening consequences, endangering the safety of the driver, passengers, pedestrians, and others on the road. Methods proposed to distinguish between 2D and 3D objects (e.g., liveness detection methods) are not suitable for autonomous driving, because they are object dependent or do not consider the constraints associated with autonomous driving (e.g., the need for real-time decision-making while the vehicle is moving). In this paper, we present EyeDAS, a novel few-shot learning-based method aimed at securing an object detector (OD) against the threat posed by the stereoblindness syndrome (i.e., the inability to distinguish between 2D and 3D objects). We evaluate EyeDAS's real-time performance using 2,000 objects extracted from seven YouTube video recordings of street views taken by a dash cam from the driver's seat perspective. When applying EyeDAS to seven state-of-the-art ODs as a countermeasure, EyeDAS was able to reduce the 2D misclassification rate from 71.42-100% to 2.4% with a 3D misclassification rate of 0% (TPR of 1.0). We also show that EyeDAS outperforms the baseline method and achieves an AUC of over 0.999 and a TPR of 1.0 with an FPR of 0.024.
    Univariate and Multivariate LSTM Model for Short-Term Stock Market Prediction. (arXiv:2205.06673v1 [q-fin.ST])
    Designing robust and accurate prediction models has been a viable research area for a long time. While proponents of well-functioning markets believe that it is difficult to accurately predict market prices, many scholars disagree. Robust and accurate prediction systems are helpful not only to businesses but also to individuals in making their financial investments. This paper presents an LSTM model with two different input approaches for predicting the short-term stock prices of two Indian companies, Reliance Industries and Infosys Ltd. Ten years of historical data (2012-2021) are taken from the Yahoo Finance website to carry out the analysis of the proposed approaches. In the first approach, closing prices of the two selected companies are directly applied to a univariate LSTM model. In the second approach, technical indicator values are calculated from the closing prices and then collectively applied to a multivariate LSTM model. Short-term market behaviour for upcoming days is evaluated. Experimental outcomes reveal that the first approach is useful for determining the future trend, whereas the multivariate LSTM model with technical indicators proves useful for accurately predicting future price behaviour.
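    A minimal sketch of the two input schemes in PyTorch; the window size, hidden size, and the particular technical indicators are illustrative assumptions rather than the paper's exact configuration.

        # One LSTM regressor, two input schemes: n_features=1 for the
        # univariate case (closing price only), n_features>1 for closing
        # price plus technical-indicator series.
        import torch
        import torch.nn as nn

        class PriceLSTM(nn.Module):
            def __init__(self, n_features):
                super().__init__()
                self.lstm = nn.LSTM(n_features, 64, batch_first=True)
                self.head = nn.Linear(64, 1)

            def forward(self, x):                      # x: (batch, window, n_features)
                out, _ = self.lstm(x)
                return self.head(out[:, -1])           # next-day closing price

        univariate = PriceLSTM(n_features=1)
        multivariate = PriceLSTM(n_features=5)         # e.g. close + four indicators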
    Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets. (arXiv:2205.06595v1 [stat.ML])
    Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
    A Vision Inspired Neural Network for Unsupervised Anomaly Detection in Unordered Data. (arXiv:2205.06716v1 [cs.LG])
    A fundamental problem in the field of unsupervised machine learning is the detection of anomalies corresponding to rare and unusual observations of interest, whether for their rejection, accommodation or further investigation. Anomalies are intuitively understood to be something unusual or inconsistent, whose occurrence sparks immediate attention. More formally, anomalies are those observations (under appropriate random variable modelling) whose expectation of occurrence with respect to a grouping of prior interest is less than one; such a definition and understanding has been used to develop the parameter-free perception anomaly detection algorithm. The present work seeks to establish important and practical connections between the approach used by the perception algorithm and prior decades of research in neurophysiology and computational neuroscience, particularly that of information processing in the retina and visual cortex. The algorithm is conceptualised as a neuron model which forms the kernel of an unsupervised neural network that learns to signal unexpected observations as anomalies. Both the network and neuron display properties observed in biological processes, including immediate intelligence, parallel processing, redundancy, global degradation, contrast invariance, parameter-free computation, dynamic thresholds and non-linear processing. A robust and accurate model for anomaly detection in univariate and multivariate data is built using this network as a concrete application.
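    A toy rendering of the stated definition, assuming a simple fitted Gaussian as the "appropriate random variable model": observations falling where the model expects fewer than one occurrence among the data are flagged. The paper's parameter-free perception algorithm itself is not reproduced here.

        # Flag points in regions where the fitted model expects fewer than
        # one occurrence among n observations (toy Gaussian stand-in).
        import numpy as np
        from scipy.stats import norm

        def flag_anomalies(x, n_bins=30):
            mu, sigma = x.mean(), x.std()
            counts, edges = np.histogram(x, bins=n_bins)
            expected = len(x) * np.diff(norm.cdf(edges, mu, sigma))
            rare = expected < 1.0                      # expectation of occurrence < 1
            bin_idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
            return rare[bin_idx]                       # boolean anomaly mask per point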
    Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction. (arXiv:2205.06672v1 [cs.CV])
    Classical multiple instance learning (MIL) methods are often based on the identical and independent distributed assumption between instances, hence neglecting the potentially rich contextual information beyond individual entities. On the other hand, Transformers with global self-attention modules have been proposed to model the interdependencies among all instances. However, in this paper we question: Is global relation modeling using self-attention necessary, or can we appropriately restrict self-attention calculations to local regimes in large-scale whole slide images (WSIs)? We propose a general-purpose local attention graph-based Transformer for MIL (LA-MIL), introducing an inductive bias by explicitly contextualizing instances in adaptive local regimes of arbitrary size. Additionally, an efficiently adapted loss function enables our approach to learn expressive WSI embeddings for the joint analysis of multiple biomarkers. We demonstrate that LA-MIL achieves state-of-the-art results in mutation prediction for gastrointestinal cancer, outperforming existing models on important biomarkers such as microsatellite instability for colorectal cancer. This suggests that local self-attention sufficiently models dependencies on par with global modules. Our implementation will be published.
    Multiple Domain Causal Networks. (arXiv:2205.06791v1 [stat.ML])
    Observational studies are regarded as economic alternatives to randomized trials, often used in their stead to investigate and determine treatment efficacy. Due to lack of sample size, observational studies commonly combine data from multiple sources or different sites/centers. Despite the benefits of an increased sample size, a naive combination of multicenter data may result in incongruities stemming from center-specific protocols for generating cohorts or reactions towards treatments distinct to a given center, among other things. These issues arise in a variety of other contexts, including capturing a treatment effect related to an individual's unique biological characteristics. Existing methods for estimating heterogeneous treatment effects have not adequately addressed the multicenter context, but rather treat it simply as a means to obtain sufficient sample size. Additionally, previous approaches to estimating treatment effects do not straightforwardly generalize to the multicenter design, especially when required to provide treatment insights for patients from a new, unobserved center. To address these shortcomings, we propose Multiple Domain Causal Networks (MDCN), an approach that simultaneously strengthens the information sharing between similar centers while addressing the selection bias in treatment assignment through learning of a new feature embedding. In empirical evaluations, MDCN is consistently more accurate when estimating the heterogeneous treatment effect in new centers compared to benchmarks that adjust solely based on treatment imbalance or general center differences. Finally, we justify our approach by providing theoretical analyses that demonstrate that MDCN improves on the generalization bound of the new, unobserved target center.
    The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation. (arXiv:2205.06618v1 [cs.CL])
    Vocabulary selection, or lexical shortlisting, is a well-known technique to improve latency of Neural Machine Translation models by constraining the set of allowed output words during inference. The chosen set is typically determined by separately trained alignment model parameters, independent of the source-sentence context at inference time. While vocabulary selection appears competitive with respect to automatic quality metrics in prior work, we show that it can fail to select the right set of output words, particularly for semantically non-compositional linguistic phenomena such as idiomatic expressions, leading to reduced translation quality as perceived by humans. Trading off latency for quality by increasing the size of the allowed set is often not an option in real-world scenarios. We propose a model of vocabulary selection, integrated into the neural translation model, that predicts the set of allowed output words from contextualized encoder representations. This restores translation quality of an unconstrained system, as measured by human evaluations on WMT newstest2020 and idiomatic expressions, at an inference latency competitive with alignment-based selection using aggressive thresholds, thereby removing the dependency on separately trained alignment models.
    Accelerometry-based classification of circulatory states during out-of-hospital cardiac arrest. (arXiv:2205.06540v1 [eess.SP])
    Objective: During cardiac arrest treatment, a reliable detection of spontaneous circulation, usually performed by manual pulse checks, is both vital for patient survival and practically challenging. Methods: We developed a machine learning algorithm to automatically predict the circulatory state during cardiac arrest treatment from 4-second-long snippets of accelerometry and electrocardiogram data from real-world defibrillator records. The algorithm was trained based on 917 cases from the German Resuscitation Registry, for which ground truth labels were created by a manual annotation of physicians. It uses a kernelized Support Vector Machine classifier based on 14 features, which partially reflect the correlation between accelerometry and electrocardiogram data. Results: On a test data set, the proposed algorithm exhibits an accuracy of 94.4 (93.6, 95.2)%, a sensitivity of 95.0 (93.9, 96.1)%, and a specificity of 93.9 (92.7, 95.1)%. Conclusion and significance: In application, the algorithm may be used to simplify retrospective annotation for quality management and, moreover, to support clinicians to assess circulatory state during cardiac arrest treatment.
    Heavy-Tail Phenomenon in Decentralized SGD. (arXiv:2205.06689v1 [stat.ML])
    Recent theoretical studies have shown that heavy tails can emerge in stochastic optimization due to 'multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To have a more explicit control on the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes. Then, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD, where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (step-size and network size) where DE-SGD can have lighter or heavier tails than disconnected SGD depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks.
    Improving Astronomical Time-series Classification via Data Augmentation with Generative Adversarial Networks. (arXiv:2205.06758v1 [astro-ph.IM])
    Due to the latest advances in technology, telescopes with significant sky coverage will produce millions of astronomical alerts per night that must be classified both rapidly and automatically. Currently, classification consists of supervised machine learning algorithms whose performance is limited by the number of existing annotations of astronomical objects and their highly imbalanced class distributions. In this work, we propose a data augmentation methodology based on Generative Adversarial Networks (GANs) to generate a variety of synthetic light curves from variable stars. Our novel contributions, consisting of a resampling technique and an evaluation metric, can assess the quality of generative models in unbalanced datasets and identify GAN-overfitting cases that the Fréchet Inception Distance does not reveal. We applied our proposed model to two datasets taken from the Catalina and Zwicky Transient Facility surveys. The classification accuracy of variable stars is improved significantly when training with synthetic data and testing with real data with respect to the case of using only real data.
    On the Existence of Simpler Machine Learning Models. (arXiv:1908.01755v4 [cs.LG] UPDATED)
    It is almost always easier to find an accurate-but-complex model than an accurate-yet-simple model. Finding optimal, sparse, accurate models of various forms (linear models with integer coefficients, decision sets, rule lists, decision trees) is generally NP-hard. We often do not know whether the search for a simpler model will be worthwhile, and thus we do not go to the trouble of searching for one. In this work, we ask an important practical question: can accurate-yet-simple models be proven to exist, or shown likely to exist, before explicitly searching for them? We hypothesize that there is an important reason that simple-yet-accurate models often do exist. This hypothesis is that the size of the Rashomon set is often large, where the Rashomon set is the set of almost-equally-accurate models from a function class. If the Rashomon set is large, it contains numerous accurate models, and perhaps at least one of them is the simple model we desire. In this work, we formally present the Rashomon ratio as a new gauge of simplicity for a learning problem, depending on a function class and a data set. The Rashomon ratio is the ratio of the volume of the set of accurate models to the volume of the hypothesis space, and it is different from standard complexity measures from statistical learning theory. Insight from studying the Rashomon ratio provides an easy way to check whether a simpler model might exist for a problem before finding it, namely whether several different machine learning methods achieve similar performance on the data. In that sense, the Rashomon ratio is a powerful tool for understanding why and when an accurate-yet-simple model might exist. If, as we hypothesize in this work, many real-world data sets admit large Rashomon sets, the implications are vast: it means that simple or interpretable models may often be used for high-stakes decisions without losing accuracy.
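    The "easy check" the abstract alludes to can be sketched in a few lines: train several distinct model classes and see whether their accuracies cluster tightly, which hints at a large Rashomon set. The dataset and models below are illustrative choices, not the paper's experiments.

        # If distinct model classes reach near-identical accuracy, the
        # Rashomon set is plausibly large and a simple model may suffice.
        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import cross_val_score
        from sklearn.linear_model import LogisticRegression
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.ensemble import RandomForestClassifier

        X, y = load_breast_cancer(return_X_y=True)
        models = {
            "logistic": LogisticRegression(max_iter=5000),
            "tree": DecisionTreeClassifier(max_depth=4),
            "forest": RandomForestClassifier(n_estimators=200),
        }
        scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
        print(scores)   # tightly clustered scores suggest a large Rashomon set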
    Tensor Decompositions for Hyperspectral Data Processing in Remote Sensing: A Comprehensive Review. (arXiv:2205.06407v1 [cs.CV])
    Owing to the rapid development of sensor technology, hyperspectral (HS) remote sensing (RS) imaging has provided a significant amount of spatial and spectral information for observing and analyzing the Earth's surface from data acquisition platforms such as aircraft, spacecraft, and satellites. The recent advancement, and even revolution, of the HS RS technique offers opportunities to realize the full potential of various applications, while confronting new challenges for efficiently processing and analyzing the enormous HS acquisition data. Because tensor decomposition preserves the inherent 3-D structure of HS data, it has attracted widespread attention and research in HS data processing tasks over the past decades. In this article, we aim at presenting a comprehensive overview of tensor decomposition, specifically contextualizing five broad topics in HS data processing: HS restoration, compressed sensing, anomaly detection, super-resolution, and spectral unmixing. For each topic, we elaborate on the remarkable achievements of tensor decomposition models for HS RS with a pivotal description of the existing methodologies and a representative exhibition of experimental results. The remaining challenges and follow-up research directions are then outlined and discussed from the perspective of real HS RS practice and of tensor decomposition merged with advanced priors and even with deep neural networks. This article summarizes different tensor decomposition-based HS data processing methods and categorizes them into classes ranging from simple adoptions to complex combinations with other priors, for algorithm beginners. We also expect this survey to provide new investigations and development trends for experienced researchers who understand tensor decomposition and HS RS to some extent.
    FastSTMF: Efficient tropical matrix factorization algorithm for sparse data. (arXiv:2205.06619v1 [cs.LG])
    Matrix factorization, one of the most popular methods in machine learning, has recently benefited from introducing non-linearity into prediction tasks via the tropical semiring. The non-linearity enables a better fit to extreme values and distributions, thus discovering high-variance patterns that differ from those found by standard linear algebra. However, the optimization process of various tropical matrix factorization methods is slow. In our work, we propose a new method, FastSTMF, based on Sparse Tropical Matrix Factorization (STMF), which introduces a novel strategy for updating factor matrices that results in efficient computational performance. We evaluated the efficiency of FastSTMF on synthetic and real gene expression data from the TCGA database, and the results show that FastSTMF outperforms STMF in both accuracy and running time. Compared to NMF, we show that FastSTMF performs better on some datasets and, unlike NMF, is not prone to overfitting. This work sets the basis for developing other matrix factorization techniques based on many other semirings using the newly proposed optimization process.
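    For readers unfamiliar with the tropical semiring, the non-linearity comes from replacing ordinary (+, x) with (max, +) in the matrix product. A minimal NumPy sketch follows; the factorization update rules of FastSTMF are not reproduced here.

        # Tropical (max-plus) matrix product: (U (x) V)[i, j] = max_k (U[i, k] + V[k, j]).
        import numpy as np

        def maxplus_matmul(U, V):
            return np.max(U[:, :, None] + V[None, :, :], axis=1)

        U = np.random.rand(4, 2)
        V = np.random.rand(2, 3)
        approx = maxplus_matmul(U, V)                  # rank-2 tropical approximation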
    Deep Reinforcement Learning for Computational Fluid Dynamics on HPC Systems. (arXiv:2205.06502v1 [cs.LG])
    Reinforcement learning (RL) is highly suitable for devising control strategies in the context of dynamical systems. A prominent instance of such a dynamical system is the system of equations governing fluid dynamics. Recent research results indicate that RL-augmented computational fluid dynamics (CFD) solvers can exceed the current state of the art, for example in the field of turbulence modeling. However, while in supervised learning the training data can be generated a priori in an offline manner, RL requires constant run-time interaction and data exchange with the CFD solver during training. In order to leverage the potential of RL-enhanced CFD, the interaction between the CFD solver and the RL algorithm thus has to be implemented efficiently on high-performance computing (HPC) hardware. To this end, we present Relexi as a scalable RL framework that bridges the gap between machine learning workflows and modern CFD solvers on HPC systems, providing both components with their specialized hardware. Relexi is built with modularity in mind and allows easy integration of various HPC solvers by means of the in-memory data transfer provided by the SmartSim library. Here, we demonstrate that the Relexi framework can scale up to hundreds of parallel environments on thousands of cores. This makes it possible to leverage modern HPC resources to either enable larger problems or faster turnaround times. Finally, we demonstrate the potential of an RL-augmented CFD solver by finding a control strategy for optimal eddy viscosity selection in large eddy simulations.
    Two-layer neural networks with values in a Banach space. (arXiv:2105.02095v2 [cs.LG] UPDATED)
    We study two-layer neural networks whose domain and range are Banach spaces with separable preduals. In addition, we assume that the image space is equipped with a partial order, i.e. it is a Riesz space. As the nonlinearity we choose the lattice operation of taking the positive part; in case of $\mathbb R^d$-valued neural networks this corresponds to the ReLU activation function. We prove inverse and direct approximation theorems with Monte-Carlo rates for a certain class of functions, extending existing results for the finite-dimensional case. In the second part of the paper, we study, from the regularisation theory viewpoint, the problem of finding optimal representations of such functions via signed measures on a latent space from a finite number of noisy observations. We discuss regularity conditions known as source conditions and obtain convergence rates in a Bregman distance for the representing measure in the regime when both the noise level goes to zero and the number of samples goes to infinity at appropriate rates.
    Discovering the building blocks of dark matter halo density profiles with neural networks. (arXiv:2203.08827v2 [astro-ph.CO] UPDATED)
    The density profiles of dark matter halos are typically modeled using empirical formulae fitted to the density profiles of relaxed halo populations. We present a neural network model that is trained to learn the mapping from the raw density field containing each halo to the dark matter density profile. We show that the model recovers the widely-used Navarro-Frenk-White (NFW) profile out to the virial radius, and can additionally describe the variability in the outer profile of the halos. The neural network architecture consists of a supervised encoder-decoder framework, which first compresses the density inputs into a low-dimensional latent representation, and then outputs $\rho(r)$ for any desired value of radius $r$. The latent representation contains all the information used by the model to predict the density profiles. This allows us to interpret the latent representation by quantifying the mutual information between the representation and the halos' ground-truth density profiles. A two-dimensional representation is sufficient to accurately model the density profiles up to the virial radius; however, a three-dimensional representation is required to describe the outer profiles beyond the virial radius. The additional dimension in the representation contains information about the infalling material in the outer profiles of dark matter halos, thus discovering the splashback boundary of halos without prior knowledge of the halos' dynamical history.
    Space4HGNN: A Novel, Modularized and Reproducible Platform to Evaluate Heterogeneous Graph Neural Network. (arXiv:2202.09177v2 [cs.LG] UPDATED)
    Heterogeneous Graph Neural Networks (HGNNs) have been successfully employed in various tasks, but we cannot accurately assess the importance of different design dimensions of HGNNs due to their diverse architectures and application scenarios. Besides, in the HGNN research community, implementing and evaluating various tasks still requires considerable human effort. To mitigate these issues, we first propose a unified framework covering most HGNNs, consisting of three components: heterogeneous linear transformation, heterogeneous graph transformation, and a heterogeneous message passing layer. Then we build a platform, Space4HGNN, by defining a design space for HGNNs based on the unified framework, which offers modularized components, reproducible implementations, and standardized evaluation for HGNNs. Finally, we conduct experiments to analyze the effect of different designs. With the insights found, we distill a condensed design space and verify its effectiveness.
    Stability to Deformations in Manifold Neural Networks. (arXiv:2106.03725v2 [cs.LG] UPDATED)
    Stability is an important property of graph neural networks (GNNs) which explains their success in many problems of practical interest. Existing GNN stability results depend on the size of the graph, restricting applicability to graphs of moderate size. To understand the stability properties of GNNs on large graphs, we define manifold convolutions and consider neural networks supported on manifolds. These are defined in terms of manifold diffusions mediated by the Laplace-Beltrami (LB) operator and are interpreted as limits of GNNs running on graphs of growing size. We define manifold deformations and show that they lead to perturbations of the manifold's LB operator that consist of an absolute and a relative perturbation term. We then define two frequency dependent manifold filters that split the infinite dimensional spectrum of the LB operator in finite partitions, and prove that these filters are stable to absolute and relative perturbations of the LB operator respectively. We also observe a trade-off between the stability and the discriminability from the stability bounds. Moreover, manifold neural networks (MNNs) composed of these filters inherit the stability properties while the nonlinear activation function helps to improve the discriminability. Therefore, the MNNs can be both stable and discriminative. We verify our results numerically in shape classification with point cloud datasets.
    Data-Driven Upper Bounds on Channel Capacity. (arXiv:2205.06471v1 [cs.IT])
    We consider the problem of estimating an upper bound on the capacity of a memoryless channel with unknown channel law and continuous output alphabet. A novel data-driven algorithm is proposed that exploits the dual representation of capacity where the maximization over the input distribution is replaced with a minimization over a reference distribution on the channel output. To efficiently compute the required divergence maximization between the conditional channel and the reference distribution, we use a modified mutual information neural estimator that takes the channel input as an additional parameter. We evaluate our approach on different memoryless channels and show that the estimated upper bounds closely converge either to the channel capacity or to best-known lower bounds.
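    The dual representation mentioned above is the standard one: for any reference output distribution $R_Y$, the divergence maximized over inputs upper-bounds capacity,
        $C \le \max_{x} D\big(P_{Y|X=x} \,\|\, R_Y\big)$,
    with equality when $R_Y$ is the capacity-achieving output distribution; the algorithm estimates this bound from data with a neural divergence estimator.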
    Detecting Rumours with Latency Guarantees using Massive Streaming Data. (arXiv:2205.06580v1 [cs.SI])
    Today's social networks continuously generate massive streams of data, which provide a valuable starting point for the detection of rumours as soon as they start to propagate. However, rumour detection faces tight latency bounds, which cannot be met by contemporary algorithms, given the sheer volume of high-velocity streaming data emitted by social networks. Hence, in this paper, we argue for best-effort rumour detection that detects most rumours quickly rather than all rumours with a high delay. To this end, we combine techniques for efficient, graph-based matching of rumour patterns with effective load shedding that discards some of the input data while minimising the loss in accuracy. Experiments with large-scale real-world datasets illustrate the robustness of our approach in terms of runtime performance and detection accuracy under diverse streaming conditions.
    DRBM-ClustNet: A Deep Restricted Boltzmann-Kohonen Architecture for Data Clustering. (arXiv:2205.06697v1 [cs.LG])
    A Bayesian Deep Restricted Boltzmann-Kohonen architecture for data clustering, termed DRBM-ClustNet, is proposed. The core clustering engine consists of a Deep Restricted Boltzmann Machine (DRBM) that processes unlabeled data by creating new uncorrelated, high-variance features. Next, the number of clusters is predicted using the Bayesian Information Criterion (BIC), followed by a Kohonen network-based clustering layer. The processing of unlabeled data is done in three stages for efficient clustering of non-linearly separable datasets. In the first stage, the DRBM performs non-linear feature extraction, capturing a highly complex data representation by projecting feature vectors of $d$ dimensions into $n$ dimensions. Most clustering algorithms require the number of clusters to be decided a priori; hence, to automate this choice, the second stage uses BIC. In the third stage, the number of clusters derived from BIC forms the input for the Kohonen network, which performs clustering of the feature-extracted data obtained from the DRBM. This method overcomes general disadvantages of clustering algorithms such as the prior specification of the number of clusters, convergence to local optima, and poor clustering accuracy on non-linear datasets. In this research we use two synthetic datasets, fifteen benchmark datasets from the UCI Machine Learning Repository, and four image datasets to analyze DRBM-ClustNet. The proposed framework is evaluated based on clustering accuracy and ranked against other state-of-the-art clustering methods. The obtained results demonstrate that DRBM-ClustNet outperforms state-of-the-art clustering algorithms.
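    The second stage can be illustrated with a standard BIC sweep; the Gaussian mixture below is an illustrative BIC-scored model, not necessarily the exact procedure used in DRBM-ClustNet.

        # Choose the number of clusters as the k minimizing BIC.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        def n_clusters_by_bic(X, k_max=10):
            bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
                    for k in range(1, k_max + 1)]
            return int(np.argmin(bics)) + 1

        X = np.random.rand(300, 5)
        print(n_clusters_by_bic(X))                    # feeds the Kohonen clustering layer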
    DualCF: Efficient Model Extraction Attack from Counterfactual Explanations. (arXiv:2205.06504v1 [cs.CR])
    Cloud service providers have launched Machine-Learning-as-a-Service (MLaaS) platforms to allow users to access large-scale cloud-based models via APIs. In addition to prediction outputs, these APIs can also provide other information in a more human-understandable way, such as counterfactual explanations (CF). However, such extra information inevitably causes the cloud models to be more vulnerable to extraction attacks, which aim to steal the internal functionality of models in the cloud. Due to the black-box nature of cloud models, however, a vast number of queries are inevitably required by existing attack strategies before the substitute model achieves high fidelity. In this paper, we propose a novel, simple yet efficient querying strategy to greatly enhance the querying efficiency of stealing a classification model. This is motivated by our observation that current querying strategies suffer from a decision boundary shift issue induced by taking far-distant queries and close-to-boundary CFs into substitute model training. We then propose the DualCF strategy to circumvent the above issues, which is achieved by taking not only the CF but also the counterfactual explanation of the CF (CCF) as pairs of training samples for the substitute model. Extensive and comprehensive experimental evaluations are conducted on both synthetic and real-world datasets. The experimental results favorably illustrate that DualCF can efficiently and effectively produce a high-fidelity model with fewer queries.
    l-Leaks: Membership Inference Attacks with Logits. (arXiv:2205.06469v1 [cs.LG])
    Machine Learning (ML) has made unprecedented progress in the past several decades. However, due to the memorability of the training data, ML is susceptible to various attacks, especially Membership Inference Attacks (MIAs), the objective of which is to infer the model's training data. So far, most membership inference attacks against ML classifiers leverage a shadow model with the same structure as the target model. However, empirical results show that these attacks can be easily mitigated if the shadow model is not clear about the network structure of the target model. In this paper, we present attacks based on black-box access to the target model. We name our attack l-Leaks. l-Leaks follows the intuition that if an established shadow model is similar enough to the target model, then the adversary can leverage the shadow model's information to predict a target sample's membership. The logits of the trained target model contain valuable sample knowledge. We build the shadow model by learning the logits of the target model, making the shadow model more similar to the target model. The shadow model will then have sufficient confidence in the member samples of the target model. We also discuss the effect of the shadow model's different network structures on attack results. Experiments over different networks and datasets demonstrate that our attacks achieve strong performance.
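    The logit-matching step can be sketched as a distillation-style objective: the shadow model is trained to reproduce the logits the target returns on queried samples. The architecture, data, and the MSE choice below are placeholders rather than the paper's exact setup.

        # One training step: fit the shadow model to the target's logits.
        import torch
        import torch.nn as nn

        def shadow_training_step(shadow, x, target_logits, opt):
            """target_logits: logits returned by black-box queries on batch x."""
            opt.zero_grad()
            loss = nn.functional.mse_loss(shadow(x), target_logits)
            loss.backward()
            opt.step()
            return loss.item()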
    Uninorm-like parametric activation functions for human-understandable neural models. (arXiv:2205.06547v1 [cs.AI])
    We present a deep learning model for finding human-understandable connections between input features. Our approach uses a parameterized, differentiable activation function, based on the theoretical background of nilpotent fuzzy logic and multi-criteria decision-making (MCDM). The learnable parameter has a semantic meaning indicating the level of compensation between input features. The neural network determines the parameters using gradient descent to find human-understandable relationships between input features. We demonstrate the utility and effectiveness of the model by successfully applying it to classification problems from the UCI Machine Learning Repository.
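    The mechanism can be sketched generically in PyTorch: the activation carries a learnable scalar that is trained by gradient descent alongside the network weights. The cutting-style function below is only loosely inspired by nilpotent fuzzy logic; the paper's actual uninorm-like operator is not reproduced.

        # Activation with a learnable, semantically meaningful parameter.
        import torch
        import torch.nn as nn

        class ParametricActivation(nn.Module):
            def __init__(self):
                super().__init__()
                self.nu = nn.Parameter(torch.tensor(0.5))   # compensation-level parameter

            def forward(self, x):
                # bounded "cutting" output, as used in nilpotent logical operators
                return torch.clamp(x + self.nu - 0.5, 0.0, 1.0)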
    FRC-TOuNN: Topology Optimization of Continuous Fiber Reinforced Composites using Neural Network. (arXiv:2205.03737v1 [cs.CE] CROSS LISTED)
    In this paper, we present a topology optimization (TO) framework to simultaneously optimize the matrix topology and fiber distribution of functionally graded continuous fiber-reinforced composites (FRC). Current approaches in density-based TO for FRC use the underlying finite element mesh both for analysis and design representation. This poses several limitations while enforcing sub-element fiber spacing and generating high-resolution continuous fibers. In contrast, we propose a mesh-independent representation based on a neural network (NN) both to capture the matrix topology and fiber distribution. The implicit NN-based representation enables geometric and material queries at a higher resolution than a mesh discretization. This leads to the accurate extraction of functionally-graded continuous fibers. Further, by integrating the finite element simulations into the NN computational framework, we can leverage automatic differentiation for end-to-end automated sensitivity analysis, i.e., we no longer need to manually derive cumbersome sensitivity expressions. We demonstrate the effectiveness and computational efficiency of the proposed method through several numerical examples involving various objective functions. We also show that the optimized continuous fiber reinforced composites can be directly fabricated at high resolution using additive manufacturing.  ( 2 min )
    Modular Adaptive Policy Selection for Multi-Task Imitation Learning through Task Division. (arXiv:2203.14855v2 [cs.LG] UPDATED)
    Deep imitation learning requires many expert demonstrations, which can be hard to obtain, especially when many tasks are involved. However, different tasks often share similarities, so learning them jointly can greatly benefit them and alleviate the need for many demonstrations. But, joint multi-task learning often suffers from negative transfer, sharing information that should be task-specific. In this work, we introduce a method to perform multi-task imitation while allowing for task-specific features. This is done by using proto-policies as modules to divide the tasks into simple sub-behaviours that can be shared. The proto-policies operate in parallel and are adaptively chosen by a selector mechanism that is jointly trained with the modules. Experiments on different sets of tasks show that our method improves upon the accuracy of single agents, task-conditioned and multi-headed multi-task agents, as well as state-of-the-art meta learning agents. We also demonstrate its ability to autonomously divide the tasks into both shared and task-specific sub-behaviours.  ( 2 min )
    Exploiting Expert-guided Symmetry Detection in Offline Reinforcement Learning. (arXiv:2112.09943v2 [cs.LG] UPDATED)
    Offline estimation of the dynamical model of a Markov Decision Process (MDP) is a non-trivial task that greatly depends on the data available to the learning phase. Sometimes the dynamics of the model is invariant with respect to some transformations of the current state and action. Recent works showed that an expert-guided pipeline relying on density estimation methods, such as Deep Neural Network-based Normalizing Flows, effectively detects this structure in deterministic environments, both categorical and continuous-valued. The acquired knowledge can be exploited to augment the original data set, leading eventually to a reduction in the distributional shift between the true and the learnt model. Such a data augmentation technique can be exploited as a preliminary process to be executed before the adoption of an Offline Reinforcement Learning architecture, increasing its performance. In this work we extend the paradigm to also tackle non-deterministic MDPs; in particular, 1) we propose a detection threshold in categorical environments based on statistical distances, and 2) we show that the former results lead to a performance improvement when solving the learnt MDP and then applying the optimal policy in the real environment.  ( 2 min )
    Rainbow Differential Privacy. (arXiv:2202.03974v2 [cs.CR] UPDATED)
    We extend to multi-valued functions a previous framework for designing differentially private (DP) mechanisms via randomized graph colorings, which was restricted to binary functions corresponding to colorings in a graph. As before, datasets are nodes in the graph and any two neighboring datasets are connected by an edge. In our setting, we assume that each dataset has a preferential ordering for the possible outputs of the mechanism, each of which we refer to as a rainbow. Different rainbows partition the graph of datasets into different regions. We show that if the DP mechanism is pre-specified at the boundary of such regions and behaves identically for all same-rainbow boundary datasets, at most one optimal such mechanism can exist and the problem can be solved by means of a morphism to a line graph. We then show closed-form expressions for the line graph in the case of ternary functions. The treatment of ternary queries in this paper displays enough richness to be extended to higher-dimensional query spaces with preferential query ordering, but the optimality proof does not seem to follow directly from the ternary proof.  ( 2 min )
    Improving Contextual Representation with Gloss Regularized Pre-training. (arXiv:2205.06603v1 [cs.CL])
    Though achieving impressive results on many NLP tasks, BERT-like masked language models (MLM) encounter the discrepancy between pre-training and inference. In light of this gap, we investigate the contextual representation of pre-training and inference from the perspective of word probability distribution. We discover that BERT risks neglecting the contextual word similarity in pre-training. To tackle this issue, we propose an auxiliary gloss regularizer module for BERT pre-training (GR-BERT), to enhance word semantic similarity. By predicting masked words and aligning contextual embeddings to corresponding glosses simultaneously, the word similarity can be explicitly modeled. We design two architectures for GR-BERT and evaluate our model in downstream tasks. Experimental results show that the gloss regularizer benefits BERT in word-level and sentence-level semantic representation. GR-BERT achieves a new state-of-the-art in the lexical substitution task and greatly promotes BERT sentence representation in both unsupervised and supervised STS tasks.  ( 2 min )
    AutoMat: Accelerated Computational Electrochemical systems Discovery. (arXiv:2011.04426v4 [cond-mat.mtrl-sci] UPDATED)
    Large-scale electrification is vital to addressing the climate crisis, but several scientific and technological challenges remain to fully electrify both the chemical industry and transportation. In both of these areas, new electrochemical materials will be critical, but their development currently relies heavily on human-time-intensive experimental trial and error and computationally expensive first-principles, meso-scale and continuum simulations. We present an automated workflow, AutoMat, that accelerates these computational steps by introducing both automated input generation and management of simulations across scales from first principles to continuum device modeling. Furthermore, we show how to seamlessly integrate multi-fidelity predictions such as machine learning surrogates or automated robotic experiments "in-the-loop". The automated framework is implemented with design space search techniques to dramatically accelerate the overall materials discovery pipeline by implicitly learning design features that optimize device performance across several metrics. We discuss the benefits of AutoMat using examples in electrocatalysis and energy storage and highlight lessons learned.  ( 2 min )
    Fast Conditional Network Compression Using Bayesian HyperNetworks. (arXiv:2205.06404v1 [cs.LG])
    We introduce a conditional compression problem and propose a fast framework for tackling it. The problem is how to quickly compress a pretrained large neural network into optimal smaller networks given target contexts, e.g. a context involving only a subset of classes or a context where only limited compute resource is available. To solve this, we propose an efficient Bayesian framework to compress a given large network into much smaller size tailored to meet each contextual requirement. We employ a hypernetwork to parameterize the posterior distribution of weights given conditional inputs and minimize a variational objective of this Bayesian neural network. To further reduce the network sizes, we propose a new input-output group sparsity factorization of weights to encourage more sparseness in the generated weights. Our methods can quickly generate compressed networks with significantly smaller sizes than baseline methods.
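    A minimal hypernetwork sketch, assuming PyTorch: a small network maps a conditional context vector to the weights of a target linear layer. The Bayesian posterior parameterization, variational objective, and group-sparsity factorization of the paper are omitted.

        # Context-conditioned weight generation for one linear layer.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class HyperLinear(nn.Module):
            def __init__(self, ctx_dim, in_dim, out_dim):
                super().__init__()
                self.in_dim, self.out_dim = in_dim, out_dim
                self.hyper = nn.Linear(ctx_dim, out_dim * (in_dim + 1))  # weights + bias

            def forward(self, x, ctx):
                params = self.hyper(ctx)
                W = params[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
                b = params[self.out_dim * self.in_dim:]
                return F.linear(x, W, b)

        layer = HyperLinear(ctx_dim=8, in_dim=16, out_dim=4)
        y = layer(torch.randn(32, 16), torch.randn(8))   # context-conditioned forward pass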
    Verifiable and Compositional Reinforcement Learning Systems. (arXiv:2106.05864v3 [cs.LG] UPDATED)
    We propose a framework for verifiable and compositional reinforcement learning (RL) in which a collection of RL subsystems, each of which learns to accomplish a separate subtask, are composed to achieve an overall task. The framework consists of a high-level model, represented as a parametric Markov decision process (pMDP) which is used to plan and to analyze compositions of subsystems, and of the collection of low-level subsystems themselves. By defining interfaces between the subsystems, the framework enables automatic decompositions of task specifications, e.g., reach a target set of states with a probability of at least 0.95, into individual subtask specifications, i.e. achieve the subsystem's exit conditions with at least some minimum probability, given that its entry conditions are met. This in turn allows for the independent training and testing of the subsystems; if they each learn a policy satisfying the appropriate subtask specification, then their composition is guaranteed to satisfy the overall task specification. Conversely, if the subtask specifications cannot all be satisfied by the learned policies, we present a method, formulated as the problem of finding an optimal set of parameters in the pMDP, to automatically update the subtask specifications to account for the observed shortcomings. The result is an iterative procedure for defining subtask specifications, and for training the subsystems to meet them. As an additional benefit, this procedure allows for particularly challenging or important components of an overall task to be determined automatically, and focused on, during training. Experimental results demonstrate the presented framework's novel capabilities.  ( 3 min )
    Collaborative Drug Discovery: Inference-level Data Protection Perspective. (arXiv:2205.06506v1 [cs.CR])
    The pharmaceutical industry can better leverage its data assets to virtualize drug discovery through a collaborative machine learning platform. On the other hand, there are non-negligible risks stemming from the unintended leakage of participants' training data; hence, it is essential for such a platform to be secure and privacy-preserving. This paper describes a privacy risk assessment for collaborative modeling in the preclinical phase of drug discovery to accelerate the selection of promising drug candidates. After a short taxonomy of state-of-the-art inference attacks, we adopt and customize several of them to the underlying scenario. Finally, we describe and experiment with a handful of relevant privacy protection techniques to mitigate such attacks.  ( 2 min )
    Computing Multiple Image Reconstructions with a Single Hypernetwork. (arXiv:2202.11009v3 [cs.CV] UPDATED)
    Deep learning based techniques achieve state-of-the-art results in a wide range of image reconstruction tasks like compressed sensing. These methods almost always have hyperparameters, such as the weight coefficients that balance the different terms in the optimized loss function. The typical approach is to train the model for a hyperparameter setting determined with some empirical or theoretical justification. Thus, at inference time, the model can only compute reconstructions corresponding to the pre-determined hyperparameter values. In this work, we present a hypernetwork-based approach, called HyperRecon, to train reconstruction models that are agnostic to hyperparameter settings. At inference time, HyperRecon can efficiently produce diverse reconstructions, which would each correspond to different hyperparameter values. In this framework, the user is empowered to select the most useful output(s) based on their own judgement. We demonstrate our method in compressed sensing, super-resolution and denoising tasks, using two large-scale and publicly-available MRI datasets. Our code is available at https://github.com/alanqrwang/hyperrecon.  ( 2 min )
    Autonomous Navigation and Configuration of Integrated Access Backhauling for UAV Base Station Using Reinforcement Learning. (arXiv:2112.07313v2 [cs.LG] UPDATED)
    Fast and reliable connectivity is essential to enhancing situational awareness and operational efficiency for public safety mission-critical (MC) users. In emergency or disaster circumstances, where existing cellular network coverage and capacity may not be available to meet MC communication demands, deployable-network-based solutions such as cells-on-wheels/wings can be utilized swiftly to ensure reliable connection for MC users. In this paper, we consider a scenario where a macro base station (BS) is destroyed due to a natural disaster and an unmanned aerial vehicle carrying BS (UAV-BS) is set up to provide temporary coverage for users in the disaster area. The UAV-BS is integrated into the mobile network using the 5G integrated access and backhaul (IAB) technology. We propose a framework and signalling procedure for applying machine learning to this use case. A deep reinforcement learning algorithm is designed to jointly optimize the access and backhaul antenna tilt as well as the three-dimensional location of the UAV-BS in order to best serve the on-ground MC users while maintaining a good backhaul connection. Our result shows that the proposed algorithm can autonomously navigate and configure the UAV-BS to improve the throughput and reduce the drop rate of MC users.  ( 2 min )
    CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model. (arXiv:2201.01018v2 [q-bio.GN] UPDATED)
    Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationship between microbes. Experimental methods for host prediction cannot keep pace with the fast accumulation of sequenced phages. Thus, there is a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited known interactions and the sheer amount of sequenced phages by high-throughput sequencing technologies. The state-of-the-art methods can only achieve 43\% accuracy at the species level. In this work, we formulate host prediction as link prediction in a knowledge graph that integrates multiple protein and DNA-based sequence features. Our implementation named CHERRY can be applied to predict hosts for newly discovered viruses and to identify viruses infecting targeted bacteria. We demonstrated the utility of CHERRY for both applications and compared its performance with 11 popular host prediction methods. To the best of our knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all the existing methods at the species level with an accuracy increase of 37\%. In addition, CHERRY's performance on short contigs is more stable than other tools.  ( 2 min )
    A Machine Learning Analysis of Impact of the Covid-19 Pandemic on Alcohol Consumption Habit Changes Among Healthcare Workers in the U.S. (arXiv:2112.06261v2 [cs.LG] UPDATED)
    In this paper, we discuss the impact of the Covid-19 pandemic on alcohol consumption habit changes among healthcare workers in the United States. We utilize multiple supervised and unsupervised machine learning methods and models, such as Decision Trees, Logistic Regression, the Naive Bayes classifier, k-Nearest Neighbors, Support Vector Machines, the Multilayer perceptron, Random Forests, XGBoost, CatBoost, LightGBM, Synthetic Minority Oversampling, the Chi-Squared Test and the mutual information method, on mental health survey data obtained from the University of Michigan Inter-University Consortium for Political and Social Research to identify relationships between COVID-19-related negative effects and alcohol consumption habit changes among healthcare workers. Our findings suggest that COVID-19-related school closures, COVID-19-related work schedule changes and COVID-related news exposure may lead to an increase in alcohol use among healthcare workers in the United States.  ( 2 min )
    ARCADE: Adversarially Regularized Convolutional Autoencoder for Network Anomaly Detection. (arXiv:2205.01432v2 [cs.LG] UPDATED)
    As the number of heterogeneous IP-connected devices and the traffic volume increase, so does the potential for security breaches. The undetected exploitation of these breaches can bring severe cybersecurity and privacy risks. In this paper, we present a practical unsupervised anomaly-based deep learning detection system called ARCADE (Adversarially Regularized Convolutional Autoencoder for unsupervised network anomaly DEtection). ARCADE exploits the properties of 1D Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) to automatically build a profile of the normal traffic from a subset of the raw bytes of a few initial packets of network flows, so that potential network anomalies and intrusions can be detected before they can cause further damage to the network. A convolutional Autoencoder (AE) is proposed that suits online detection in resource-constrained environments and can easily be scaled up for environments with higher computational capabilities. An adversarial training strategy is proposed to regularize the AE and restrict its ability to reconstruct network flows that lie outside the normal distribution, thereby improving its anomaly detection capabilities. The proposed approach is more effective than existing state-of-the-art deep learning approaches for network anomaly detection and significantly reduces detection time. The evaluation results show that the proposed approach is suitable for anomaly detection on resource-constrained hardware platforms such as the Raspberry Pi.  ( 2 min )
    Enhanced Bilevel Optimization via Bregman Distance. (arXiv:2107.12301v2 [math.OC] UPDATED)
    Bilevel optimization has been widely applied to many machine learning problems such as hyperparameter optimization, policy optimization and meta learning. Although many bilevel optimization methods have recently been proposed, they still suffer from high computational complexity and do not consider more general bilevel problems with nonsmooth regularization. In this paper, we therefore propose a class of enhanced bilevel optimization methods that use the Bregman distance to solve bilevel problems in which the outer subproblem is nonconvex and possibly nonsmooth and the inner subproblem is strongly convex. Specifically, we propose a bilevel optimization method based on the Bregman distance (BiO-BreD) for solving deterministic bilevel problems, which achieves a lower computational complexity than the best known results. We also propose a stochastic bilevel optimization method (SBiO-BreD) for stochastic bilevel problems, based on stochastically approximated gradients and the Bregman distance. Moreover, we propose an accelerated version of SBiO-BreD (ASBiO-BreD) that uses a variance-reduction technique and achieves a lower computational complexity than the best known result with respect to the condition number $\kappa$ and the target accuracy $\epsilon$ for finding an $\epsilon$-stationary point. We employ the data hyper-cleaning task to demonstrate that our algorithms outperform existing bilevel algorithms.  ( 2 min )
    Anomaly Detection using Principles of Human Perception. (arXiv:2103.12323v4 [cs.CR] UPDATED)
    In the fields of statistics and unsupervised machine learning, anomaly detection is a fundamental and well-studied problem. Anomalies are difficult to define, yet many algorithms have been proposed, all resting on the nebulous understanding that anomalies are rare, unusual or inconsistent with the majority of the data. The present work provides a philosophical treatise that clearly defines anomalies and develops an algorithm for their efficient detection with minimal user intervention. Inspired by the Gestalt School of Psychology and the Helmholtz principle of human perception, anomalies are taken to be observations that are unexpected with respect to certain groupings made by the majority of the data. Under appropriate random-variable modelling, anomalies are found directly in a data set by assuming a uniform and independent distribution of the constituent elements of the observations: an observation is anomalous when the expected number of occurrences of its elements in a given view is $<1$. Starting from these fundamental principles of human perception, an unsupervised anomaly detection algorithm is developed that is simple, real-time and parameter-free. Experiments suggest it as a competitive choice for univariate data, with promising results on the detection of global anomalies in multivariate data.  ( 2 min )
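    The sketch below is one plausible instantiation of the a-contrario rule described above, for univariate data: a bin of values forms a "meaningful" grouping when the expected number of bins with at least its count, under a uniform-and-independent null model, is below one, and points outside every meaningful grouping are flagged. The binning and threshold are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch of the Helmholtz-principle (a-contrario) idea: groupings whose
# expected number of occurrences under a uniform null model is < 1 are
# "meaningful"; anomalies are points outside all meaningful groupings.
import numpy as np
from scipy.stats import binom

def helmholtz_anomalies(x, n_bins=20):
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    n = len(x)
    # Expected number of bins holding >= counts[b] points if the n points
    # were spread uniformly and independently over the bins.
    nfa = n_bins * binom.sf(counts - 1, n, 1.0 / n_bins)
    meaningful = nfa < 1.0
    return ~meaningful[idx]   # True for points outside every meaningful grouping

x = np.concatenate([np.random.normal(0, 1, 500), np.array([8.0, 9.5])])
print(np.where(helmholtz_anomalies(x))[0])   # the two injected outliers
```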
    M\"obius Convolutions for Spherical CNNs. (arXiv:2201.12212v2 [cs.CV] UPDATED)
    M\"obius transformations play an important role in both geometry and spherical image processing - they are the group of conformal automorphisms of 2D surfaces and the spherical equivalent of homographies. Here we present a novel, M\"obius-equivariant spherical convolution operator which we call M\"obius convolution, and with it, develop the foundations for M\"obius-equivariant spherical CNNs. Our approach is based on a simple observation: to achieve equivariance, we only need to consider the lower-dimensional subgroup which transforms the positions of points as seen in the frames of their neighbors. To efficiently compute M\"obius convolutions at scale we derive an approximation of the action of the transformations on spherical filters, allowing us to compute our convolutions in the spectral domain with the fast Spherical Harmonic Transform. The resulting framework is both flexible and descriptive, and we demonstrate its utility by achieving promising results in both shape classification and image segmentation tasks.  ( 2 min )
    EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits. (arXiv:2110.03177v8 [cs.LG] UPDATED)
    In this paper, we propose EE-Net, a novel neural exploration strategy for contextual bandits that is distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In the recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function, combined with TS or UCB strategies for exploration. However, this line of work explicitly assumes that the reward is a linear function of the arm vectors, which may not hold in real-world datasets. To overcome this limitation, a series of neural bandit algorithms have been proposed, in which a neural network learns the underlying reward function and TS or UCB is adapted for exploration. Instead of calculating a large-deviation-based statistical bound for exploration as in previous methods, EE-Net uses a neural network (the Exploitation network) to learn the reward function and another neural network (the Exploration network) to adaptively learn the potential gain, relative to the currently estimated reward, of exploring each arm. A decision maker is then constructed to combine the outputs of the Exploitation and Exploration networks. We prove that EE-Net achieves $\mathcal{O}(\sqrt{T\log T})$ regret and show that it outperforms existing linear and neural contextual bandit baselines on real-world datasets.
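    A minimal PyTorch sketch of the two-network scheme follows: an exploitation network predicts the reward, an exploration network maps the gradient of that prediction (a rough uncertainty proxy, as the abstract describes) to a potential gain, and an additive decision maker combines them. Training loops and layer sizes are simplified assumptions.

```python
# Hedged sketch of the EE-Net decision rule; sizes and the additive decision
# maker are illustrative simplifications, not the authors' exact design.
import torch
import torch.nn as nn

dim, hidden = 10, 32
f1 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))  # exploitation
f2 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))  # exploration

def score(arm):
    arm = arm.clone().requires_grad_(True)
    r_hat = f1(arm).sum()                     # estimated reward for this arm
    (grad,) = torch.autograd.grad(r_hat, arm)
    gain = f2(grad.detach())                  # predicted potential gain over r_hat
    return (r_hat.detach() + gain).item()     # additive decision maker

arms = [torch.randn(dim) for _ in range(5)]
chosen = max(range(len(arms)), key=lambda i: score(arms[i]))
```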
    Self-Supervised Learning for Domain Adaptation on Point-Clouds. (arXiv:2003.12641v5 [cs.CV] UPDATED)
    Self-supervised learning (SSL) is a technique for learning useful representations from unlabeled data. It has been applied effectively to domain adaptation (DA) on images and videos, but it is still unknown whether and how it can be leveraged for domain adaptation in 3D perception problems. Here we describe the first study of SSL for DA on point clouds. We introduce a new family of pretext tasks, Deformation Reconstruction, inspired by the deformations encountered in sim-to-real transformations. In addition, we propose a novel training procedure for labeled point cloud data, motivated by the MixUp method, called Point cloud Mixup (PCM). Evaluations on domain adaptation datasets for classification and segmentation demonstrate a large improvement over existing and baseline methods.
    Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions. (arXiv:2205.06811v1 [cs.LG])
    We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary, and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms in this setting are limited in that they either are computationally inefficient or require a strong assumption on the corruption, or their regret is at least $C$ times worse than the regret without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face of uncertainty. At the core of our algorithm is a weighted ridge regression where the weight of each chosen action depends on its confidence up to some threshold. We show that for both known $C$ and unknown $C$ cases, our algorithm with proper choice of hyperparameter achieves a regret that nearly matches the lower bounds. Thus, our algorithm is nearly optimal up to logarithmic factors for both cases. Notably, our algorithm achieves the near-optimal regret for both corrupted and uncorrupted cases ($C=0$) simultaneously.
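    The sketch below illustrates the core component named above: a weighted ridge regression in which each played action's weight is truncated once its confidence width exceeds a threshold. The specific weight rule and constants are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged numpy sketch of confidence-weighted ridge regression for a linear
# bandit; the truncation rule min(1, alpha/sigma) is an illustrative choice.
import numpy as np

d, lam, alpha = 5, 1.0, 1.0
A = lam * np.eye(d)    # weighted Gram matrix
b = np.zeros(d)

def update(x, r):
    global A, b
    sigma = np.sqrt(x @ np.linalg.solve(A, x))   # confidence width of action x
    w = min(1.0, alpha / sigma)                  # down-weight uncertain actions
    A += w * np.outer(x, x)
    b += w * r * x
    return np.linalg.solve(A, b)                 # weighted ridge estimate

theta_star = np.random.randn(d)
for _ in range(100):
    x = np.random.randn(d); x /= np.linalg.norm(x)
    r = x @ theta_star + 0.1 * np.random.randn()   # possibly corrupted reward
    theta_hat = update(x, r)
```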
    Sharp Asymptotics of Kernel Ridge Regression Beyond the Linear Regime. (arXiv:2205.06798v1 [cs.LG])
    The generalization performance of kernel ridge regression (KRR) exhibits a multi-phased pattern that crucially depends on the scaling relationship between the sample size $n$ and the underlying dimension $d$. This phenomenon is due to the fact that KRR sequentially learns functions of increasing complexity as the sample size increases; when $d^{k-1}\ll n\ll d^{k}$, only polynomials with degree less than $k$ are learned. In this paper, we present sharp asymptotic characterization of the performance of KRR at the critical transition regions with $n \asymp d^k$, for $k\in\mathbb{Z}^{+}$. Our asymptotic characterization provides a precise picture of the whole learning process and clarifies the impact of various parameters (including the choice of the kernel function) on the generalization performance. In particular, we show that the learning curves of KRR can have a delicate "double descent" behavior due to specific bias-variance trade-offs at different polynomial scaling regimes.
    A Comprehensive Survey of Few-shot Learning: Evolution, Applications, Challenges, and Opportunities. (arXiv:2205.06743v1 [cs.LG])
    Few-shot learning (FSL) has emerged as an effective learning method with great potential. Despite recent creative works tackling FSL tasks, rapidly learning valid information from just a few or even zero samples remains a serious challenge. In this context, we extensively reviewed more than 200 recent papers on FSL published in the past three years, aiming to present a timely and comprehensive overview of the latest advances in FSL along with impartial comparisons of the strengths and weaknesses of existing works. To avoid conceptual confusion, we first elaborate on and compare a set of related concepts: few-shot learning, transfer learning, and meta-learning. Furthermore, we propose a novel taxonomy that classifies existing work by the level of abstraction of knowledge, in accordance with the challenges of FSL. To enrich this survey, each subsection provides in-depth analysis and insightful discussion of recent advances on these topics. Moreover, taking computer vision as an example, we highlight important applications of FSL across various research hotspots. Finally, we conclude the survey with insights into technology evolution trends and potential future research opportunities, in the hope of providing guidance for follow-up research.
    Differentiable Graph Module (DGM) for Graph Convolutional Networks. (arXiv:2002.04999v4 [cs.LG] UPDATED)
    Graph deep learning has recently emerged as a powerful ML concept that allows successful deep neural architectures to be generalized to non-Euclidean structured data. Such methods have shown promising results on a broad spectrum of applications ranging from social science, biomedicine, and particle physics to computer vision, graphics, and chemistry. One limitation of most current graph neural network architectures is that they are restricted to the transductive setting and rely on the assumption that the underlying graph is {\em known} and {\em fixed}. Often, this assumption does not hold, since the graph may be noisy, partially observed, or even completely unknown. In such cases, it is helpful to infer the graph directly from the data, especially in inductive settings where some nodes were not present in the graph at training time. Furthermore, learning a graph may become an end in itself, as the inferred structure may provide complementary insights alongside the downstream task. In this paper, we introduce the Differentiable Graph Module (DGM), a learnable function that predicts edge probabilities in the graph that are optimal for the downstream task. DGM can be combined with convolutional graph neural network layers and trained end to end. We provide an extensive evaluation on applications from the domains of healthcare (disease prediction), brain imaging (age prediction), computer graphics (3D point cloud segmentation), and computer vision (zero-shot learning). We show that our model provides a significant improvement over baselines in both transductive and inductive settings and achieves state-of-the-art results.
    Productivity Assessment of Neural Code Completion. (arXiv:2205.06537v1 [cs.SE])
    Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers' productivity but cannot measure it directly. In this case study, we asked users of GitHub Copilot about its impact on their productivity and sought a reflection of their perception in directly measurable user data. We find that the rate at which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers' perception of productivity.
    Latent-Graph Learning for Disease Prediction. (arXiv:2003.13620v2 [cs.LG] UPDATED)
    Recently, Graph Convolutional Networks (GCNs) have proven to be a powerful machine learning tool for Computer-Aided Diagnosis (CADx) and disease prediction. A key component in these models is the population graph, whose adjacency matrix represents pair-wise patient similarities. Until now, the similarity metric has been defined manually, usually based on meta-features like demographics or clinical scores. The definition of the metric, however, needs careful tuning, as GCNs are very sensitive to the graph structure. In this paper, we demonstrate for the first time in the CADx domain that it is possible to learn a single, optimal graph for the GCN's downstream task of disease classification. To this end, we propose a novel, end-to-end trainable graph learning architecture for dynamic and localized graph pruning. Unlike commonly employed spectral GCN approaches, our GCN is spatial and inductive, and can thus make predictions for previously unseen patients as well. We demonstrate significant classification improvements with our learned graph on two CADx problems in medicine. We further explain and visualize this result using an artificial dataset, underlining the importance of graph learning for more accurate and robust inference with GCNs in medical applications.
    The Design Space of E(3)-Equivariant Atom-Centered Interatomic Potentials. (arXiv:2205.06643v1 [stat.ML])
    The rapid progress of machine learning interatomic potentials over the past couple of years has produced a number of new architectures. Particularly notable among these are the Atomic Cluster Expansion (ACE), which unified many of the earlier ideas around atom-density-based descriptors, and Neural Equivariant Interatomic Potentials (NequIP), a message passing neural network with equivariant features that showed state-of-the-art accuracy. In this work, we construct a mathematical framework that unifies these models: ACE is generalised so that it can be recast as one layer of a multi-layer architecture, while the linearised version of NequIP can be understood as a particular sparsification of a much larger polynomial model. Our framework also provides a practical tool for systematically probing different choices in the unified design space. We demonstrate this with an ablation study of NequIP, via a set of experiments looking at in- and out-of-domain accuracy and smooth extrapolation very far from the training data, and shed some light on which design choices are critical for achieving high accuracy. Finally, we present BOTNet (Body-Ordered-Tensor-Network), a much-simplified version of NequIP which has an interpretable architecture and maintains accuracy on benchmark datasets.
    Generalized Variational Inference in Function Spaces: Gaussian Measures meet Bayesian Deep Learning. (arXiv:2205.06342v1 [stat.ML])
    We develop a framework for generalized variational inference in infinite-dimensional function spaces and use it to construct a method termed Gaussian Wasserstein inference (GWI). GWI leverages the Wasserstein distance between Gaussian measures on the Hilbert space of square-integrable functions in order to determine a variational posterior using a tractable optimisation criterion and avoids pathologies arising in standard variational function space inference. An exciting application of GWI is the ability to use deep neural networks in the variational parametrisation of GWI, combining their superior predictive performance with the principled uncertainty quantification analogous to that of Gaussian processes. The proposed method obtains state-of-the-art performance on several benchmark datasets.
    The ACM Multimedia 2022 Computational Paralinguistics Challenge: Vocalisations, Stuttering, Activity, & Mosquitoes. (arXiv:2205.06799v1 [cs.SD])
    The ACM Multimedia 2022 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: in the Vocalisations and Stuttering Sub-Challenges, a classification of human non-verbal vocalisations and speech has to be made; the Activity Sub-Challenge aims at beyond-audio human activity recognition from smartwatch sensor data; and in the Mosquitoes Sub-Challenge, mosquitoes need to be detected. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the usual ComParE and BoAW features, the auDeep toolkit, and deep feature extraction from pre-trained CNNs using the DeepSpectrum toolkit; in addition, we add end-to-end sequential modelling and a log-mel-128-BNN.  ( 2 min )
    Toward A Formalized Approach for Spike Sorting Algorithms and Hardware Evaluation. (arXiv:2205.06514v1 [cs.LG])
    Spike sorting algorithms are used to separate extracellular recordings of neuronal populations into single-unit spike activities. The development of customized hardware implementing spike sorting algorithms is burgeoning. However, there is a lack of a systematic approach and a set of standardized evaluation criteria to facilitate direct comparison of both software and hardware implementations. In this paper, we formalize a set of standardized criteria and a publicly available synthetic dataset entitled Synthetic Simulations Of Extracellular Recordings (SSOER), which was constructed by aggregating existing synthetic datasets with varying Signal-To-Noise Ratios (SNRs). Furthermore, we present a benchmark for future comparison, and use our criteria to evaluate a simulated Resistive Random-Access Memory (RRAM) In-Memory Computing (IMC) system using the Discrete Wavelet Transform (DWT) for feature extraction. Our system consumes approximately (per channel) 10.72mW and occupies an area of 0.66mm$^2$ in a 22nm FDSOI Complementary Metal-Oxide-Semiconductor (CMOS) process.
    NN-EUCLID: deep-learning hyperelasticity without stress data. (arXiv:2205.06664v1 [cs.LG])
    We propose a new approach for unsupervised learning of hyperelastic constitutive laws with physics-consistent deep neural networks. In contrast to supervised learning, which assumes the availability of stress-strain pairs, the approach only uses realistically measurable full-field displacement and global reaction force data, thus it lies within the scope of our recent framework for Efficient Unsupervised Constitutive Law Identification and Discovery (EUCLID) and we denote it as NN-EUCLID. The absence of stress labels is compensated for by leveraging a physics-motivated loss function based on the conservation of linear momentum to guide the learning process. The constitutive model is based on input-convex neural networks, which are capable of learning a function that is convex with respect to its inputs. By employing a specially designed neural network architecture, multiple physical and thermodynamic constraints for hyperelastic constitutive laws, such as material frame indifference, (poly-)convexity, and stress-free reference configuration are automatically satisfied. We demonstrate the ability of the approach to accurately learn several hidden isotropic and anisotropic hyperelastic constitutive laws - including e.g., Mooney-Rivlin, Arruda-Boyce, Ogden, and Holzapfel models - without using stress data. For anisotropic hyperelasticity, the unknown anisotropic fiber directions are automatically discovered jointly with the constitutive model. The neural network-based constitutive models show good generalization capability beyond the strain states observed during training and are readily deployable in a general finite element framework for simulating complex mechanical boundary value problems with good accuracy.  ( 2 min )
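    The input-convex network mentioned above can be sketched as follows: convexity in the input is preserved by keeping the hidden-to-hidden weights non-negative and using convex, non-decreasing activations. Layer sizes and the choice of inputs are illustrative assumptions; the physics-motivated momentum-balance loss of NN-EUCLID is not reproduced here.

```python
# Hedged PyTorch sketch of an input-convex neural network (ICNN), the kind of
# building block used for a convex strain-energy density. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    def __init__(self, dim_in=3, hidden=16, depth=2):
        super().__init__()
        self.Wx = nn.ModuleList([nn.Linear(dim_in, hidden) for _ in range(depth + 1)])
        # z-to-z weights are stored unconstrained and squashed through softplus
        # to keep them non-negative, preserving convexity in the input.
        self.Wz = nn.ParameterList([nn.Parameter(torch.randn(hidden, hidden) * 0.1)
                                    for _ in range(depth)])
        self.out = nn.Parameter(torch.randn(hidden) * 0.1)

    def forward(self, x):
        z = F.softplus(self.Wx[0](x))
        for Wz, Wx in zip(self.Wz, self.Wx[1:]):
            z = F.softplus(z @ F.softplus(Wz) + Wx(x))
        return z @ F.softplus(self.out)   # scalar energy, convex in the inputs

energy = ICNN()
strain = torch.randn(8, 3)   # stand-in for deformation invariants (assumption)
W = energy(strain)           # shape (8,)
```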
    PoisonedEncoder: Poisoning the Unlabeled Pre-training Data in Contrastive Learning. (arXiv:2205.06401v1 [cs.CR])
    Contrastive learning pre-trains an image encoder using a large amount of unlabeled data such that the image encoder can be used as a general-purpose feature extractor for various downstream tasks. In this work, we propose PoisonedEncoder, a data poisoning attack to contrastive learning. In particular, an attacker injects carefully crafted poisoning inputs into the unlabeled pre-training data, such that the downstream classifiers built based on the poisoned encoder for multiple target downstream tasks simultaneously classify attacker-chosen, arbitrary clean inputs as attacker-chosen, arbitrary classes. We formulate our data poisoning attack as a bilevel optimization problem, whose solution is the set of poisoning inputs; and we propose a contrastive-learning-tailored method to approximately solve it. Our evaluation on multiple datasets shows that PoisonedEncoder achieves high attack success rates while maintaining the testing accuracy of the downstream classifiers built upon the poisoned encoder for non-attacker-chosen inputs. We also evaluate five defenses against PoisonedEncoder, including one pre-processing, three in-processing, and one post-processing defenses. Our results show that these defenses can decrease the attack success rate of PoisonedEncoder, but they also sacrifice the utility of the encoder or require a large clean pre-training dataset.  ( 2 min )
    OFedQIT: Communication-Efficient Online Federated Learning via Quantization and Intermittent Transmission. (arXiv:2205.06491v1 [cs.LG])
    Online federated learning (OFL) is a promising framework for collaboratively learning a sequence of non-linear functions (or models) from distributed streaming data arriving at multiple clients while keeping their local data private. In this framework, we first construct a vanilla method (named OFedAvg) by incorporating online gradient descent (OGD) into the de facto aggregation method (FedAvg). Despite its optimal asymptotic performance, OFedAvg suffers from heavy communication overhead and long learning delay. To tackle these shortcomings, we propose a communication-efficient OFL algorithm (named OFedQIT) based on stochastic quantization and intermittent transmission. Our major contribution is to theoretically prove that OFedQIT over $T$ time slots achieves an optimal sublinear regret bound $\mathcal{O}(\sqrt{T})$ for any real data (including non-IID data) while significantly reducing the communication overhead. Furthermore, this optimality is still guaranteed even when only a small fraction of clients (those with faster processing times and high-quality communication channels) in the network participate in each round. Our analysis reveals that OFedQIT successfully addresses the drawbacks of OFedAvg while maintaining superior learning accuracy. Experiments with real datasets demonstrate the effectiveness of our algorithm on various online classification and regression tasks.  ( 2 min )
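    The two communication-saving ingredients named above can be sketched as an unbiased stochastic quantizer plus probabilistic uploads; the number of levels and transmit probability below are illustrative assumptions, not the paper's schedule.

```python
# Hedged numpy sketch of stochastic quantization and intermittent transmission
# on a single client; constants are illustrative.
import numpy as np

def stochastic_quantize(v, levels=8):
    """Unbiased quantizer: E[Q(v)] = v."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    scaled = np.abs(v) / norm * (levels - 1)
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part -> unbiasedness.
    q = lower + (np.random.rand(v.size) < (scaled - lower))
    return np.sign(v) * q * norm / (levels - 1)

def client_step(model, grad, lr=0.1, p_transmit=0.25):
    model = model - lr * grad                      # local online gradient step
    if np.random.rand() < p_transmit:              # intermittent transmission
        return model, stochastic_quantize(model)   # quantized upload to server
    return model, None                             # skip this round

model = np.zeros(10)
model, payload = client_step(model, np.random.randn(10))
```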
    Convergence Analysis of Deep Residual Networks. (arXiv:2205.06571v1 [cs.LG])
    Various powerful deep neural network architectures have made great contributions to the exciting successes of deep learning in the past two decades. Among them, deep Residual Networks (ResNets) are of particular importance because they demonstrated great usefulness in computer vision by winning first place in many deep learning competitions. Moreover, ResNets were the first class of truly deep neural networks in the history of deep learning. It is of mathematical interest and practical significance to understand the convergence of deep ResNets. We aim to characterize the convergence of deep ResNets, as the depth tends to infinity, in terms of the parameters of the networks. Toward this purpose, we first give a matrix-vector description of general deep neural networks with shortcut connections and formulate an explicit expression for the networks using the notions of activation domains and activation matrices. Convergence is then reduced to the convergence of two series involving infinite products of non-square matrices. By studying the two series, we establish a sufficient condition for pointwise convergence of ResNets. Our result justifies aspects of the design of ResNets. We also conduct experiments on benchmark machine learning data to verify our results.  ( 2 min )
    Convergence of Deep Neural Networks with General Activation Functions and Pooling. (arXiv:2205.06570v1 [cs.LG])
    Deep neural networks, as a powerful system for representing high-dimensional complex functions, play a key role in deep learning, and their convergence is a fundamental issue in building the mathematical foundation of the field. We investigated the convergence of deep ReLU networks and deep convolutional neural networks in two recent studies (arXiv:2107.12530, 2109.13542). Only the Rectified Linear Unit (ReLU) activation was studied therein, and the important pooling strategy was not considered. In the current work, we study the convergence of deep neural networks, as the depth tends to infinity, for two other important activation functions: the leaky ReLU and the sigmoid function. Pooling is also studied. As a result, we prove that the sufficient condition established in arXiv:2107.12530 and 2109.13542 remains sufficient for leaky ReLU networks. For contractive activation functions such as the sigmoid, we establish a weaker sufficient condition for uniform convergence of deep neural networks.  ( 2 min )
    Test-time Fourier Style Calibration for Domain Generalization. (arXiv:2205.06427v1 [cs.CV])
    The topic of generalizing machine learning models learned on a collection of source domains to unknown target domains is challenging. While many domain generalization (DG) methods have achieved promising results, they primarily rely on the source domains at train-time without manipulating the target domains at test-time. Thus, it is still possible that those methods can overfit to source domains and perform poorly on target domains. Driven by the observation that domains are strongly related to styles, we argue that reducing the gap between source and target styles can boost models' generalizability. To solve the dilemma of having no access to the target domain during training, we introduce Test-time Fourier Style Calibration (TF-Cal) for calibrating the target domain style on the fly during testing. To access styles, we utilize Fourier transformation to decompose features into amplitude (style) features and phase (semantic) features. Furthermore, we present an effective technique to Augment Amplitude Features (AAF) to complement TF-Cal. Extensive experiments on several popular DG benchmarks and a segmentation dataset for medical images demonstrate that our method outperforms state-of-the-art methods.  ( 2 min )
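    The Fourier amplitude/phase split that TF-Cal builds on can be sketched as follows: features are decomposed with the FFT, the amplitude (style) is calibrated toward a stored source-style statistic, and the phase (semantics) is kept. The convex-mixing calibration rule shown is an illustrative assumption, not necessarily the paper's exact rule.

```python
# Hedged numpy sketch of Fourier-based style calibration at test time.
import numpy as np

def calibrate_style(feat, source_amp_mean, alpha=0.5):
    F = np.fft.fft2(feat)
    amp, phase = np.abs(F), np.angle(F)              # style vs. semantic parts
    amp_cal = (1 - alpha) * amp + alpha * source_amp_mean
    return np.real(np.fft.ifft2(amp_cal * np.exp(1j * phase)))

target_feat = np.random.rand(16, 16)                           # one feature map
source_amp_mean = np.abs(np.fft.fft2(np.random.rand(16, 16)))  # stand-in statistic
calibrated = calibrate_style(target_feat, source_amp_mean)
```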
    Deep Learning for Prawn Farming: Forecasting and Anomaly Detection. (arXiv:2205.06359v1 [cs.LG])
    We present a decision support system for managing water quality in prawn ponds. The system uses various sources of data and deep learning models in a novel way to provide 24-hour forecasting and anomaly detection of water quality parameters. It provides prawn farmers with tools to proactively avoid a poor growing environment, thereby optimising growth and reducing the risk of losing stock. This is a major shift for farmers who are forced to manage ponds by reactively correcting poor water quality conditions. To our knowledge, we are the first to apply Transformer as an anomaly detection model, and the first to apply anomaly detection in general to this aquaculture problem. Our technical contributions include adapting ForecastNet for multivariate data and adapting Transformer and the Attention model to incorporate weather forecast data into their decoders. We attain an average mean absolute percentage error of 12% for dissolved oxygen forecasts and we demonstrate two anomaly detection case studies. The system is successfully running in its second year of deployment on a commercial prawn farm.  ( 2 min )
    StyLandGAN: A StyleGAN based Landscape Image Synthesis using Depth-map. (arXiv:2205.06611v1 [cs.CV])
    Despite recent success in conditional image synthesis, prevalent input conditions such as semantics and edges are not expressive enough to convey `Linear (Ridges)' and `Planar (Scale)' representations. To address this problem, we propose a novel framework, StyLandGAN, which synthesizes desired landscape images from a depth map, which has higher expressive power. Our StyLandGAN is extended from an unconditional generation model to accept input conditions. We also propose a '2-phase inference' pipeline which generates diverse depth maps and shifts local parts so that the user's intent is easily reflected. For comparison, we modified existing semantic image synthesis models to accept a depth map as well. Experimental results show that our method is superior to existing methods in quality, diversity, and depth accuracy.  ( 2 min )
    KASAM: Spline Additive Models for Function Approximation. (arXiv:2205.06376v1 [cs.LG])
    Neural networks have been criticised for their inability to perform continual learning due to catastrophic forgetting and rapid unlearning of a past concept when a new concept is introduced. Catastrophic forgetting can be alleviated by specifically designed models and training techniques. This paper outlines a novel Spline Additive Model (SAM). SAM exhibits intrinsic memory retention with sufficient expressive power for many practical tasks, but is not a universal function approximator. SAM is extended with the Kolmogorov-Arnold representation theorem to a novel universal function approximator, called the Kolmogorov-Arnold Spline Additive Model - KASAM. The memory retention, expressive power and limitations of SAM and KASAM are illustrated analytically and empirically. SAM exhibited robust but imperfect memory retention, with small regions of overlapping interference in sequential learning tasks. KASAM exhibited greater susceptibility to catastrophic forgetting. KASAM in combination with pseudo-rehearsal training techniques exhibited superior performance in regression tasks and memory retention.  ( 2 min )
    Interpretable Climate Change Modeling With Progressive Cascade Networks. (arXiv:2205.06351v1 [cs.LG])
    Typical deep learning approaches to modeling high-dimensional data often result in complex models that do not easily reveal a new understanding of the data. Research in the deep learning field is very actively pursuing new methods to interpret deep neural networks and to reduce their complexity. An approach is described here that starts with linear models and incrementally adds complexity only as supported by the data. An application is shown in which models that map global temperature and precipitation to years are trained to investigate patterns associated with changes in climate.  ( 2 min )
    A hybrid data driven-physics constrained Gaussian process regression framework with deep kernel for uncertainty quantification. (arXiv:2205.06494v1 [cs.LG])
    Gaussian process regression (GPR) has been a well-known machine learning method for various applications such as uncertainty quantification (UQ). However, GPR is inherently data-driven and requires a sufficiently large dataset. If appropriate physics constraints (e.g., those expressed in partial differential equations) can be incorporated, the amount of data can be greatly reduced and the accuracy further improved. In this work, we propose a hybrid data-driven, physics-constrained Gaussian process regression framework. We encode the physics knowledge with a Boltzmann-Gibbs distribution and derive our model through a maximum likelihood (ML) approach. We apply a deep kernel learning method, so the proposed model learns from both data and physics constraints through the training of a deep neural network, which serves as part of the covariance function in GPR. The proposed model achieves good results on high-dimensional problems and correctly propagates the uncertainty, even with very limited labelled data.  ( 2 min )
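    The deep-kernel ingredient mentioned above can be sketched as a neural network that warps the inputs before a standard RBF kernel, with GP regression then using the usual closed form; the physics/Boltzmann-Gibbs constraint of the paper is not reproduced, and all sizes are illustrative.

```python
# Hedged PyTorch sketch of a deep kernel for GP regression (training of the
# feature network is omitted; sizes and the toy data are illustrative).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 4))  # feature map

def deep_rbf(Xa, Xb, ls=1.0):
    Fa, Fb = net(Xa), net(Xb)
    d2 = torch.cdist(Fa, Fb) ** 2          # squared distances in feature space
    return torch.exp(-0.5 * d2 / ls**2)

X = torch.linspace(-3, 3, 40).unsqueeze(1)
y = torch.sin(X).squeeze() + 0.05 * torch.randn(40)

noise = 0.05 ** 2
K = deep_rbf(X, X) + noise * torch.eye(40)
Xs = torch.linspace(-3, 3, 200).unsqueeze(1)
mean = deep_rbf(Xs, X) @ torch.linalg.solve(K, y)   # GP posterior mean
```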
    Warm-starting DARTS using meta-learning. (arXiv:2205.06355v1 [cs.LG])
    Neural architecture search (NAS) has shown great promise in the field of automated machine learning (AutoML). NAS has outperformed hand-designed networks and made a significant step forward in automating the design of deep neural networks, further reducing the need for human expertise. However, most research targets a single specific task, leaving research on NAS methods across multiple tasks mostly overlooked. Generally, there are two popular ways to find an architecture for a novel task: searching from scratch, which is inefficient by design, or transferring architectures discovered on other tasks, which provides no performance guarantees and is probably suboptimal. In this work, we present a meta-learning framework to warm-start Differentiable Architecture Search (DARTS), a NAS method that can be initialized with a transferred architecture and can quickly adapt to new tasks. A task similarity measure is used to determine which transfer architecture is selected, as transfer architectures found on similar tasks will likely perform better. Additionally, we employ a simple meta-transfer architecture that was learned over multiple tasks. Experiments show that warm-started DARTS is able to find competitive architectures while reducing search costs on average by 60%.  ( 2 min )
    How to Combine Membership-Inference Attacks on Multiple Updated Models. (arXiv:2205.06369v1 [cs.LG])
    A large body of research has shown that machine learning models are vulnerable to membership inference (MI) attacks that violate the privacy of the participants in the training data. Most MI research focuses on the case of a single standalone model, while production machine-learning platforms often update models over time, on data that often shifts in distribution, giving the attacker more information. This paper proposes new attacks that take advantage of one or more model updates to improve MI. A key part of our approach is to leverage rich information from standalone MI attacks mounted separately against the original and updated models, and to combine this information in specific ways to improve attack effectiveness. We propose a set of combination functions and tuning methods for each, and present both analytical and quantitative justification for various options. Our results on four public datasets show that our attacks are effective at using update information to give the adversary a significant advantage over attacks on standalone models, but also compared to a prior MI attack that takes advantage of model updates in a related machine-unlearning setting. We perform the first measurements of the impact of distribution shift on MI attacks with model updates, and show that a more drastic distribution shift results in significantly higher MI risk than a gradual shift. Our code is available at https://www.github.com/stanleykywu/model-updates.  ( 2 min )
    Using Natural Sentences for Understanding Biases in Language Models. (arXiv:2205.06303v1 [cs.CL])
    Evaluation of biases in language models is often limited to synthetically generated datasets. This dependence traces back to the need for a prompt-style dataset to trigger specific behaviors of language models. In this paper, we address this gap by creating a prompt dataset with respect to occupations collected from real-world natural sentences present in Wikipedia. We aim to understand the differences between using template-based prompts and natural sentence prompts when studying gender-occupation biases in language models. We find bias evaluations are very sensitive to the design choices of template prompts, and we propose using natural sentence prompts for systematic evaluations to step away from design choices that could introduce bias in the observations.  ( 2 min )
    Adaptive Block Floating-Point for Analog Deep Learning Hardware. (arXiv:2205.06287v1 [cs.LG])
    Analog mixed-signal (AMS) devices promise faster, more energy-efficient deep neural network (DNN) inference than their digital counterparts. However, recent studies show that DNNs on AMS devices with fixed-point numbers can incur an accuracy penalty because of precision loss. To mitigate this penalty, we present a novel AMS-compatible adaptive block floating-point (ABFP) number representation. We also introduce amplification (or gain) as a method for increasing the accuracy of the number representation without increasing the bit precision of the output. We evaluate the effectiveness of ABFP on the DNNs in the MLPerf datacenter inference benchmark -- realizing less than $1\%$ loss in accuracy compared to FLOAT32. We also propose a novel method of finetuning for AMS devices, Differential Noise Finetuning (DNF), which samples device noise to speed up finetuning compared to conventional Quantization-Aware Training.  ( 2 min )
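    The shared-exponent format underlying ABFP can be sketched as follows: each block of values shares one exponent and the mantissas are rounded to a fixed bit width. The adaptive and gain aspects of ABFP are not reproduced; block size and mantissa width below are illustrative assumptions.

```python
# Hedged numpy sketch of block floating-point quantization (shared exponent
# per block); not the full ABFP scheme with amplification.
import numpy as np

def block_float_quantize(x, block=16, mant_bits=7):
    x = x.reshape(-1, block)
    exp = np.ceil(np.log2(np.max(np.abs(x), axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** exp                              # one shared exponent per block
    q = np.round(x / scale * (2 ** mant_bits)) / (2 ** mant_bits)
    return (q * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
w_q = block_float_quantize(w)
print("max abs error:", np.max(np.abs(w - w_q)))
```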
    $\alpha$-GAN: Convergence and Estimation Guarantees. (arXiv:2205.06393v1 [cs.LG])
    We prove a two-way correspondence between the min-max optimization of general CPE loss function GANs and the minimization of associated $f$-divergences. We then focus on $\alpha$-GAN, defined via the $\alpha$-loss, which interpolates several GANs (Hellinger, vanilla, Total Variation) and corresponds to the minimization of the Arimoto divergence. We show that the Arimoto divergences induced by $\alpha$-GAN equivalently converge, for all $\alpha\in \mathbb{R}_{>0}\cup\{\infty\}$. However, under restricted learning models and finite samples, we provide estimation bounds which indicate diverse GAN behavior as a function of $\alpha$. Finally, we present empirical results on a toy dataset that highlight the practical utility of tuning the $\alpha$ hyperparameter.  ( 2 min )
    Integrating User and Item Reviews in Deep Cooperative Neural Networks for Movie Recommendation. (arXiv:2205.06296v1 [cs.IR])
    User reviews contain a significant quantity of information across online platforms. This information source has been neglected by most existing recommendation systems, despite its potential to ease the sparsity issue and enhance the quality of suggestions. This work presents a deep model for concurrently learning item attributes and user behaviour from review text. The suggested model, Deep Cooperative Neural Networks (DeepCoNN), consists of two parallel neural networks coupled in their final layers. One network focuses on learning user behaviour from the reviews submitted by the user, while the other learns item attributes from the reviews written about the item. A shared layer is added on top to connect the two networks. Similar to factorization machine approaches, the shared layer allows the latent factors learned for users and items to interact with each other. Experimental findings show that DeepCoNN surpasses all baseline recommendation systems on a number of datasets.  ( 2 min )
    Collaborative Multi-agent Stochastic Linear Bandits. (arXiv:2205.06331v1 [cs.LG])
    We study a collaborative multi-agent stochastic linear bandit setting, where $N$ agents that form a network communicate locally to minimize their overall regret. In this setting, each agent has its own linear bandit problem (its own reward parameter) and the goal is to select the best global action w.r.t. the average of their reward parameters. At each round, each agent proposes an action, and one action is randomly selected and played as the network action. All the agents observe the corresponding rewards of the played actions and use an accelerated consensus procedure to compute an estimate of the average of the rewards obtained by all the agents. We propose a distributed upper confidence bound (UCB) algorithm and prove a high probability bound on its $T$-round regret in which we include a linear growth of regret associated with each communication round. Our regret bound is of order $\mathcal{O}\Big(\sqrt{\frac{T}{N \log(1/|\lambda_2|)}}\cdot (\log T)^2\Big)$, where $\lambda_2$ is the second largest (in absolute value) eigenvalue of the communication matrix.  ( 2 min )
    Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations. (arXiv:2205.06333v1 [cs.RO])
    Perceptual understanding of the scene and the relationships between its different components is important for the successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most current methodologies learn task-specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervised methods require large labeled datasets for each task, which are expensive to collect in the real world. Using self-supervised learning to obtain representations from unlabeled data can mitigate this problem. However, current self-supervised representation learning methods are mostly object-agnostic, and we demonstrate that the resulting representations are insufficient for general-purpose robotics tasks as they fail to capture the complexity of scenes with many components. In this paper, we explore the effectiveness of object-aware representation learning techniques for robotic tasks. Our self-supervised representations are learned by observing the agent freely interacting with different parts of the environment and are queried in two different settings: (i) policy learning and (ii) object location prediction. We show that our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques as well as methods trained on raw RGB images. Our results show a 20 percent increase in performance in low-data regimes (1000 trajectories) in policy training using implicit behavioral cloning (IBC). Furthermore, our method outperforms the baselines on the task of object localization in multi-object scenes.  ( 2 min )
    Multi-Environment Meta-Learning in Stochastic Linear Bandits. (arXiv:2205.06326v1 [cs.LG])
    In this work we investigate meta-learning (or learning-to-learn) approaches in multi-task linear stochastic bandit problems that can originate from multiple environments. Inspired by the work of [1] on meta-learning in a sequence of linear bandit problems whose parameters are sampled from a single distribution (i.e., a single environment), here we consider the feasibility of meta-learning when task parameters are drawn from a mixture distribution instead. For this problem, we propose a regularized version of the OFUL algorithm that, when trained on tasks with labeled environments, achieves low regret on a new task without requiring knowledge of the environment from which the new task originates. Specifically, our regret bound for the new algorithm captures the effect of environment misclassification and highlights the benefits over learning each task separately or meta-learning without recognition of the distinct mixture components.  ( 2 min )
    Detailed Balanced Chemical Reaction Networks as Generalized Boltzmann Machines. (arXiv:2205.06313v1 [q-bio.MN])
    Can a micron sized sack of interacting molecules understand, and adapt to a constantly-fluctuating environment? Cellular life provides an existence proof in the affirmative, but the principles that allow for life's existence are far from being proven. One challenge in engineering and understanding biochemical computation is the intrinsic noise due to chemical fluctuations. In this paper, we draw insights from machine learning theory, chemical reaction network theory, and statistical physics to show that the broad and biologically relevant class of detailed balanced chemical reaction networks is capable of representing and conditioning complex distributions. These results illustrate how a biochemical computer can use intrinsic chemical noise to perform complex computations. Furthermore, we use our explicit physical model to derive thermodynamic costs of inference.  ( 2 min )
    Design and Implementation of a Quantum Kernel for Natural Language Processing. (arXiv:2205.06409v1 [cs.CL])
    Natural language processing (NLP) is the field that attempts to make human language accessible to computers, and it relies on applying a mathematical model to express the meaning of symbolic language. One such model, DisCoCat, defines how to express both the meaning of individual words and their compositional nature. This model can be naturally implemented on quantum computers, leading to the field of quantum NLP (QNLP). Recent experimental work used quantum machine learning techniques to map from text to class label using the expectation value of the quantum-encoded sentence. Theoretical work has been done on computing the similarity of sentences but relies on an unrealized quantum memory store. The main goal of this thesis is to leverage the DisCoCat model to design a quantum-based kernel function that can be used by a support vector machine (SVM) for NLP tasks. Two similarity measures were studied: (i) the transition amplitude approach and (ii) the SWAP test. A simple NLP meaning classification task from previous work was used to train the word embeddings and evaluate the performance of both models. The Python module lambeq and its related software stack were used for implementation. The explicit model from previous work was used to train word embeddings and achieved a testing accuracy of $93.09 \pm 0.01$%. It was shown that both SVM variants achieved a higher testing accuracy: $95.72 \pm 0.01$% for approach (i) and $97.14 \pm 0.01$% for (ii). The SWAP test was then simulated under a noise model defined by the real quantum device, ibmq_guadalupe. The explicit model achieved an accuracy of $91.94 \pm 0.01$% while the SWAP test SVM achieved 96.7% on the testing dataset, suggesting that the kernelized classifiers are resilient to noise. These are encouraging results and motivate further investigations of our proposed kernelized QNLP paradigm.  ( 2 min )
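    On classical statevectors, the SWAP-test kernel above reduces to the squared overlap of two sentence states, which is what the SVM's kernel matrix would consume; the sketch below computes it directly and does not reproduce the circuit simulation or the lambeq pipeline.

```python
# Hedged numpy sketch: the SWAP test on hardware returns P(0) = (1 + |<psi|phi>|^2)/2,
# so the kernel value is recovered as 2*P(0) - 1. Here we compute the overlap exactly.
import numpy as np

def swap_test_kernel(psi, phi):
    overlap = np.abs(np.vdot(psi, phi)) ** 2
    p0 = 0.5 * (1.0 + overlap)      # ideal SWAP-test measurement statistic
    return 2.0 * p0 - 1.0           # kernel value in [0, 1]

def normalize(v):
    return v / np.linalg.norm(v)

psi = normalize(np.random.randn(4) + 1j * np.random.randn(4))  # 2-qubit sentence state
phi = normalize(np.random.randn(4) + 1j * np.random.randn(4))
print(swap_test_kernel(psi, phi))   # entry of an SVM kernel matrix
```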
    Improving Sequential Query Recommendation with Immediate User Feedback. (arXiv:2205.06297v1 [cs.IR])
    We propose an algorithm for next query recommendation in interactive data exploration settings, like knowledge discovery for information gathering. The state-of-the-art query recommendation algorithms are based on sequence-to-sequence learning approaches that exploit historical interaction data. We propose to augment the transformer-based causal language models for query recommendations to adapt to the immediate user feedback using multi-armed bandit (MAB) framework. We conduct a large-scale experimental study using log files from a popular online literature discovery service and demonstrate that our algorithm improves the cumulative regret substantially, with respect to the state-of-the-art transformer-based query recommendation models, which do not make use of the immediate user feedback. Our data model and source code are available at ~\url{https://anonymous.4open.science/r/exp3_ss-9985/}.  ( 2 min )
    Modularity in NEAT Reinforcement Learning Networks. (arXiv:2205.06451v1 [cs.NE])
    Modularity is essential to many well-performing structured systems, as it is a useful means of managing complexity [8]. An analysis of modularity in neural networks produced by machine learning algorithms can offer valuable insight into the workings of such algorithms and how modularity can be leveraged to improve performance. However, this property is often overlooked in the neuroevolutionary literature, so the modular nature of many learning algorithms is unknown. This property was assessed on the popular algorithm "NeuroEvolution of Augmenting Topologies" (NEAT) for standard simulation benchmark control problems due to NEAT's ability to optimise network topology. This paper shows that NEAT networks seem to rapidly increase in modularity over time with the rate and convergence dependent on the problem. Interestingly, NEAT tends towards increasingly modular networks even when network fitness converges. It was shown that the ideal level of network modularity in the explored parameter space is highly dependent on other network variables, dispelling theories that modularity has a straightforward relationship to network performance. This is further proven in this paper by demonstrating that rewarding modularity directly did not improve fitness.  ( 2 min )
  • Open

    On the Existence of Simpler Machine Learning Models. (arXiv:1908.01755v4 [cs.LG] UPDATED)
    It is almost always easier to find an accurate-but-complex model than an accurate-yet-simple model. Finding optimal, sparse, accurate models of various forms (linear models with integer coefficients, decision sets, rule lists, decision trees) is generally NP-hard. We often do not know whether the search for a simpler model will be worthwhile, and thus we do not go to the trouble of searching for one. In this work, we ask an important practical question: can accurate-yet-simple models be proven to exist, or shown likely to exist, before explicitly searching for them? We hypothesize that there is an important reason that simple-yet-accurate models often do exist. This hypothesis is that the size of the Rashomon set is often large, where the Rashomon set is the set of almost-equally-accurate models from a function class. If the Rashomon set is large, it contains numerous accurate models, and perhaps at least one of them is the simple model we desire. In this work, we formally present the Rashomon ratio as a new gauge of simplicity for a learning problem, depending on a function class and a data set. The Rashomon ratio is the ratio of the volume of the set of accurate models to the volume of the hypothesis space, and it is different from standard complexity measures from statistical learning theory. Insight from studying the Rashomon ratio provides an easy way to check whether a simpler model might exist for a problem before finding it, namely whether several different machine learning methods achieve similar performance on the data. In that sense, the Rashomon ratio is a powerful tool for understanding why and when an accurate-yet-simple model might exist. If, as we hypothesize in this work, many real-world data sets admit large Rashomon sets, the implications are vast: it means that simple or interpretable models may often be used for high-stakes decisions without losing accuracy.
    An Equivalence Principle for the Spectrum of Random Inner-Product Kernel Matrices. (arXiv:2205.06308v1 [math.PR])
    We consider random matrices whose entries are obtained by applying a (nonlinear) kernel function to the pairwise inner products between $n$ independent data vectors drawn uniformly from the unit sphere in $\mathbb{R}^d$. Our study of this model is motivated by problems in machine learning, statistics, and signal processing, where such inner-product kernel random matrices and their spectral properties play important roles. Under mild conditions on the kernel function, we establish the weak limit of the empirical spectral distribution of these matrices when $d, n \to \infty$ such that $n / d^\ell \to \kappa \in (0, \infty)$, for some fixed $\ell \in \mathbb{N}$. This generalizes an earlier result of Cheng and Singer (2013), who studied the same model in the linear scaling regime (with $\ell = 1$ and $n/d \to \kappa$). The main insight of our work is a general equivalence principle: the spectrum of the random kernel matrix is asymptotically equivalent to that of a simpler matrix model, constructed as the linear combination of a (shifted) Wishart matrix and an independent matrix drawn from the Gaussian orthogonal ensemble. The aspect ratio of the Wishart matrix and the coefficients of the linear combination are determined by $\ell$ and by the expansion of the kernel function in the orthogonal Hermite polynomial basis. Consequently, the limiting spectrum of the random kernel matrix can be characterized as the free additive convolution between a Marchenko-Pastur law and a semicircle law.  ( 2 min )
    Heavy-Tail Phenomenon in Decentralized SGD. (arXiv:2205.06689v1 [stat.ML])
    Recent theoretical studies have shown that heavy-tails can emerge in stochastic optimization due to `multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy-tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To have a more explicit control on the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes. Then, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (stepsize and network size), where DE-SGD can have lighter or heavier tails than disconnected SGD, depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks.
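    The setting studied above can be sketched as decentralized SGD on a ring of nodes for linear regression with Gaussian data: each node averages with its neighbors via a doubly stochastic mixing matrix and then takes a local stochastic gradient step. Sizes and the step size below are illustrative assumptions.

```python
# Hedged numpy sketch of DE-SGD on a ring topology; tail-index estimation of
# the iterates' fluctuations (as in the paper) is not shown.
import numpy as np

n_nodes, d, lr, batch = 8, 5, 0.05, 2
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):                       # doubly stochastic ring mixing
    W[i, i] = 0.5
    W[i, (i - 1) % n_nodes] = 0.25
    W[i, (i + 1) % n_nodes] = 0.25

theta_star = np.random.randn(d)
theta = np.zeros((n_nodes, d))
for _ in range(2000):
    theta = W @ theta                          # consensus (gossip) step
    for i in range(n_nodes):
        X = np.random.randn(batch, d)          # Gaussian data -> multiplicative noise
        y = X @ theta_star + 0.1 * np.random.randn(batch)
        grad = X.T @ (X @ theta[i] - y) / batch
        theta[i] -= lr * grad                  # local stochastic gradient step
```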
    Probabilistic Estimation of Chirp Instantaneous Frequency Using Gaussian Processes. (arXiv:2205.06306v1 [stat.ML])
We present a probabilistic approach for estimating a chirp signal and its instantaneous frequency function when the true forms of the chirp and instantaneous frequency are unknown. To do so, we represent them by joint cascading Gaussian processes governed by a non-linear stochastic differential equation, and estimate their posterior distribution by using stochastic filters and smoothers. The model parameters are determined via maximum likelihood estimation. Theoretical results show that the estimation method has a bounded mean squared error. Experiments show that the method outperforms a number of baseline methods on a synthetic model, and we also apply the method to analyse gravitational wave data.  ( 2 min )
    Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions. (arXiv:2205.06811v1 [cs.LG])
    We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary, and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms in this setting are limited in that they either are computationally inefficient or require a strong assumption on the corruption, or their regret is at least $C$ times worse than the regret without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face of uncertainty. At the core of our algorithm is a weighted ridge regression where the weight of each chosen action depends on its confidence up to some threshold. We show that for both known $C$ and unknown $C$ cases, our algorithm with proper choice of hyperparameter achieves a regret that nearly matches the lower bounds. Thus, our algorithm is nearly optimal up to logarithmic factors for both cases. Notably, our algorithm achieves the near-optimal regret for both corrupted and uncorrupted cases ($C=0$) simultaneously.
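The core ingredient of the algorithm is easy to prototype. A hedged sketch of weighted ridge regression with confidence-capped weights, where the residual-based weighting rule is an illustrative stand-in for the paper's exact choice:

```python
import numpy as np

def weighted_ridge(X, y, w, lam=1.0):
    # theta = (X^T diag(w) X + lam I)^{-1} X^T diag(w) y
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X + lam * np.eye(X.shape[1]), Xw.T @ y)

rng = np.random.default_rng(3)
n, d = 500, 6
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.1 * rng.normal(size=n)
y[:25] += 5.0                              # a few adversarially corrupted rewards

# Down-weight rounds whose rewards look unreliable, capping each weight at 1.
resid = np.abs(y - X @ weighted_ridge(X, y, np.ones(n)))
w = np.minimum(1.0, 1.0 / np.maximum(resid, 1e-9))
theta_hat = weighted_ridge(X, y, w)
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```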
    Fast Conditional Network Compression Using Bayesian HyperNetworks. (arXiv:2205.06404v1 [cs.LG])
    We introduce a conditional compression problem and propose a fast framework for tackling it. The problem is how to quickly compress a pretrained large neural network into optimal smaller networks given target contexts, e.g. a context involving only a subset of classes or a context where only limited compute resource is available. To solve this, we propose an efficient Bayesian framework to compress a given large network into much smaller size tailored to meet each contextual requirement. We employ a hypernetwork to parameterize the posterior distribution of weights given conditional inputs and minimize a variational objective of this Bayesian neural network. To further reduce the network sizes, we propose a new input-output group sparsity factorization of weights to encourage more sparseness in the generated weights. Our methods can quickly generate compressed networks with significantly smaller sizes than baseline methods.  ( 2 min )
    Weak consistency of the 1-nearest neighbor measure with applications to missing data. (arXiv:1902.02408v3 [math.ST] UPDATED)
    When data is partially missing at random, imputation and importance weighting are often used to estimate moments of the unobserved population. In this paper, we study 1-nearest neighbor (1NN) importance weighting, which estimates moments by replacing missing data with the complete data that is the nearest neighbor in the non-missing covariate space. We define an empirical measure, the 1NN measure, and show that it is weakly consistent for the measure of the missing data. The main idea behind this result is that the 1NN measure is performing inverse probability weighting in the limit. We study applications to missing data and mitigating the impact of covariate shift in prediction tasks.
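A minimal sketch of the estimator on synthetic data with covariate-dependent missingness (the data-generating process here is an arbitrary illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)

# Covariates X are always observed; the outcome Y is missing at random given X.
n = 5000
X = rng.normal(size=(n, 2))
Y = X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)
observed = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))   # missingness depends on X

# 1NN imputation: replace each missing Y with the Y of the nearest complete
# case in covariate space, then take the plain average.
nn = NearestNeighbors(n_neighbors=1).fit(X[observed])
_, idx = nn.kneighbors(X[~observed])
Y_hat = Y.copy()
Y_hat[~observed] = Y[observed][idx[:, 0]]

print("1NN estimate of E[Y]:", Y_hat.mean())
print("complete-case mean (biased):", Y[observed].mean())
print("oracle mean:", Y.mean())
```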
    Robust and Heterogenous Odds Ratio: Estimating Price Sensitivity for Unbought Items. (arXiv:2106.11389v2 [stat.ME] UPDATED)
    Problem definition: Mining for heterogeneous responses to an intervention is a crucial step for data-driven operations, for instance to personalize treatment or pricing. We investigate how to estimate price sensitivity from transaction-level data. In causal inference terms, we estimate heterogeneous treatment effects when (a) the response to treatment (here, whether a customer buys a product) is binary, and (b) treatment assignments are partially observed (here, full information is only available for purchased items). Methodology/Results: We propose a recursive partitioning procedure to estimate heterogeneous odds ratio, a widely used measure of treatment effect in medicine and social sciences. We integrate an adversarial imputation step to allow for robust estimation even in presence of partially observed treatment assignments. We validate our methodology on synthetic data and apply it to three case studies from political science, medicine, and revenue management. Managerial Implications: Our robust heterogeneous odds ratio estimation method is a simple and intuitive tool to quantify heterogeneity in patients or customers and personalize interventions, while lifting a central limitation in many revenue management data.
    secml: A Python Library for Secure and Explainable Machine Learning. (arXiv:1912.10013v2 [cs.LG] UPDATED)
    We present \texttt{secml}, an open-source Python library for secure and explainable machine learning. It implements the most popular attacks against machine learning, including test-time evasion attacks to generate adversarial examples against deep neural networks and training-time poisoning attacks against support vector machines and many other algorithms. These attacks enable evaluating the security of learning algorithms and the corresponding defenses under both white-box and black-box threat models. To this end, \texttt{secml} provides built-in functions to compute security evaluation curves, showing how quickly classification performance decreases against increasing adversarial perturbations of the input data. \texttt{secml} also includes explainability methods to help understand why adversarial attacks succeed against a given model, by visualizing the most influential features and training prototypes contributing to each decision. It is distributed under the Apache License 2.0 and hosted at \url{https://github.com/pralab/secml}.
    The interplay between ranking and communities in networks. (arXiv:2112.12670v2 [cs.SI] UPDATED)
    Community detection and hierarchy extraction are usually thought of as separate inference tasks on networks. Considering only one of the two when studying real-world data can be an oversimplification. In this work, we present a generative model based on an interplay between community and hierarchical structures. It assumes that each node has a preference in the interaction mechanism and nodes with the same preference are more likely to interact, while heterogeneous interactions are still allowed. The sparsity of the network is exploited for implementing a more efficient algorithm. We demonstrate our method on synthetic and real-world data and compare performance with two standard approaches for community detection and ranking extraction. We find that the algorithm accurately retrieves the overall node's preference in different scenarios, and we show that it can distinguish small subsets of nodes that behave differently than the majority. As a consequence, the model can recognize whether a network has an overall preferred interaction mechanism. This is relevant in situations where there is no clear "a priori" information about what structure explains the observed network datasets well. Our model allows practitioners to learn this automatically from the data.
    Anomaly Detection using Principles of Human Perception. (arXiv:2103.12323v4 [cs.CR] UPDATED)
    In the fields of statistics and unsupervised machine learning a fundamental and well-studied problem is anomaly detection. Anomalies are difficult to define, yet many algorithms have been proposed. Underlying the approaches is the nebulous understanding that anomalies are rare, unusual or inconsistent with the majority of data. The present work provides a philosophical treatise to clearly define anomalies and develops an algorithm for their efficient detection with minimal user intervention. Inspired by the Gestalt School of Psychology and the Helmholtz principle of human perception, anomalies are assumed to be observations that are unexpected to occur with respect to certain groupings made by the majority of the data. Under appropriate random variable modelling anomalies are directly found in a set of data by a uniform and independent random assumption of the distribution of constituent elements of the observations, with anomalies corresponding to those observations where the expectation of the number of occurrences of the elements in a given view is $<1$. Starting from fundamental principles of human perception an unsupervised anomaly detection algorithm is developed that is simple, real-time and parameter-free. Experiments suggest it as a competing choice for univariate data with promising results on the detection of global anomalies in multivariate data.
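The expected-occurrences test is simple to sketch. A loose toy rendering on discrete symbol strings; the element-frequency null model below is an illustrative simplification of the paper's uniform-and-independent assumption, not its algorithm:

```python
from collections import Counter

# Flag an observation as anomalous when the expected number of occurrences of
# its element combination, under an independence assumption on the elements,
# is below 1 in a dataset of this size.
observations = ["ab", "ab", "ba", "ab", "ba", "xz"]   # "xz" is the oddball
counts = Counter(ch for obs in observations for ch in obs)
total = sum(counts.values())
N = len(observations)

for obs in sorted(set(observations)):
    p = 1.0
    for ch in obs:
        p *= counts[ch] / total           # independence assumption
    expected = N * p                      # E[#occurrences] under the null
    flag = "ANOMALY" if expected < 1 else "normal"
    print(f"{obs}: E[#occurrences] = {expected:.3f} -> {flag}")
```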
    $\alpha$-GAN: Convergence and Estimation Guarantees. (arXiv:2205.06393v1 [cs.LG])
    We prove a two-way correspondence between the min-max optimization of general CPE loss function GANs and the minimization of associated $f$-divergences. We then focus on $\alpha$-GAN, defined via the $\alpha$-loss, which interpolates several GANs (Hellinger, vanilla, Total Variation) and corresponds to the minimization of the Arimoto divergence. We show that the Arimoto divergences induced by $\alpha$-GAN equivalently converge, for all $\alpha\in \mathbb{R}_{>0}\cup\{\infty\}$. However, under restricted learning models and finite samples, we provide estimation bounds which indicate diverse GAN behavior as a function of $\alpha$. Finally, we present empirical results on a toy dataset that highlight the practical utility of tuning the $\alpha$ hyperparameter.
    Differentiable Graph Module (DGM) for Graph Convolutional Networks. (arXiv:2002.04999v4 [cs.LG] UPDATED)
    Graph deep learning has recently emerged as a powerful ML concept allowing to generalize successful deep neural architectures to non-Euclidean structured data. Such methods have shown promising results on a broad spectrum of applications ranging from social science, biomedicine, and particle physics to computer vision, graphics, and chemistry. One of the limitations of the majority of current graph neural network architectures is that they are often restricted to the transductive setting and rely on the assumption that the underlying graph is {\em known} and {\em fixed}. Often, this assumption is not true since the graph may be noisy, or partially and even completely unknown. In such cases, it would be helpful to infer the graph directly from the data, especially in inductive settings where some nodes were not present in the graph at training time. Furthermore, learning a graph may become an end in itself, as the inferred structure may provide complementary insights next to the downstream task. In this paper, we introduce Differentiable Graph Module (DGM), a learnable function that predicts edge probabilities in the graph which are optimal for the downstream task. DGM can be combined with convolutional graph neural network layers and trained in an end-to-end fashion. We provide an extensive evaluation of applications from the domains of healthcare (disease prediction), brain imaging (age prediction), computer graphics (3D point cloud segmentation), and computer vision (zero-shot learning). We show that our model provides a significant improvement over baselines both in transductive and inductive settings and achieves state-of-the-art results.
    Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets. (arXiv:2205.06595v1 [stat.ML])
    Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
    Variational Hyper-Encoding Networks. (arXiv:2005.08482v2 [stat.ML] UPDATED)
We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters $\theta$ are drawn from a distribution $p(\theta)$ which is modeled by a hyper-level VAE. We propose a variational inference using Gaussian mixture models to implicitly encode the parameters $\theta$ into a low dimensional Gaussian distribution. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution $q(\theta)$. HyperVAE can encode the parameters $\theta$ in full, in contrast to common hyper-network practices, which generate only the scale and bias vectors as target-network parameters. Thus HyperVAE preserves much more information about the model for each task in the latent space. We discuss HyperVAE using the minimum description length (MDL) principle and show that it helps HyperVAE to generalize. We evaluate HyperVAE in density estimation tasks, outlier detection and discovery of novel design classes, demonstrating its efficacy.
    Multiple Domain Causal Networks. (arXiv:2205.06791v1 [stat.ML])
    Observational studies are regarded as economic alternatives to randomized trials, often used in their stead to investigate and determine treatment efficacy. Due to lack of sample size, observational studies commonly combine data from multiple sources or different sites/centers. Despite the benefits of an increased sample size, a naive combination of multicenter data may result in incongruities stemming from center-specific protocols for generating cohorts or reactions towards treatments distinct to a given center, among other things. These issues arise in a variety of other contexts, including capturing a treatment effect related to an individual's unique biological characteristics. Existing methods for estimating heterogeneous treatment effects have not adequately addressed the multicenter context, but rather treat it simply as a means to obtain sufficient sample size. Additionally, previous approaches to estimating treatment effects do not straightforwardly generalize to the multicenter design, especially when required to provide treatment insights for patients from a new, unobserved center. To address these shortcomings, we propose Multiple Domain Causal Networks (MDCN), an approach that simultaneously strengthens the information sharing between similar centers while addressing the selection bias in treatment assignment through learning of a new feature embedding. In empirical evaluations, MDCN is consistently more accurate when estimating the heterogeneous treatment effect in new centers compared to benchmarks that adjust solely based on treatment imbalance or general center differences. Finally, we justify our approach by providing theoretical analyses that demonstrate that MDCN improves on the generalization bound of the new, unobserved target center.
    Clustering with missing data: which equivalent for Rubin's rules?. (arXiv:2011.13694v2 [stat.ME] UPDATED)
Multiple imputation (MI) is a popular method for dealing with missing values. However, the suitable way for applying clustering after MI remains unclear: how to pool partitions? How to assess the clustering instability when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. The problem of partition pooling is here addressed using consensus clustering while, based on the bootstrap theory, we explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and instability assessment are theoretically argued and extensively studied by simulation. Partition pooling improves accuracy, while measuring instability with missing data enlarges the data analysis possibilities: it allows assessment of the dependence of the clustering on the imputation model, as well as a convenient way for choosing the number of clusters when data are incomplete, as illustrated on a real data set.
    The Design Space of E(3)-Equivariant Atom-Centered Interatomic Potentials. (arXiv:2205.06643v1 [stat.ML])
    The rapid progress of machine learning interatomic potentials over the past couple of years produced a number of new architectures. Particularly notable among these are the Atomic Cluster Expansion (ACE), which unified many of the earlier ideas around atom density-based descriptors, and Neural Equivariant Interatomic Potentials (NequIP), a message passing neural network with equivariant features that showed state of the art accuracy. In this work, we construct a mathematical framework that unifies these models: ACE is generalised so that it can be recast as one layer of a multi-layer architecture. From another point of view, the linearised version of NequIP is understood as a particular sparsification of a much larger polynomial model. Our framework also provides a practical tool for systematically probing different choices in the unified design space. We demonstrate this by an ablation study of NequIP via a set of experiments looking at in- and out-of-domain accuracy and smooth extrapolation very far from the training data, and shed some light on which design choices are critical for achieving high accuracy. Finally, we present BOTNet (Body-Ordered-Tensor-Network), a much-simplified version of NequIP, which has an interpretable architecture and maintains accuracy on benchmark datasets.
    Generalized Variational Inference in Function Spaces: Gaussian Measures meet Bayesian Deep Learning. (arXiv:2205.06342v1 [stat.ML])
    We develop a framework for generalized variational inference in infinite-dimensional function spaces and use it to construct a method termed Gaussian Wasserstein inference (GWI). GWI leverages the Wasserstein distance between Gaussian measures on the Hilbert space of square-integrable functions in order to determine a variational posterior using a tractable optimisation criterion and avoids pathologies arising in standard variational function space inference. An exciting application of GWI is the ability to use deep neural networks in the variational parametrisation of GWI, combining their superior predictive performance with the principled uncertainty quantification analogous to that of Gaussian processes. The proposed method obtains state-of-the-art performance on several benchmark datasets.
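The tractability rests on the closed form of the 2-Wasserstein distance between Gaussian measures. A finite-dimensional sketch of that formula (a stand-in for the function-space Gaussians used by GWI):

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m1, C1, m2, C2):
    # ||m1 - m2||^2 + tr(C1 + C2 - 2 (C1^{1/2} C2 C1^{1/2})^{1/2})
    s1 = sqrtm(C1)
    cross = np.real(sqrtm(s1 @ C2 @ s1))
    return float(np.sum((m1 - m2) ** 2) + np.trace(C1 + C2 - 2 * cross))

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 3)); C1 = A @ A.T + np.eye(3)
B = rng.normal(size=(3, 3)); C2 = B @ B.T + np.eye(3)
print(gaussian_w2_squared(np.zeros(3), C1, np.ones(3), C2))
```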
    Explaining by Removing: A Unified Framework for Model Explanation. (arXiv:2011.14878v2 [cs.LG] UPDATED)
    Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We describe a new unified class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature's influence. These methods vary in several respects, so we develop a framework that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 26 existing methods, including several of the most widely used approaches: SHAP, LIME, Meaningful Perturbations, and permutation tests. This newly understood class of explanation methods has rich connections that we examine using tools that have been largely overlooked by the explainability literature. To anchor removal-based explanations in cognitive psychology, we show that feature removal is a simple application of subtractive counterfactual reasoning. Ideas from cooperative game theory shed light on the relationships and trade-offs among different methods, and we derive conditions under which all removal-based explanations have information-theoretic interpretations. Through this analysis, we develop a unified framework that helps practitioners better understand model explanation tools, and that offers a strong theoretical foundation upon which future explainability research can build.
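Permutation tests are among the 26 unified methods and make the removal principle concrete. A hedged sketch in which "removal" is simulated by permuting a feature and influence is summarized as the resulting score drop:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
n = 2000
X = rng.normal(size=(n, 3))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=n)   # feature 2 is irrelevant

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
base = model.score(X, y)
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])         # "remove" feature j
    print(f"feature {j}: importance = {base - model.score(Xp, y):.3f}")
```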
    Improving Sequential Query Recommendation with Immediate User Feedback. (arXiv:2205.06297v1 [cs.IR])
We propose an algorithm for next query recommendation in interactive data exploration settings, like knowledge discovery for information gathering. The state-of-the-art query recommendation algorithms are based on sequence-to-sequence learning approaches that exploit historical interaction data. We propose to augment the transformer-based causal language models for query recommendations to adapt to the immediate user feedback using a multi-armed bandit (MAB) framework. We conduct a large-scale experimental study using log files from a popular online literature discovery service and demonstrate that our algorithm improves the cumulative regret substantially, with respect to the state-of-the-art transformer-based query recommendation models, which do not make use of the immediate user feedback. Our data model and source code are available at \url{https://anonymous.4open.science/r/exp3_ss-9985/}.
    EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits. (arXiv:2110.03177v8 [cs.LG] UPDATED)
    In this paper, we propose a novel neural exploration strategy in contextual bandits, EE-Net, distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combine it with TS or UCB strategies for exploration. However, this line of works explicitly assumes the reward is based on a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural bandit algorithms have been proposed, where a neural network is used to learn the underlying reward function and TS or UCB are adapted for exploration. Instead of calculating a large-deviation based statistical bound for exploration like previous methods, we propose "EE-Net", a novel neural-based exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn potential gains compared to the currently estimated reward for exploration. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net can achieve $\mathcal{O}(\sqrt{T\log T})$ regret and show that EE-Net outperforms existing linear and neural contextual bandit baselines on real-world datasets.
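A toy rendering of the three-network decomposition; note that in the paper the Exploration network consumes a gradient-based transformation of the Exploitation network, whereas this sketch feeds it the raw context purely for brevity:

```python
import torch
import torch.nn as nn

d = 8
exploit = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))  # reward estimate
explore = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))  # potential gain
decide = nn.Linear(2, 1)                                                # decision-maker

def select_arm(arms):                      # arms: (n_arms, d)
    r_hat = exploit(arms)
    gain_hat = explore(arms)               # simplified input; see note above
    scores = decide(torch.cat([r_hat, gain_hat], dim=-1))
    return scores.squeeze(-1).argmax().item()

print(select_arm(torch.randn(10, d)))
```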
    Change-point Detection and Segmentation of Discrete Data using Bayesian Context Trees. (arXiv:2203.04341v2 [stat.ME] UPDATED)
    A new Bayesian modelling framework is introduced for piece-wise homogeneous variable-memory Markov chains, along with a collection of effective algorithmic tools for change-point detection and segmentation of discrete time series. Building on the recently introduced Bayesian Context Trees (BCT) framework, the distributions of different segments in a discrete time series are described as variable-memory Markov chains. Inference for the presence and location of change-points is then performed via Markov chain Monte Carlo sampling. The key observation that facilitates effective sampling is that, using one of the BCT algorithms, the prior predictive likelihood of the data can be computed exactly, integrating out all the models and parameters in each segment. This makes it possible to sample directly from the posterior distribution of the number and location of the change-points, leading to accurate estimates and providing a natural quantitative measure of uncertainty in the results. Estimates of the actual model in each segment can also be obtained, at essentially no additional computational cost. Results on both simulated and real-world data indicate that the proposed methodology performs better than or as well as state-of-the-art techniques.

  • Open

    [D] Multi-modal information sharing strategies when one "modality" is tabular data.
    I'm hunting for literature on architectures for multimodal autoencoders when one "modality" is tabular data. Keyword searches aren't getting me what I want though. I have a mixture of sequential, audio, image, and tabular data. For the non-tabular modalities it seems like cross attention in the encoder/decoder is the obvious choice, but I'm not sure how to best incorporate the tabular component. Options seem like 1) use a standard ffn on the tabular data and connect it to the shared latent space. But this seems not ideal if there are feature interactions between tabular and other modalities. Or, 2) put tabular data through a ffn and repeat and concatenate it to the input of each other modality. Then add some pooling layers in the decoder then a fc layer for reconstruction of the tabular data. This seems inefficient because you replicate the tabular data across modalities. Is there anything as elegant as cross attention but for cross modal fusion with tabular data? submitted by /u/WigglyHypersurface [link] [comments]  ( 1 min )
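One possible answer, sketched under assumptions (a hypothetical design, not an established reference architecture): embed the tabular row into a handful of learned "tabular tokens" and let each other modality cross-attend to them, which avoids replicating the raw tabular features per modality:

```python
import torch
import torch.nn as nn

class TabularTokens(nn.Module):
    """Project a tabular row into a small set of tokens for cross attention."""
    def __init__(self, n_features, n_tokens=4, d_model=64):
        super().__init__()
        self.proj = nn.Linear(n_features, n_tokens * d_model)
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, x):                           # x: (batch, n_features)
        return self.proj(x).view(-1, self.n_tokens, self.d_model)

tab = TabularTokens(n_features=12)
cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

image_tokens = torch.randn(8, 49, 64)    # e.g. patch tokens from an image encoder
tab_tokens = tab(torch.randn(8, 12))     # tabular row -> 4 tokens
fused, _ = cross_attn(query=image_tokens, key=tab_tokens, value=tab_tokens)
```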
    [D] Any primer on all the major image generators and how they compare to each other?
    It seems every other week another image generator comes out. Mostly countless Dalle versions at this point. ru-Dalle, mini-Dalle, dalle flow, dalle mega. Other than that the original Dalle and dalle 2 are from Open AI and presumably are better than all the others, I know very little about them. Is there any consolidated primer on what are the major/most promising image generators, what makes them unique, what source/interface are available for them and how they stack up against each other, parameters etc? Doesn't need to be superdetailed. Just a short orientation. submitted by /u/cyborgsnowflake [link] [comments]  ( 1 min )
    [D] Is the standard path of a researcher not effective anymore?
Hello All! I am currently a master's student. I want to work as an AI researcher, so the typical path I know of is pursuing a Ph.D. and then trying to join a respected research lab. However, I have stumbled upon a couple of discussions in blog posts, Twitter, and in this subreddit suggesting that the current progress of those labs is more engineering than science; most current research papers, including ones at top conferences, aim at incremental improvement of current methodologies instead of breakthroughs, and without much understanding of why these models work. A reason for this appears to be that research in deep learning is mainly influenced by giant tech companies that favor short-term progress over long-term progress. Another one appears to be the broken incentive structures in research in general, whose bad influence has been amplified in this field, leading researchers to become paper producers rather than progress producers. So, am I missing the big picture? Are there other paths? Are the incentive structures so broken that someone could make better progress by doing research on their own or by joining open labs like ML Collective? submitted by /u/TryingToGeek [link] [comments]  ( 2 min )
    [R] Locating and Editing Factual Knowledge in GPT
    https://i.redd.it/rktep95sdpz81.gif Here's an interesting paper from MIT CSAIL on a method to edit knowledge within language models — sets up a nice framework for knowledge editing and representation, and a reference dataset to test facts within a model. Also has full code & colab notebooks to edit your own facts within GPT: https://rome.baulab.info/ + https://github.com/kmeng01/rome + https://colab.research.google.com/github/kmeng01/rome/blob/main/notebooks/rome.ipynb. Abstract: We investigate the mechanisms underlying factual knowledge recall in autoregressive transformer language models. First, we develop a causal intervention for identifying neuron activations capable of altering a model’s factual predictions. Within large GPT-style models, this reveals two distinct sets of neurons that we hypothesize correspond to knowing an abstract fact and saying a concrete word, respectively. This insight inspires the development of ROME, a novel method for editing facts stored in model weights. For evaluation, we assemble COUNTERFACT, a dataset of over twenty thousand counterfactuals and tools to facilitate sensitive measurements of knowledge editing. Using COUNTERFACT, we confirm the distinction between saying and knowing neurons, and we find that ROME achieves state-of-the-art performance in knowledge editing compared to other methods. An interactive demo notebook, full code implementation, and the dataset are available at https://rome.baulab.info/. submitted by /u/Quantum_Network [link] [comments]  ( 1 min )
    [D] A Brief History & The Current State of Anime Generating AI
    submitted by /u/farfromhome2020 [link] [comments]
    [N] Is Deepmind's "Gato" a precursor for general artificial intelligence? According to Gary Marcus, most certainly not.
    submitted by /u/much_successes [link] [comments]  ( 2 min )
    [D] If random search works so well, why is there no published paper that compares it to RL?
This is not a sarcastic question! I'm reading the excellent series An Outsider's Tour of Reinforcement Learning and was very interested in finding out that simple random search is actually a very good algorithm on a good variety of benchmarks. If simple random search works so well and, being fast, allows us to evaluate performance more thoroughly, why isn't there a more developed line of research exploring its application to optimal control? Why is there no paper saying "hey RL community, this is the new state of the art"? submitted by /u/GorillaWithGroove [link] [comments]  ( 2 min )
    [P] DALL·E Mega Website
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
    [D] Can I re-publish a Medium post later as an arXiv paper?
    Does anyone know about legal implications of either platform? Has anybody done a similar thing, e.g. for better citation? Thanks submitted by /u/ihatebeinganonymous [link] [comments]  ( 1 min )
    [R] Symphony Generation with Permutation Invariant Language Model
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [D] What is the best practice for adding new samples to the training set of a model once a new sample is discovered/acquired?
Let's say I have a trained model for which I used a training set T. If new samples become available, what is the best practice for using this new information to improve the model? My thoughts are: 1) create a new training set T' that contains the new samples and retrain the model 2) fine-tune the trained model by training a few epochs on only the new samples Furthermore, if the newly acquired samples are more important to the task than the ones in the original T, is there a way I can bias the training towards the new samples? submitted by /u/StOchastiC_ [link] [comments]  ( 2 min )
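A hedged sketch of option 2 with a bias toward the new samples via oversampling; the 3x weight, and the choice to mix old samples back in (to guard against forgetting the original set), are illustrative assumptions rather than established best practice:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Stand-in tensors for the original set T and the newly acquired samples.
X_old, y_old = torch.randn(1000, 16), torch.randint(0, 2, (1000,))
X_new, y_new = torch.randn(50, 16), torch.randint(0, 2, (50,))

X = torch.cat([X_old, X_new])
y = torch.cat([y_old, y_new])
weights = torch.cat([torch.ones(len(X_old)), 3.0 * torch.ones(len(X_new))])

sampler = WeightedRandomSampler(weights, num_samples=len(X), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=32, sampler=sampler)
# Fine-tune the existing model for a few epochs on `loader`.
```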
    [R] Are there any publicly available models for non-autoregressive text generation?
I am building a non-autoregressive text generation model. For evaluation I use the recently proposed MAUVE score. The problem is that all previous works use different metrics, such as BLEU/self-BLEU etc. To compare my model with the previous ones I want to evaluate the MAUVE score on their output. But I couldn't find any downloadable trained models for recent papers such as SUNDAE or ScratchGAN. On Hugging Face there are plenty of GPT-like models (autoregressive), but I don't know how to find non-autoregressive ones there. submitted by /u/Tomarchelone [link] [comments]  ( 1 min )
    [D] Open Source library to do automatic EDA + experiment tracking in a spreadsheet
Hi All, We have released a new version of our Python library, VevestaX. The library does automatic EDA and experiment tracking in a spreadsheet. The library can be downloaded using: pip install vevestaX Following is the link to its demo: https://youtu.be/7jmnIOqBpJM Following is the GitHub link: https://github.com/Vevesta/VevestaX/blob/main/README.md Following is the sample output spreadsheet: https://docs.google.com/spreadsheets/d/15lOXzpcUQtkYQAEnx-YTegvg8zCW6pEK/edit?usp=sharing&ouid=103382336064969333270&rtpof=true&sd=true Please give us a GitHub star, it would mean the world to us. Please mail your feature requests to OP at vevestax@vevesta.com submitted by /u/vevesta [link] [comments]  ( 1 min )
  • Open

    Methods to guarantee stability of actor-critic based learning ?
Hi, coming from a control systems background, is there any mathematical trickery to prove/guarantee/99% guarantee the stability of a converged policy? I am thinking of something like DDPG to start with? submitted by /u/_Arrietty [link] [comments]  ( 1 min )
    Sampling a probabilistic action space for DQN
I think it's relatively well understood why we wouldn't sample probabilistic values for actions taken during learning: since policy learning is a regression problem, a linear output (rather than activating the output with ReLU/softmax) of the feature layers is necessary. However, I want to know whether we can (or even should) be sampling the action space using softmax for non-learning steps. For example, during training the DQN cycle is: record current observation -> take action -> record next observation -> save transition, (s, a, r, s') + some termination indicator -> learn step (sample batch/mini-batch of transitions to learn from). In this case, what would happen if we took a probability distribution over actions (e.g. softmax on the output of the Q-network) during exploration (the non-learning step)? Would this even do anything? (Correct me if I have misspoken anywhere, thank you) submitted by /u/Background-Cable-491 [link] [comments]  ( 1 min )
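For reference, the non-learning step being described is Boltzmann (softmax) exploration over Q-values, a known alternative to epsilon-greedy. A minimal sketch (the temperature value is arbitrary):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action from a softmax over Q-values.
    Higher temperature -> closer to uniform; lower -> closer to greedy."""
    z = q_values / temperature
    z -= z.max()                                   # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return np.random.choice(len(q_values), p=probs), probs

# q_values would come from the Q-network's linear output head.
action, probs = boltzmann_action(np.array([1.2, 0.4, 0.9, -0.3]), temperature=0.5)
print(action, probs.round(3))
```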
    Loss doesn't decrease in Deep Q Learning
    I am training a DQL NN to learn to play tic-tac-toe against the optimal player. I am using a Huber Loss. It seems to work since the average reward increases with training, but the loss increases as well, which I was not expecting. Is this a clue that something is wrong or is it normal? submitted by /u/alesaso2000 [link] [comments]  ( 1 min )
    If we have a reward function that works perfectly for this, would adding another action such as steering have a significant effect on the possible optimal policy?
    If we have a working reward function, providing the desired behavior and optimal policy in a continuous action/state-space problem, would adding another action significantly affect the possible optimal policy? for example, assume you have an RL problem with an action space of 1 (de/acceleration), state-space of 2 (distance from position and velocity), and the agent is tasked to accelerate in a straight line from position a to b. do you think the agent would behave majorly differently? - I'm under the assumption that there would be minimal change aside from a longer training time assuming enough exploration, as the task is to still move in a straight line, but the agent would only have to account for steering action too now submitted by /u/philori [link] [comments]  ( 1 min )
    Does anyone know good python sources hardcoded of RL?
As the title says, I would like to build my own RL agent, but without using ready-made libraries such as stable-baselines3, TensorFlow, or PyTorch, instead implementing the mathematical formulas by hand. submitted by /u/GarantBM [link] [comments]  ( 1 min )
    Learning policies based on mix of algorithms
Dear RL community, I currently play around with using mixed RL algorithms on the same DRL policy network. Thus I can make use of expert MDP trajectories (e.g. BC), offline simulation training (e.g. off-policy SAC), as well as online improvement (e.g. on-policy PPO). While the algorithms differ greatly from each other, the resulting policy network tends to improve across stages, with a bump in performance during stage transitions. Have you tried something like this before? How did it turn out for you in comparison to single-algorithm approaches? submitted by /u/frugaleringenieur [link] [comments]  ( 1 min )
  • Open

    Artificial Intelligence in Social Media Enriches Marketing ROI
Leveraging Machine Learning Models in Social Media Platforms: What Motivates Social Media Marketers? The intersections of marketing with artificial intelligence (AI) technology have opened up fascinating opportunities for marketers of all stripes. AI has been making steady inroads into social media marketing, and rightly so. As the number of social media users rises… Read More »Artificial Intelligence in Social Media Enriches Marketing ROI The post Artificial Intelligence in Social Media Enriches Marketing ROI appeared first on Data Science Central.  ( 4 min )
    Why Gato from Deepmind is a game changer
This week, there was (yet another) game-changing announcement from the folks at DeepMind, named Gato. Gato is a cool cat 🙂 It leads us closer on the road to AGI because Gato is a transformer model that can do multiple things like caption images, chat with people, play games, etc. That means you train… Read More »Why Gato from Deepmind is a game changer The post Why Gato from Deepmind is a game changer appeared first on Data Science Central.  ( 3 min )
  • Open

    Managing Cloud Infrastructure Services With TPI Plugin For Terraform
    Terraform Provider Iterative, a new tool by Iterative for extending the capabilities of HashiCorp's Terraform, for optimizing machine learning engineers' cloud experience with automated cloud infrastructure tools: Iterative Introduces New ML Tool TPI Plugin For HashiCorp's Terraform Cloud Infrastructure Service TPI makes the Terraform for machine learning experience more consistent and less expensive with features including spot recovery improvements, automatic cleanup of unused resources, easy switching between cloud service vendors, and more. submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 1 min )
    Price recommendation for e-comm | methodology suggestions needed
Hi everyone, Hope you are having a great weekend. I work for a small e-comm in my country as a data scientist. My first project (tight timeline) is to recommend price increases for products to give a small boost to revenue. I have fixed and curated all the transaction data that I might need but struggle with actually modelling the problem. Current approach I am working towards: 1. Build a log-log model for each product to identify inelastic products 2. Log-log model to be built for each product at daily-level aggregate with cross elasticity of other products 3. Feature space for each product model: volume of the product sold in the day as dependent, other product prices as price indices relative to the product, plus other seasonal features 4. The definition of inelasticity would be that the increase in price results in an overall increase in revenue (this is because I want to take into account the cross elasticity with other products and the impact of a price change of the parent product on the volume of unchanged products in the same category hierarchy) Challenges I am facing: 1. R2 is relatively okay, but I am finding it difficult to show some backtesting to the business to gain their confidence, since R2 is a metric they do not understand/trust 2. When building the model at the daily level, not all products get sold on the same day, so there is a lot of missingness in the cross-product price index features Any suggestions on improving the methodology will be very appreciated, and for saving me from this pinch I would love to offer my time in future for any collaboration if needed or if it helps in any way submitted by /u/jimmyiceman [link] [comments]  ( 1 min )
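For step 1, a hedged sketch of the log-log specification with one cross-price term and a seasonal control; the column names and data below are placeholders. On challenge 1, an out-of-sample revenue backtest on a holdout period may be easier to sell to the business than R2:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder daily data for one product.
rng = np.random.default_rng(10)
df = pd.DataFrame({
    "volume": rng.lognormal(3, 0.3, 365),
    "price": rng.lognormal(1, 0.1, 365),
    "cross_price_index": rng.lognormal(0, 0.1, 365),
    "dow": np.tile(np.arange(7), 53)[:365],        # day-of-week seasonal control
})

# The coefficient on log(price) is the own-price elasticity; the coefficient
# on the cross-price index captures cross elasticity.
model = smf.ols(
    "np.log(volume) ~ np.log(price) + np.log(cross_price_index) + C(dow)", data=df
).fit()
elasticity = model.params["np.log(price)"]
# Revenue rises with a price increase when demand is inelastic (|elasticity| < 1).
print(f"own-price elasticity: {elasticity:.2f}")
```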
    I created a DALL·E Flow website
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
    Microsoft Research Introduces i-Code: An Integrative and Composable Multimodal Machine Learning Framework
Machine learning has long aimed to provide models with intelligence comparable to humans. Humans can automatically blend multiple sensory inputs like visual, linguistic, and acoustic signals to generate a complete knowledge of their surroundings by virtue of their intelligence. Even the most robust pre-trained AI models, in contrast to humans, are incapable of doing so, confining themselves to one or two input modalities. Researchers have always been interested in developing effective multimodal learning strategies to support this viewpoint. In their new paper, to further support this idea, the Microsoft Azure Cognitive Services Research team proposes a self-supervised pretraining framework named i-Code: An Integrative and Composable Multimodal Learning Framework. Continue Reading Paper: https://arxiv.org/pdf/2205.01818.pdf submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AGI Survey for my Ethics Class
    I created a four question survey on the topics of medical AI ethics and artificial empathy. Answers are anonymous and responses are appreciated, thank you! Artificial Therapists submitted by /u/VoltGe [link] [comments]
    Need advice, looking for a chatbot with predefined knowledge
Hi, I am a game developer and I am thinking about making a game with a procedurally generated world. I'd like this world to have NPCs that the player can talk to. Trouble is, I can't write the dialog for all characters if the world is procedural, so the content of the player-NPC conversations is different with each play. There are two ways to solve this. a) The normal way. I can write some dialog and have it be full of variables and variants that change to match what is procedurally generated. There are big drawbacks to this though. There will still be a lot of manual work, which means this solution doesn't scale well. If I realize I need 2 times more content, I will have to do two times more work, probably more, since even the previous content will get more complex with more new variables. b) A general solution that doesn't need the manual wiring for each conversation. I've seen some great stuff with GPT-3 and GPT-NeoX, but that seems like a different league entirely. My problem looks like this: The player engages in conversation with an NPC. The NPC, say a random trader, has some static behaviour, like prompting trade, or quests. But apart from that, the player has a bunch of predefined questions that he can ask the NPC. Such as: Who are you? What's new in town? Where is the local blacksmith? .. What I need to do is collect all the data that this NPC should know about, and have it interpret and understand the player's question. The NPC/AI recognizes that the player is asking for directions, asks the game to calculate response data (like: it's north from here, next to the tavern), and the NPC/AI then synthesizes human-like speech text containing the data as a response to the player. Is there a machine learning based solution that can help me implement the general solution to this problem? Thanks submitted by /u/Roggi44 [link] [comments]  ( 2 min )
    AI Researchers From Universidad Rey Juan Carlos, Spain Propose A New Method Of Contact Deformations Machine Learning For Real-Time Dynamic Simulation
The modeling of touch and deformations has piqued computer graphics’ interest since it allows computer-generated models of persons and their surroundings to come to life. Despite significant advances in the domain, scientists still struggle to replicate high-resolution contact at interactive speeds. Many researchers are looking into ways to incorporate machine-learning approaches to model contact-driven deformations, inspired by their success in modeling self-driven deformations or deformations that occur due to an object’s motion. The methods developed so far learn rich nonlinear deformations as a function of the subspace state by using a subspace representation of the deformable object. However, the ML algorithms modeling contact deformation either simulate only smooth global contact responses or exhibit extremely restricted 3D interactions. According to the researchers, the previously developed models have some limitations. Deformations are modeled in an object-centric way, which is a good choice for self-driven deformations since it is smooth with regard to the object’s subspace state, and machine learning achieves high generalization even from sparse data. Contact-driven deformations, on the other hand, are not smooth about the object’s state. Therefore machine learning of these deformations would necessitate intensive sampling of the object’s subspace state. This is problematic because the configuration space is huge and difficult to cover. Continue Reading Paper: http://mslab.es/projects/ContactCentricLearning/contents/Romero_SIG2022_final.pdf Github: http://mslab.es/projects/ContactCentricLearning/ submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Ai background remover tool
    submitted by /u/Alive_Ad_2882 [link] [comments]
  • Open

    NEUROMORPHIC COMPUTING WILL NEED PARTNERS TO BREAK INTO THE DATACENTER
    submitted by /u/nnnaikl [link] [comments]

  • Open

    [P] How to do multivariate time series classification using C# and either the Accord.NET or Encog libraries?
    I have a time series based on financial security prices with additional features. I wish to feed this series into some ML construct in order to perform multi-class classification. Most of the solutions that I found in my search offer predictions. I am not interested in predicting future prices. I merely wish to train the ML construct to offer the most likely class for the given time-series input frame. I am looking for C# solutions or links to tutorials that use either the Accord.NET, Encog or ML.NET libraries. I would be most appreciative for answers that lead my eyes to view C# code that demonstrates a solution to my question. In lieu of the above, I would also appreciate a description of the types of ML constructs that would satisfy my requirements. I have no interest in Python solutions. Please, do not chastise me or praise Python. I need the code to be in C# so that it easily integrates into existing code. Thank you. Edit: I forgot to include ML.NET. submitted by /u/LeftShoeHighway [link] [comments]  ( 1 min )
    [P] I need help finding an AI that tells you what sports career is best for you.
    You ask a set of questions and it will use your answers to tell you what sports career is best for you. I have been having a hard time finding it online. Can anyone lend a hand? It would be greatly appreciated. submitted by /u/Texidork [link] [comments]  ( 1 min )
    [D] AI stocks
The advances in AI over the last two years have been mind-blowing. I have taken some part-time ML classes just to try to get a grasp. And wow, I'm impressed by what someone like me can do with low coding skills but a high willingness to learn. I have already built a recommendation model to improve my policy paragraphs based on input text from relevant research articles. I have to admit that Grammarly and QuillBot beat my hobby project, but it was a fun run and I have gotten a ton of experience. One takeaway is that AI is clearly the future, and I want to place some of my investments in AI as a sector. Buy and hold for the future. The stock market is plummeting and will probably continue to do so for a while, but I want to start to research my options. Can you guys share your knowledge of tradable businesses, either pure AI companies or parent companies with control? I'm all for responsible trading, but feel free to share uncertain YOLO companies as well. submitted by /u/sikkerhetellersafety [link] [comments]  ( 1 min )
    [P] I made an open-source demo of OpenAI's CLIP model running completely in the browser - no server involved. Compute embeddings for (and search within) a local directory of images, or search 200k popular images from Reddit (as shown in this video). Link to demo and Github repo in comments.
    submitted by /u/joerocca [link] [comments]  ( 2 min )
    Which Alg to use? [R]
Hey, so I am taking a few images and want to use them to train a model to predict how well a test image matches the training images. Would a CNN be my best bet, or Haar cascade classifiers? Any other thoughts? I am doing this in Python on Google Colab. Thanks! submitted by /u/Cloverdover1 [link] [comments]  ( 1 min )
    [Project] Volunteers Needed for Ukraine Project
    We are recruiting volunteers for a project that will help Ukraine. This is a data-oriented project, and we can use all the help we can get. We want to work very intensely on this project so we can release it quickly. To join us and help Ukraine, please reach out to [breaker25789@gmail.com](mailto:breaker25789@gmail.com) with your name, email, and the team you are interested in. Data Team · No prior skills necessary. New volunteers will receive training in identifying soldiers and military equipment upon joining our team. · This role takes a minimum of five (5) hours a week. · Minimum Age: 18+ · CONTENT WARNING: The primary role of a member of the Data Team is to directly interact with photos and videos from the war in Ukraine, which often contain graphic images of violence and death. Machine Learning Team · Each volunteer needs to be able to dedicate a minimum of ten (10) hours a week. · Preferred prior experience includes familiarity with Docker, AWS SageMaker and S3, machine learning attacks, machine learning security, dedicated red team work, and/or data science. · Minimum Age: 18+ · CONTENT WARNING: Individuals directly involved in training certain algorithms will be exposed to photos and videos from the war in Ukraine, which often contain graphic violence and death. Please notify us if you would prefer to not see that content. submitted by /u/OttersAreDevilSpawn [link] [comments]  ( 2 min )
    [D] Research Director at Deepmind says all we need now is scaling
    submitted by /u/SnoozeDoggyDog [link] [comments]  ( 5 min )
    [P] Image Fusion Techniques for Image classification Task
    Can anyone recommend sources on image fusion techniques (preferably for RGB and near-infrared images) for image classification tasks. submitted by /u/Antman-007 [link] [comments]
    [D] Taking derivative of Expectation with respect to Phi (Variational Inference)
The snippet below, from page 20 of the paper here, mentions that the derivative cannot be taken inside the expectation because the expectation is a function of phi. https://preview.redd.it/jo4e2nt0lgz81.png?width=1148&format=png&auto=webp&s=1f3acb968a07aa3858657770c12689031105a03d However, the paper here (page 3) from the same author shows the score function estimator being used to estimate the gradient, which takes the derivative inside the expectation even when the expectation is a function of phi. Highlighted in the snippet below: https://preview.redd.it/fsv8s272lgz81.png?width=940&format=png&auto=webp&s=c5ad403563d1f9c8048f95a413eb188c35e785c3 I am not able to understand the discrepancy. Could anyone please help me get insight on this? I feel that I am missing something. submitted by /u/That-Mud3051 [link] [comments]  ( 1 min )
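For what it's worth, the two passages are reconciled by the score-function (log-derivative) identity: the gradient cannot be moved inside the expectation as $\nabla_\phi f(z)$, but it can be rewritten as an expectation of $f$ times the score, provided $q_\phi$ is differentiable in $\phi$ and differentiation and integration can be interchanged:

```latex
\nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)]
  = \int f(z)\, \nabla_\phi q_\phi(z)\, dz
  = \int f(z)\, q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz
  = \mathbb{E}_{q_\phi(z)}\!\left[f(z)\, \nabla_\phi \log q_\phi(z)\right].
```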
    [D] Best resources to keep up with latest machine learning research
    I am an 'applied' machine learning researcher, i.e. about 80% of my time is on machine learning, 20% is applying it to physics problems. The increasing breadth and depth of new machine learning research is awe-inspiring. I would like to be able to keep up with the newest developments in the field, somehow, without obviously having the time to read all the latest developments. Is there a website, a resource (like weekly or monthly magazines), or community aimed at collating the newest insights and directions and publishing summaries/overviews in digestible formats? submitted by /u/intheprocesswerust [link] [comments]  ( 2 min )
    [R] Predicting the Elections with Deep Learning - Part 1 - Results
    Hi everyone 👋, During the last 18 months, I played — maybe too much — with census & elections data and tried to predict the french elections with deep learning. I'm not officially a ML researcher, that's only a side project, so be nice please 🙏 ! I'm sharing this only to have fun discussions with people who have the same kink as me. 😊 Here it is 👉 Predicting the Elections with Deep Learning - Part 1 - Results You don't have to be a data scientist to read this 1st post which talks only about the results of the experiment. TLDR of this 1st post: No I didn't find a way to predict the elections results But It's surprising how much you can learn about voters behavior (who votes for which party), even using only aggregated public data. It's only qualitative results so it's not real hard science. Still, it's interesting (worrying?) to think about what would be possible with more data (e.g: FAANG) After that, I'll write 2 more posts: 1 for the model implementation and 1 for the MLOps tooling I've used. submitted by /u/qchenevier [link] [comments]  ( 3 min )
    [P] Java implementation of GPT2 tokenizer
GPT2 Tokenizer Java. When developing a service using the GPT3 API, we often need to count the number of tokens. However, if you develop a service in Java, it is not easy to count this. GPT3 is known to use the same tokenizer as GPT2, so this should be a huge help for someone. For more detail and code, refer to https://github.com/hyunwoongko/gpt2-tokenizer-java.

Requirements: Please install the following dependencies to use the library.

```
implementation 'com.google.api-client:google-api-client:1.32.2'
implementation 'org.apache.commons:commons-lang3:3.12.0'
implementation 'org.springframework.boot:spring-boot-starter-web'
testImplementation 'org.junit.jupiter:junit-jupiter-api:5.3.1'
testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.3.1'
```

Add tokenizer files to the resources directory: Please add the encoder.json and vocab.bpe files to your project resources directory. These files can be found here.

Usage: The following are simple examples of this library. To check the test code for this, refer to here.

Encoding text to tokens:

```java
import ai.tunib.tokenizer.GPT2Tokenizer;
import java.util.List;

GPT2Tokenizer tokenizer = GPT2Tokenizer.fromPretrained("PATH/IN/RESOURCES");
List<Integer> result = tokenizer.encode("Hello my name is Kevin.");
// [15496, 616, 1438, 318, 7939, 13]
```

Decoding tokens to text:

```java
import ai.tunib.tokenizer.GPT2Tokenizer;
import java.util.List;

GPT2Tokenizer tokenizer = GPT2Tokenizer.fromPretrained("PATH/IN/RESOURCES");
String result = tokenizer.decode(List.of(15496, 616, 1438, 318, 7939, 13));
// "Hello my name is Kevin."
```

License: This project is licensed under the terms of the Apache License 2.0. Copyright 2022 Hyunwoong Ko. All Rights Reserved. submitted by /u/hyunwoongko [link] [comments]  ( 1 min )
    [D] [R] AdaVAE: Exploring Adaptive GPT-2s in Variational Auto-Encoders for Language Modeling
    https://arxiv.org/abs/2205.05862 It looks promising, what do you think? submitted by /u/carl__11 [link] [comments]
    [D] LayoutLM: Pre-training of Text and Layout for Document Image Understanding (Paper Summary)
Pre-training of models for NLP applications has focused exclusively on text-level manipulation, neglecting the layout and style information that is vital for document image understanding. This paper proposes LayoutLM, which jointly models interactions between text and layout information across scanned document images. It fits very well in use cases like resume parsing, bill parsing, table parsing, etc. Paper Summary: https://youtu.be/ewyDVIdKXm0 Paper Link: https://arxiv.org/abs/1912.13318 submitted by /u/prakhar21 [link] [comments]  ( 1 min )
[D] Geometric perspective of meta-learning
I am curious whether there is any work explaining the efficiency of meta-learning from a geometric perspective. For example, the famous meta-learning algorithm MAML aims to find parameters of a neural network model that can quickly adapt to new tasks; intuitively, this is very similar to finding a point in the parameter space that has short distances to all the optimal parameters of the training tasks, i.e., a barycenter under some distance metric, and then the adaptation step may be considered a transport path. This is just a simple (maybe naive) example; it would be great if you could share some related work. Thank you! submitted by /u/mike_shen [link] [comments]  ( 1 min )
    [D] ICML author notifications
The author notifications for ICML submissions will be released 14 May (Anywhere on Earth time). I thought I'd start a discussion thread here so we can discuss any issues and share commiserations/celebrations. Note the release date means it could be as late as the end of day on 14 May Anywhere on Earth time, effectively meaning 15 or even 16 May for some people, depending on where you live and whether there are delays. For reference: the notifications of phase 1 rejections were delayed by some hours (around 6 or 12, but not more than 24 iirc); and the author feedback period seemed to open on time, maybe even early. I think the main thing to remember is that no matter the result, your work still has value and you still have value. You are not your work, and any evaluation of the value of your work is just a few people's opinions in a noisy, messy system. You can check Anywhere on Earth time here: https://time.is/Anywhere_on_Earth submitted by /u/tfburns [link] [comments]  ( 3 min )
    [P] Identifying if there are any clusters in a data warehouse
Hi, I have a lot of structured data whose points should be (and are) random, i.e., we don't want clusters of points in space; the points are and should be sparse. The data is basically a dataset that had leak issues. The problem statement is: find whether any subset of the data shares the same root cause (i.e., the same brand or the same country caused the issue). There's a mixture of numerical and categorical data (e.g., size, lat, long and country, brand, priority, city). Assume there are around 20+ dimensions (columns). My first approach was to check whether those points form any clusters. To do so, I was going to use a density-based clustering algorithm, since you don't have to specify the number of clusters and it will just ignore noise (which should be most of the points). But it's hard to preserve the meaning of the columns with dimensionality reduction, and obviously we can't cluster in 20 dimensions; it would take forever. We don't know what could cause the leaks, and we don't know which columns are important or not. What's an alternative approach or a better solution? Thanks (Note: my post got taken down so reposting) submitted by /u/micdean19 [link] [comments]  ( 2 min )
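A minimal sketch of the density-based approach described above, assuming scikit-learn and pandas; the tiny DataFrame, column names, and DBSCAN parameters are all made up for illustration and would need tuning on the real warehouse (Gower distance or HDBSCAN are common alternatives for mixed data):
```python
# Toy illustration of DBSCAN on mixed numeric/categorical data.
# Column names, values, and parameters are hypothetical.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "size":    [1.2, 0.9, 5.4, 1.1, 1.0],
    "lat":     [48.80, 48.90, 40.70, 48.85, 48.82],
    "long":    [2.30, 2.40, -74.00, 2.35, 2.31],
    "country": ["FR", "FR", "US", "FR", "FR"],
    "brand":   ["A", "A", "B", "A", "A"],
})

# Scale numerics and one-hot encode categoricals so distances are comparable.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["size", "lat", "long"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country", "brand"]),
])
X = pre.fit_transform(df)

# eps/min_samples need tuning; label -1 marks noise (the expected majority).
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(labels)  # e.g. one small suspicious cluster plus noise points
```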
    How Useful Is Small Data Experimentation? [Discussion]
Hey all, I'm a front end engineer with an interest in building user interfaces for machine learning tools. I want to build a library for creating web user interfaces that leverage numpy, scipy, sklearn, and any other data science library supported by Pyodide (a tool for running Python in the web browser). The advantages are that there's no environment setup, it can run anywhere, and complex user interfaces can be built easily using existing web frameworks (like React, Vue, etc.). The limitations are that, since it's running in the browser, runtimes will be longer (1-12x longer according to the documentation), and working with bigger datasets may not be feasible. Also, we'd have access to only a few of the data science libraries available in Python (mentioned above). So, really,…  ( 2 min )
  • Open

    Gato & AGI doubts
Just read https://thenextweb.com/news/deepminds-astounding-new-gato-ai-makes-fear-humans-will-never-achieve-agi? I didn't read the whole article, but based on the title I'm guessing the author doubts AGI will happen because of what he sees with Gato? Why would he think that? Maybe I'm missing something. submitted by /u/Ashamed-Asparagus-93 [link] [comments]  ( 1 min )
    Artificial Intelligence Books: These 10 Sci-Fi Novels You Must Read
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    New ML tool to help data scientists manage cloud workloads with Terraform
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    GALLERY
    submitted by /u/cookingandcraft [link] [comments]
    Artificial Intelligence Implications: The Future of Formula One | This is my second post in a university blog series surrounding AI, The Future, and a personal area of interest! Would love some feedback and advice for future episodes!
    submitted by /u/RvZz11 [link] [comments]  ( 1 min )
    Seeking the perfect song recommendation method
I hope you are doing as fine as you can. I have been investigating and researching solutions to a problem for a few months now, and I need your valuable help today. Problem: generate a playlist based on multiple users' music tastes. The goal is for the playlist to suit everyone, at the highest level of satisfaction for every user. Research path: Spotify and recommendation engines. I arrived at the following conclusion: the best way to achieve this is to build a music recommendation engine, but of course you don't want to create it by yourself, because it would need a huge amount of data that I can't provide nor find. So I need to work with a third-party service that already has this data and a recommendation engine, and it seems like there are not many companies that offer that service and that'…  ( 3 min )
Hello, if I can have a minute of your time to answer some questions about A.I. and its uses in cybercrime, I hope you will help me with this one.
    submitted by /u/SpoiledCheweez [link] [comments]  ( 1 min )
    ai is making ai jokes about ai with ai voice...
    submitted by /u/Alive_Ad_2882 [link] [comments]
    DeepMind’s new AI "Gato" can do 604 tasks e.g play Atari, control Robots, chat, caption Images, etc
    submitted by /u/qptbook [link] [comments]
    AI vs AI - Two AI Bots Chatting
    submitted by /u/Alive_Ad_2882 [link] [comments]  ( 1 min )
    Microsoft AI Team Proposes Floquet Codes: A New Class Of Quantum Error Correction Codes That Are Well Suited To Topological Qubits
The Microsoft Azure Quantum program is built on technological advancements that enable quantum computing to scale. Microsoft announced the notion of topological qubits in March 2022, which are qubits that are theoretically more stable than existing ones without sacrificing size or speed. However, developing a general-purpose quantum computer capable of solving industrial-scale problems will necessitate innovation at all levels of the quantum stack, from nanoscale materials to algorithms and applications. Scientists state that quantum states are inherently fragile and are quickly destroyed when a qubit is coupled to its surroundings. This results in noise, making the development of a quantum computer a challenging task. Error correction, which is also utilized in traditional digital computing, is a critical technology for overcoming this fragility. Quantum error correction (QEC) can detect and fix most faults that occur on physical qubits by encoding the state of a single logical qubit into multiple physical qubits. However, depending on the quality of the physical qubits, error correction might raise a computation's space requirements by a factor of thousands, and its time requirements by more than tenfold. As a result, any improvement in error correction has a huge positive impact throughout the stack. Continue Reading Paper: https://arxiv.org/pdf/2202.11829.pdf submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    Sampling with replacement until you’ve seen everything
    Suppose you have a standard deck of 52 cards. You pull out a card, put it back in the deck, shuffle, and pull out another card. How long would you expect to do this until you’ve seen every card? Here’s a variation on the same problem. Suppose you’re a park ranger keeping data on tagged […] Sampling with replacement until you’ve seen everything first appeared on John D. Cook.  ( 3 min )
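For readers who want the punchline without following the link: this is the classic coupon collector problem, and the expected number of draws is n·H_n, where H_n is the n-th harmonic number. A quick check in Python:
```python
# Expected draws to see all n equally likely cards: n * H_n.
from fractions import Fraction

n = 52
H_n = sum(Fraction(1, k) for k in range(1, n + 1))  # exact harmonic number
print(float(n * H_n))  # ~235.98 draws for a standard 52-card deck
```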
  • Open

    Profiling Python Code
    Profiling is a technique to figure out how time is spent in a program. With these statistics, we can find the “hot spot” of a program and think about ways of improvement. Sometimes, a hot spot in an unexpected location may hint at a bug in the program as well. In this tutorial, we will […] The post Profiling Python Code appeared first on Machine Learning Mastery.  ( 18 min )
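As a taste of what the tutorial covers, here is a minimal standard-library example using cProfile and pstats; the profiled function is just a stand-in:
```python
# Profile a function with cProfile and print the hottest entries.
import cProfile
import pstats

def slow_sum(n):
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_sum(1_000_000)
profiler.disable()

# Sort by cumulative time and show the top 5 entries.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```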

  • Open

    Last Week in AI: FDA Clearances, Firing at Google AI, AI for Apple Watch, AI Reviews Beer and Wine
    submitted by /u/regalalgorithm [link] [comments]
    The most effective method to befuddle AI models
    submitted by /u/p0goniphaft111 [link] [comments]
    amazing
    submitted by /u/Murat1_31 [link] [comments]
    AI News | New Intel Gaudi2 & Greco AI Deep Learning Processors | AI Predicts Crohn Disease | AI & Ocean Waves
    submitted by /u/SlightSituation [link] [comments]
    Rendering 3D objects using differentiable SDFs
    submitted by /u/imapurplemango [link] [comments]  ( 1 min )
    Gato: A single Transformer to RuLe them all! (Deepmind's new model)
    submitted by /u/OnlyProggingForFun [link] [comments]
    China Says It’s 3D Printing a 590-Foot Hydroelectric Dam With Zero Human Labor
    submitted by /u/estasfuera [link] [comments]
    Intel’s Habana Labs Introduces Second-Generation AI Deep Learning Processors
Today, Intel's Habana Labs team announced two critical new products: Gaudi2, the 2nd iteration of the Gaudi DL training processor, and Greco, the successor to the Goya DL inference processor. The new chips are significantly faster than their predecessors and competitors, and Gaudi2 and Greco are the first chips launched by Habana Labs following its acquisition by Intel. Gaudi2 and Greco switched from a 16nm to a 7nm process (via TSMC, the manufacturer). The 10 Tensor processing cores present in the Gaudi training processor have been increased to 24. Also, the in-package memory capacity has tripled from 32GB (HBM2) to 96GB (HBM2E). The onboard SRAM has been increased from 24MB to 48MB. Gaudi2 is the first and only accelerator with such an amount of memory. The processor has a TDP of 600W (vs. 350W for Gaudi), although it is claimed that it still employs passive cooling and does not require liquid cooling. Continue Reading submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Meta AI Introduces ‘Make-A-Scene’: A Deep Generative Technique Based On An Autoregressive Transformer For Text-To-Image Synthesis With Human Priors
In recent years, research related to text-to-image generation has been growing exponentially. Nevertheless, current methods still lack at least three essential characteristics. First of all, most models accept as input solely the text information. This is a massive limitation, as the controllability of the model is limited to style or color, but cannot be extended to structure or form, for example. The second limitation is related to human perception: indeed, the final aim of these models is to match human perception and attention, but, in reality, the generation process does not include any relevant prior knowledge on this. For example, the losses which control the generation are usually applied to the whole image without adding a specific focus on parts fundamental to human perception (such as human faces, animals, or salient objects). The last missing characteristic is the ever-present problem of quality and resolution, as most of the works are limited to an output resolution of 256×256. For these reasons, the Facebook AI team has introduced Make-A-Scene. This novel method successfully tackled these three gaps while attaining SOTA results for text-to-image generation. The proposed model is essentially three encoders with discrete tokens, an auto-regressive transformer that learns to generate sequences of tokens conditioned on the scene segmentation, and a decoder that generates images from this transformer-generated sequence. It is important to note that the network does not use the segmented scene for computing the loss; thus, the segmentation is not necessary at inference time. The model is summarized in the figure below. Continue Reading This Summary Paper: https://arxiv.org/pdf/2203.13131v1.pdf submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    BlobGAN: A GAN model that uses simple blobs to manipulate objects in images
    submitted by /u/OnlyProggingForFun [link] [comments]
  • Open

    Enhance the caller experience with hints in Amazon Lex
    We understand speech input better if we have some background on the topic of conversation. Consider a customer service agent at an auto parts wholesaler helping with orders. If the agent knows that the customer is looking for tires, they’re more likely to recognize responses (for example, “Michelin”) on the phone. Agents often pick up […]  ( 5 min )
    Run automatic model tuning with Amazon SageMaker JumpStart
    In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). In March 2022, we also announced the support for APIs in JumpStart. JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across […]  ( 7 min )
  • Open

    [R] Scaling tasks during pre-training may be > Scaling Model Parameters
https://arxiv.org/pdf/2201.06910.pdf ...This leads to a crucial discovery that task scaling can be an efficient alternative to model scaling; i.e., the model size has little impact on performance with an extremely large number of tasks. Our results show that task scaling can substantially improve training efficiency by 30 times in FLOPs... tl;dr scale the number of tasks, as well as data, compute, hyperparameters, FLOPs and prayers, to effectively train LLMs. submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 1 min )
    [Discussion] We tried learning AI from games. How about learning from players?
    I wrote a post arguing that video games are more relevant than ever for AI research. Essentially, RL is at an impasse, and all the really impressive progress comes from self-supervised learning. Could we learn behavior foundation models from millions of traces of real humans playing real games, and would this get us more general and more real intelligence? https://modl.ai/learning-ai-from-players/ submitted by /u/togelius [link] [comments]  ( 2 min )
    [P] I was tired of screenshotting plots in Jupyter to share my results. Wanted something better, information rich. So I built a new %%share magic that freezes a cell, captures its code, output & data and returns a URL for sharing.
https://reddit.com/link/uosqgm/video/pxk7h4jb49z81/player You can try it out in Colab here: https://colab.research.google.com/drive/1E5oU6TjH6OocmvEfU-foJfvCTbTfQrqd?usp=sharing#scrollTo=cVxS_6rBmLKW

To install:
```
pip install thousandwords
```
Then in Jupyter Notebook:
```
from thousandwords import share
```
Then:
```
%%share
# Your Python code goes here..
```
More details: https://docs.1000words-hq.com/docs/python-sdk/share Source: https://github.com/edouard-g/thousandwords Homepage: https://1000words-hq.com
-------------------------------
EDIT: Thanks for the upvotes and the feedback. People have voiced their concerns about inadvertent data leaks, and that the Python package wasn't doing enough to warn the user ahead of time. As a short-term mitigation, I've pushed an update. The %%share magic now warns the user about exactly what gets shared and requires manual confirmation (details below). We'll be looking into building an option to share privately. Feel free to ping me for questions/concerns.

More details on the mitigation:
```
from thousandwords import share
x = 1
```
Then:
```
In [3]: %%share
   ...: print(x)
This will upload 'x' server-side. Anyone with the link will have read access.
Do you wish to proceed ? [y/N]
```
submitted by /u/Left_Ad8361 [link] [comments]  ( 5 min )
    [R] Clinical Prompt Learning with Frozen Language Models
Hi everyone! We are a team of computational and clinical scientists from the Chronosig Project, Oxford. We are excited to share our work on prompt learning for clinical support tasks. A major concern in the field of NLP is that even the largest pre-trained language models, such as GPT-3, do not perform well in specialized domains such as clinical texts. Typically, adapting large PLMs to new domains requires fine-tuning entire models on new texts and downstream tasks, which can require entire fleets of high-end GPUs. Prompt learning is a new paradigm of NLP that fully utilizes the pretrained language model based on its pretraining task and can require substantially fewer training parameters for a new task. In this work, we have investigated the feasibility of using prompt learning with small or medium domain-trained frozen language models versus traditional fine-tuning on clinical tasks, using MIMIC-III data in full-data and few-shot settings. We also develop a new clinical triage task based on ICD-9 codes, available for future use. We demonstrated that prompt learning requires fewer tuned parameters to match and even outperform traditional fine-tuning in few-shot settings. We actually observed that prompt learning could match the performance of traditional fine-tuning with 1000 times fewer trained parameters. This is particularly encouraging for high-stakes domains, such as medicine, where high-quality annotated data is scarce. We hope prompt tuning can offer a competitive alternative to fine-tuning of large language models in the medical domain and act as a framework moving forward. Arxiv: https://arxiv.org/abs/2205.05535 GitHub: https://github.com/NtaylorOX/Public_Clinical_Prompt EDIT: For someone who may need a TL;DR: we are creating a blog post, and we hope it will be released by next week! [Figure: An illustration of the workflow of CPL] submitted by /u/YiTsubasa [link] [comments]  ( 2 min )
    [D] ZIP models as a means to handle regression on data with excess of zeros
Hi Everyone, Sharing an article on how to handle regression for data which has lots and lots of zeros: VevestaX/ZIP_tutorial.md at main · Vevesta/VevestaX · GitHub. This article has been written based on my personal experience. While using regression on this peculiar problem on Kaggle, I was confounded by the excessive zeros in the dependent variable. The reason why models like linear regression and Poisson regression failed wasn't immediately clear to me. On digging, I found two lesser-known models, Hurdle and ZIP (Zero-Inflated Poisson), which are specially designed for this kind of problem. Happy to hear feedback on the article. Please let me know if you know other techniques to handle this problem of excessive zeros in the dependent variable. submitted by /u/vevesta [link] [comments]  ( 1 min )
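For anyone who wants to try the ZIP approach directly, here is a minimal sketch using statsmodels on synthetic zero-heavy count data; the data-generating parameters are arbitrary and exist only to produce an excess of zeros:
```python
# Fit a Zero-Inflated Poisson model on synthetic data with excess zeros.
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)

# ~60% structural zeros, otherwise Poisson counts that depend on x.
structural_zero = rng.random(n) < 0.6
y = np.where(structural_zero, 0, rng.poisson(np.exp(0.5 + 0.8 * x)))

X = sm.add_constant(x)
# exog_infl drives the (logit) probability of a structural zero.
result = ZeroInflatedPoisson(y, X, exog_infl=X, inflation="logit").fit(disp=False)
print(result.summary())
```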
    [D] System Requirement to host a finetuned GPT Neo X 20B model
GPT 3 Davinci with 175B parameters costs $0.0600 / 1K tokens. Their cheaper model Curie with 13B parameters costs $0.0060 / 1K tokens. GPT Neo X has 20B parameters. But on so many sites, GPT Neo X with 20B parameters costs more than GPT 3 Davinci with 175B parameters. For example, on NLPCloud, GPT Neo X costs $0.095 / 1K tokens, and on Goose AI it costs $0.063 / 1K tokens. The quality difference between Davinci and Neo X is huge, but why is the price higher than Davinci? So I was thinking about hosting it on my own custom server. I know fine-tuning it is going to require a lot of resources, but it's only a one-time thing, so I don't really mind that much. But can you please give me an estimate of how much it is going to cost, or the system requirements to host an already fine-tuned Neo X model on a server? Btw my usage is to process around 200,000 tokens per day (~1000-2000 requests). submitted by /u/Homeless_Programmer [link] [comments]  ( 1 min )
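Not an answer on pricing, but here is the usual back-of-the-envelope memory math for a 20B-parameter model; the overhead factors are rough rules of thumb, not vendor numbers:
```python
# Rough memory estimates for a 20B-parameter model (rules of thumb only).
params = 20e9

inference_fp16_gb = params * 2 / 1e9          # ~40 GB just for fp16 weights
print(f"inference weights (fp16): ~{inference_fp16_gb:.0f} GB")
# Plus activation/KV-cache overhead, so in practice roughly one 80 GB GPU
# (e.g. an A100 80GB) or 2+ smaller GPUs with model parallelism.

finetune_adam_fp32_gb = params * 16 / 1e9     # weights + grads + Adam states
print(f"full fine-tuning (Adam, fp32 states): ~{finetune_adam_fp32_gb:.0f} GB")
# ~320 GB, which is why full fine-tuning typically needs a multi-GPU node
# or ZeRO/CPU offloading, while serving is feasible on far less hardware.
```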
[D] A word misleading my text classification model
In my text classification task, a particular word is misleading the model, but this word appears very frequently in the training data for a particular label. E.g., my training data contains "lost my phone", "changed my phone", ... and all these examples belong to the "problem with telephone" class. I am using the Universal Sentence Encoder to build the model. During inference, if I give it some random sentence and put the word "phone" in the middle, my model still predicts the "problem with telephone" class with high confidence. How should I handle this situation? submitted by /u/NSVR57 [link] [comments]  ( 1 min )
    [D] Anomaly detection - deviation from normal
Hi everyone, I am collecting temperature data from a sensor measuring the temperature of a process. The temperature rises to a steady state during operation and cools down when the process is stopped. I would like to set a simple threshold notifying me when the temperature during operation is higher or lower than expected. How should I tackle the following?
1. Filtering out temperatures while the process is warming up (non-steady state).
2. How much data should I use to define the threshold? More like a moving window (median, MAD, ...) or mean reversion on all historical data?
3. I am measuring the room temperature with another sensor; how would you remove seasonality using the second sensor's temperature data (subtraction, division, ...)?
4. How do I update the thresholds as new data is collected?
Thank you all. submitted by /u/No-Way3852 [link] [comments]  ( 1 min )
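For questions 2 and 4 in the post above, one simple, self-updating option is a rolling median with a MAD band; here is a sketch on synthetic steady-state data, where the window size and k are placeholders to tune:
```python
# Rolling median +/- k*MAD threshold on steady-state temperature samples.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
temp = pd.Series(70 + rng.normal(0, 0.5, 1000))  # steady-state readings
temp.iloc[700] = 75.0                            # injected anomaly

window, k = 100, 3.0
med = temp.rolling(window, min_periods=window).median()
mad = (temp - med).abs().rolling(window, min_periods=window).median()
sigma = 1.4826 * mad        # MAD rescaled to be comparable to a std dev

anomalies = temp[(temp - med).abs() > k * sigma]
print(anomalies)            # flags the sample at index 700 (plus any 3-sigma noise)
# For seasonality (question 3), subtracting the room-temperature signal
# before thresholding is a reasonable first try, since ambient heat offsets
# are roughly additive.
```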
  • Open

    Startups using Reinforcement Learning
What are the upcoming startups that are based on RL? I know covariant.ai and kindred.ai use RL; are there any other RL-based startups? submitted by /u/blitzkreig3 [link] [comments]
    Flattening the layers that preprocess the observation space
In this paper, the observation space is a dictionary that has four keys:
- Image
- Direction of the agent
- Position of the agent
- Recurrent state of the policy

From line 237 to line 248 (https://github.com/google-research/google-research/blob/98b89bd2d67c7944a8ac381e0417d4c20a6c87ee/social_rl/multiagent_tfagents/joint_attention/attention_networks.py#L237), a few preprocessing layers that process those four tensors of the obs space are built. What I don't understand is what happens after that:
```python
preprocessing_nest = tf.nest.map_structure(lambda l: None, preprocessing_layers)
flat_observation_spec = nest_utils.flatten_up_to(
    preprocessing_nest,
    observation_spec,
)
```
If I understand correctly, a lambda returning None is applied to each of those preprocessing layers. This is then flattened up to the observation space, to retain some structure. I don't understand what this is exactly doing. Can someone explain? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
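Not a full answer, but a toy illustration (plain TensorFlow, with string stand-ins for the real Keras layers) of what `tf.nest.map_structure(lambda l: None, ...)` produces, and why that matters for the subsequent `flatten_up_to` call:
```python
# map_structure with `lambda l: None` keeps the dict structure of the
# preprocessing layers but replaces every leaf with None, producing a
# "shallow template" of the expected observation structure.
import tensorflow as tf

preprocessing_layers = {
    "image": "conv_net",          # stand-ins for the actual Keras layers
    "direction": "dense_a",
    "position": "dense_b",
    "recurrent_state": "lstm",
}

template = tf.nest.map_structure(lambda l: None, preprocessing_layers)
print(template)  # every leaf replaced by None; only the structure survives

# tf-agents' nest_utils.flatten_up_to(template, observation_spec) then
# flattens the observation spec only down to the depth of this template,
# so each preprocessing layer gets paired with exactly one (possibly still
# nested) piece of the observation before the layers are applied one-to-one.
```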
    Which learning algorithms count as "Deep RL"?
    Deep Q-Networks are of course Deep RL. Is PPO another example? How about A2C/A3C/SAC? I don't know enough about these algorithms yet to decide for myself and I'm struggling to find adequate information online. Thanks in advance! submitted by /u/C_BearHill [link] [comments]  ( 1 min )
    In how many steps should we update our networks in DDPG?
Correct me if I am wrong, but in the original DDPG paper, I saw that the update for the actor and critic networks is inside the step loop (of an episode). But this would be very expensive computationally. So, every how many steps (or global_steps) should I update my models? submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
    Q-Learning Example Tutorial (w/ Q-table & Bellman equation)
    submitted by /u/lukenewmann1 [link] [comments]
    Cannot find where’s the in-place operation error.
    I'm trying to train the MADDPG model; however, it has occurred an in-place operation error. Here's the traceback, and I've taken some excerpts that I think are critical: ​ RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 2]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck! ​ [W ..\torch\csrc\autograd\python_anomaly_mode.cpp:104] Warning: Error detected in AddmmBackward0. Traceback of forward call that caused the error: ... c_loss, a_loss = all_agents.learn(memory) File "C:\Users\chhu…  ( 2 min )
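For context, here is a minimal reproduction of this error class together with the standard fix; `torch.exp` is just a convenient example of an op whose saved output is invalidated by an in-place update:
```python
# Minimal reproduction: an in-place op mutates a tensor that autograd
# saved for the backward pass. torch.exp saves its *output* for backward.
import torch

x = torch.randn(16, 2, requires_grad=True)
y = torch.exp(x)
y += 1               # in-place: bumps y's version counter
y.sum().backward()   # RuntimeError: ... modified by an inplace operation

# Fix: the out-of-place form leaves the saved tensor untouched:
#   y = y + 1
# In MADDPG code, the usual culprits are += / *= applied to network
# outputs, or target-network parameters updated in place between computing
# one loss and calling backward() on another.
```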
    Need some help for recommendation system
Hello. As part of a short internship (around a month long), I am required to create a recommendation system for an eCommerce site. I was planning to build some intuition about how state transitions will take place and what the state will look like. Can you help me decide on the environment, state, reward, and policy for an eCommerce setting? Dataset - https://tianchi.aliyun.com/competition/entrance/231721/information?lang=en-us submitted by /u/kachua26 [link] [comments]  ( 2 min )
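Since the question is about framing the MDP, here is one hypothetical framing as a gym environment: state = the user's last N interacted item IDs, action = the item to recommend, reward = 1 for a click. The toy user model inside `step` is a placeholder you would replace with replay of the logged interactions from the dataset:
```python
# A minimal, illustrative RL framing for session-based recommendation.
import numpy as np
import gym
from gym import spaces

class RecEnv(gym.Env):
    def __init__(self, n_items=100, history_len=5):
        self.n_items = n_items
        self.history_len = history_len
        self.action_space = spaces.Discrete(n_items)        # item to recommend
        self.observation_space = spaces.MultiDiscrete([n_items] * history_len)
        self.rng = np.random.default_rng(0)

    def reset(self):
        # A session starts with a random interaction history.
        self.history = self.rng.integers(0, self.n_items, self.history_len)
        return self.history.copy()

    def step(self, action):
        # Toy user model: clicks if the recommended item is "similar"
        # (close in ID) to the last interacted item.
        clicked = abs(int(action) - int(self.history[-1])) < 5
        reward = 1.0 if clicked else 0.0
        if clicked:
            self.history = np.append(self.history[1:], action)
        done = self.rng.random() < 0.1   # session ends stochastically
        return self.history.copy(), reward, done, {}
```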
    "Searching for Efficient Neural Architectures for On-Device ML on Edge TPUs", Akin et al 2022 {G}
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    New AI Deep Learning Processors | Intel's Gaudi2 & Greco
    submitted by /u/getrich_or_diemining [link] [comments]
  • Open

    Broom, Broom: WeRide Revs Up Self-Driving Street Sweepers Powered by NVIDIA
    When it comes to safety, efficiency and sustainability, autonomous vehicles are delivering a clean sweep. Autonomous vehicle company and NVIDIA Inception member WeRide this month began a public road pilot of its Robo Street Sweepers. The vehicles, designed to perform round-the-clock cleaning services, are built on the high-performance, energy-efficient compute of NVIDIA. The fleet of Read article > The post Broom, Broom: WeRide Revs Up Self-Driving Street Sweepers Powered by NVIDIA appeared first on NVIDIA Blog.  ( 2 min )
  • Open

    A Beginners Guide to RPA ‘ Bots’
    Enterprises burdened with the processes that are repetitive, time-consuming & mundane are constantly on the lookout for new and innovative…  ( 4 min )
  • Open

    How Does Knowledge Graph Embedding Extrapolate to Unseen Data: A Semantic Evidence View. (arXiv:2109.11800v3 [cs.CL] UPDATED)
Knowledge Graph Embedding (KGE) aims to learn representations for entities and relations. Most KGE models have gained great success, especially on extrapolation scenarios. Specifically, given an unseen triple (h, r, t), a trained model can still correctly predict t from (h, r, ?), or h from (?, r, t); such extrapolation ability is impressive. However, most existing KGE works focus on the design of delicate triple modeling functions, which mainly tell us how to measure the plausibility of observed triples, but offer limited explanation of why the methods can extrapolate to unseen data and what the important factors are that help KGE extrapolate. Therefore in this work, we attempt to study two problems of KGE extrapolation: 1. How does KGE extrapolate to unseen data? 2. How to design the KGE model with better extrapolation ability? For problem 1, we first discuss the impact factors for extrapolation and, from the relation, entity and triple level respectively, propose three Semantic Evidences (SEs), which can be observed from the train set and provide important semantic information for extrapolation. Then we verify the effectiveness of SEs through extensive experiments on several typical KGE methods. For problem 2, to make better use of the three levels of SE, we propose a novel GNN-based KGE model, called Semantic Evidence aware Graph Neural Network (SE-GNN). In SE-GNN, each level of SE is modeled explicitly by the corresponding neighbor pattern, and merged sufficiently by the multi-layer aggregation, which contributes to obtaining more extrapolative knowledge representation. Finally, through extensive experiments on FB15k-237 and WN18RR datasets, we show that SE-GNN achieves state-of-the-art performance on the Knowledge Graph Completion task and exhibits better extrapolation ability. Our code is available at https://github.com/renli1024/SE-GNN.  ( 3 min )
    Positive, Negative and Neutral: Modeling Implicit Feedback in Session-based News Recommendation. (arXiv:2205.06058v1 [cs.IR])
News recommendation for anonymous readers is a useful but challenging task for many news portals, where interactions between readers and articles are limited within a temporary login session. Previous works tend to formulate session-based recommendation as a next item prediction task, while they neglect the implicit feedback from user behaviors, which indicates what users really like or dislike. Hence, we propose a comprehensive framework to model user behaviors through positive feedback (i.e., the articles they spend more time on) and negative feedback (i.e., the articles they choose to skip without clicking in). Moreover, the framework implicitly models the user using their session start time, and the article using its initial publishing time, in what we call "neutral feedback". Empirical evaluation on three real-world news datasets shows the framework's promising performance, with more accurate, diverse and even unexpected recommendations than other state-of-the-art session-based recommendation approaches.  ( 2 min )
    RLOC: Terrain-Aware Legged Locomotion using Reinforcement Learning and Optimal Control. (arXiv:2012.03094v3 [cs.RO] UPDATED)
We present a unified model-based and data-driven approach for quadrupedal planning and control to achieve dynamic locomotion over uneven terrain. We utilize on-board proprioceptive and exteroceptive feedback to map sensory information and desired base velocity commands into footstep plans using a reinforcement learning (RL) policy. This RL policy is trained in simulation over a wide range of procedurally generated terrains. When run online, the system tracks the generated footstep plans using a model-based motion controller. We evaluate the robustness of our method over a wide variety of complex terrains. It exhibits behaviors which prioritize stability over aggressive locomotion. Additionally, we introduce two ancillary RL policies for corrective whole-body motion tracking and recovery control. These policies account for changes in physical parameters and external perturbations. We train and evaluate our framework on a complex quadrupedal system, ANYmal version B, and demonstrate transferability to a larger and heavier robot, ANYmal C, without requiring retraining.  ( 2 min )
    A BIC-based Mixture Model Defense against Data Poisoning Attacks on Classifiers. (arXiv:2105.13530v2 [cs.LG] UPDATED)
    Data Poisoning (DP) is an effective attack that causes trained classifiers to misclassify their inputs. DP attacks significantly degrade a classifier's accuracy by covertly injecting attack samples into the training set. Broadly applicable to different classifier structures, without strong assumptions about the attacker, an {\it unsupervised} Bayesian Information Criterion (BIC)-based mixture model defense against "error generic" DP attacks is herein proposed that: 1) addresses the most challenging {\it embedded} DP scenario wherein, if DP is present, the poisoned samples are an {\it a priori} unknown subset of the training set, and with no clean validation set available; 2) applies a mixture model both to well-fit potentially multi-modal class distributions and to capture poisoned samples within a small subset of the mixture components; 3) jointly identifies poisoned components and samples by minimizing the BIC cost defined over the whole training set, with the identified poisoned data removed prior to classifier training. Our experimental results, for various classifier structures and benchmark datasets, demonstrate the effectiveness and universality of our defense under strong DP attacks, as well as its superiority over other works.  ( 2 min )
    Ensemble Clustering via Co-association Matrix Self-enhancement. (arXiv:2205.05937v1 [cs.LG])
    Ensemble clustering integrates a set of base clustering results to generate a stronger one. Existing methods usually rely on a co-association (CA) matrix that measures how many times two samples are grouped into the same cluster according to the base clusterings to achieve ensemble clustering. However, when the constructed CA matrix is of low quality, the performance will degrade. In this paper, we propose a simple yet effective CA matrix self-enhancement framework that can improve the CA matrix to achieve better clustering performance. Specifically, we first extract the high-confidence (HC) information from the base clusterings to form a sparse HC matrix. By propagating the highly-reliable information of the HC matrix to the CA matrix and complementing the HC matrix according to the CA matrix simultaneously, the proposed method generates an enhanced CA matrix for better clustering. Technically, the proposed model is formulated as a symmetric constrained convex optimization problem, which is efficiently solved by an alternating iterative algorithm with convergence and global optimum theoretically guaranteed. Extensive experimental comparisons with twelve state-of-the-art methods on eight benchmark datasets substantiate the effectiveness, flexibility and efficiency of the proposed model in ensemble clustering. The codes and datasets can be downloaded at https://github.com/Siritao/EC-CMS.  ( 2 min )
    Contingency-constrained economic dispatch with safe reinforcement learning. (arXiv:2205.06212v1 [eess.SY])
Future power systems will rely heavily on microgrids with a high share of decentralised renewable energy sources and energy storage systems. The high complexity and uncertainty in this context might make conventional power dispatch strategies infeasible. Reinforcement-learning based (RL) controllers can address this challenge; however, they cannot themselves provide safety guarantees, preventing their deployment in practice. To overcome this limitation, we propose a formally validated RL controller for economic dispatch. We extend conventional constraints by a time-dependent constraint encoding the islanding contingency. The contingency constraint is computed using set-based backwards reachability analysis and actions of the RL agent are verified through a safety layer. Unsafe actions are projected into the safe action space while leveraging constrained zonotope set representations for computational efficiency. The developed approach is demonstrated on a residential use case using real-world measurements.  ( 2 min )
    Maximum sampled conditional likelihood for informative subsampling. (arXiv:2011.05988v3 [math.ST] UPDATED)
Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited. After a subsample is taken from the full data, most available methods use an inverse probability weighted (IPW) objective function to estimate the model parameters. The IPW estimator does not fully utilize the information in the selected subsample. In this paper, we propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data. We establish the asymptotic normality of the MSCLE and prove that its asymptotic variance-covariance matrix is the smallest among a class of asymptotically unbiased estimators, including the IPW estimator. We further discuss the asymptotic results with the L-optimal subsampling probabilities and illustrate the estimation procedure with generalized linear models. Numerical experiments are provided to evaluate the practical performance of the proposed method.  ( 2 min )
    Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results. (arXiv:2204.03475v2 [cs.CV] UPDATED)
ImageNet serves as the primary dataset for evaluating the quality of computer-vision models. The common practice today is training each architecture with a tailor-made scheme, designed and tuned by an expert. In this paper, we present a unified scheme for training any backbone on ImageNet. The scheme, named USI (Unified Scheme for ImageNet), is based on knowledge distillation and modern tricks. It requires no adjustments or hyper-parameter tuning between different models, and is efficient in terms of training times. We test USI on a wide variety of architectures, including CNNs, Transformers, Mobile-oriented and MLP-only. On all models tested, USI outperforms previous state-of-the-art results. Hence, we are able to transform training on ImageNet from an expert-oriented task to an automatic seamless routine. Since USI accepts any backbone and trains it to top results, it also enables methodical comparisons and identification of the most efficient backbones along the speed-accuracy Pareto curve. Implementation is available at: https://github.com/Alibaba-MIIL/Solving_ImageNet  ( 2 min )
    Zero-shot Code-Mixed Offensive Span Identification through Rationale Extraction. (arXiv:2205.06119v1 [cs.CL])
    This paper investigates the effectiveness of sentence-level transformers for zero-shot offensive span identification on a code-mixed Tamil dataset. More specifically, we evaluate rationale extraction methods of Local Interpretable Model Agnostic Explanations (LIME) \cite{DBLP:conf/kdd/Ribeiro0G16} and Integrated Gradients (IG) \cite{DBLP:conf/icml/SundararajanTY17} for adapting transformer based offensive language classification models for zero-shot offensive span identification. To this end, we find that LIME and IG show baseline $F_{1}$ of 26.35\% and 44.83\%, respectively. Besides, we study the effect of data set size and training process on the overall accuracy of span identification. As a result, we find both LIME and IG to show significant improvement with Masked Data Augmentation and Multilabel Training, with $F_{1}$ of 50.23\% and 47.38\% respectively. \textit{Disclaimer : This paper contains examples that may be considered profane, vulgar, or offensive. The examples do not represent the views of the authors or their employers/graduate schools towards any person(s), group(s), practice(s), or entity/entities. Instead they are used to emphasize only the linguistic research challenges.}  ( 2 min )
    How I failed machine learning in medical imaging -- shortcomings and recommendations. (arXiv:2103.10292v2 [eess.IV] UPDATED)
Medical imaging is an important research field with many opportunities for improving patients' health. However, there are a number of challenges that are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we review several problems related to choosing datasets, methods, evaluation metrics, and publication strategies. With a review of literature and our own analysis, we show that at every step, potential biases can creep in. On a positive note, we also see that initiatives to counteract these problems are already being started. Finally we provide a broad range of recommendations on how to further address these problems in the future. For reproducibility, data and code for our analyses are available on \url{https://github.com/GaelVaroquaux/ml_med_imaging_failures}  ( 2 min )
    Framework for inferring empirical causal graphs from binary data to support multidimensional poverty analysis. (arXiv:2205.06131v1 [stat.ME])
Poverty is one of the fundamental issues that mankind faces. The Multidimensional Poverty Index (MPI) is deployed for measuring poverty issues in a population beyond monetary terms. However, MPI cannot provide information regarding associations and causal relations among poverty factors. Does education cause income inequality in a specific region? Is lacking education a cause of health issues? Without knowing causal relations, policy makers cannot pinpoint the root causes of poverty issues in a specific population, which might not be the same across different populations. Additionally, MPI requires binary data, which cannot be analyzed by most causal inference frameworks. In this work, we propose an exploratory-data-analysis framework for finding possible causal relations with confidence intervals among binary data. The proposed framework provides not only how severe the issue of poverty is, but also the causal relations among poverty factors. Moreover, knowing a confidence interval of the degree of causal direction lets us know how strong a causal relation is. We evaluated the proposed framework against several baseline approaches on simulation datasets, as well as on two real-world datasets as case studies: 1) twin births of the United States: the relation between birth weight and mortality of twins, and 2) Thailand population surveys from 378k households of Chiang Mai and 353k households of Khon Kaen provinces. Our framework performed better than the baselines in most cases. The first case study reveals that almost all mortality cases in twins involve low birth weight, but not all low-birth-weight twins died. The second case study reveals that smoking is associated with drinking alcohol in both provinces, and that there is a causal relation in which smoking causes drinking alcohol only in Chiang Mai province. The framework can be applied beyond the poverty context.  ( 3 min )
    Machine Learning Workflow to Explain Black-box Models for Early Alzheimer's Disease Classification Evaluated for Multiple Datasets. (arXiv:2205.05907v1 [cs.LG])
Purpose: Hard-to-interpret black-box Machine Learning (ML) models are often used for early Alzheimer's Disease (AD) detection. Methods: To interpret eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Support Vector Machine (SVM) black-box models, a workflow based on Shapley values was developed. All models were trained on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and evaluated for an independent ADNI test set, as well as the external Australian Imaging and Lifestyle flagship study of Ageing (AIBL), and Open Access Series of Imaging Studies (OASIS) datasets. Shapley values were compared to intuitively interpretable Decision Trees (DTs), and Logistic Regression (LR), as well as natural and permutation feature importances. To avoid the reduction of the explanation validity caused by correlated features, forward selection and aspect consolidation were implemented. Results: Some black-box models outperformed DTs and LR. The forward-selected features correspond to brain areas previously associated with AD. Shapley values identified biologically plausible associations with moderate to strong correlations with feature importances. The most important RF features to predict AD conversion were the volume of the amygdalae, and a cognitive test score. Good cognitive test performances and large brain volumes decreased the AD risk. The models trained using cognitive test scores significantly outperformed brain volumetric models ($p<0.05$). Cognitive Normal (CN) vs. AD models were successfully transferred to external datasets. Conclusion: In comparison to previous work, improved performances for ADNI and AIBL were achieved for CN vs. Mild Cognitive Impairment (MCI) classification using brain volumes. The Shapley values and the feature importances showed moderate to strong correlations.  ( 2 min )
    Sample Complexity Bounds for Robustly Learning Decision Lists against Evasion Attacks. (arXiv:2205.06127v1 [cs.LG])
    A fundamental problem in adversarial machine learning is to quantify how much training data is needed in the presence of evasion attacks. In this paper we address this issue within the framework of PAC learning, focusing on the class of decision lists. Given that distributional assumptions are essential in the adversarial setting, we work with probability distributions on the input data that satisfy a Lipschitz condition: nearby points have similar probability. Our key results illustrate that the adversary's budget (that is, the number of bits it can perturb on each input) is a fundamental quantity in determining the sample complexity of robust learning. Our first main result is a sample-complexity lower bound: the class of monotone conjunctions (essentially the simplest non-trivial hypothesis class on the Boolean hypercube) and any superclass has sample complexity at least exponential in the adversary's budget. Our second main result is a corresponding upper bound: for every fixed $k$ the class of $k$-decision lists has polynomial sample complexity against a $\log(n)$-bounded adversary. This sheds further light on the question of whether an efficient PAC learning algorithm can always be used as an efficient $\log(n)$-robust learning algorithm under the uniform distribution.  ( 2 min )
    Over-the-Air Federated Learning with Joint Adaptive Computation and Power Control. (arXiv:2205.05867v1 [cs.LG])
    This paper considers over-the-air federated learning (OTA-FL). OTA-FL exploits the superposition property of the wireless medium, and performs model aggregation over the air for free. Thus, it can greatly reduce the communication cost incurred in communicating model updates from the edge devices. In order to fully utilize this advantage while providing comparable learning performance to conventional federated learning that presumes model aggregation via noiseless channels, we consider the joint design of transmission scaling and the number of local iterations at each round, given the power constraint at each edge device. We first characterize the training error due to such channel noise in OTA-FL by establishing a fundamental lower bound for general functions with Lipschitz-continuous gradients. Then, by introducing an adaptive transceiver power scaling scheme, we propose an over-the-air federated learning algorithm with joint adaptive computation and power control (ACPC-OTA-FL). We provide the convergence analysis for ACPC-OTA-FL in training with non-convex objective functions and heterogeneous data. We show that the convergence rate of ACPC-OTA-FL matches that of FL with noise-free communications.  ( 2 min )
    Ethereum Fraud Detection with Heterogeneous Graph Neural Networks. (arXiv:2203.12363v2 [cs.LG] UPDATED)
    While transactions with cryptocurrencies such as Ethereum are becoming more prevalent, fraud and other criminal transactions are not uncommon. Graph analysis algorithms and machine learning techniques detect suspicious transactions that lead to phishing in large transaction networks. Many graph neural network (GNN) models have been proposed to apply deep learning techniques to graph structures. Although there is research on phishing detection using GNN models in the Ethereum transaction network, models that address the scale of the number of vertices and edges and the imbalance of labels have not yet been studied. In this paper, we compared the model performance of GNN models on the actual Ethereum transaction network dataset and phishing reported label data to exhaustively compare and verify which GNN models and hyperparameters produce the best accuracy. Specifically, we evaluated the model performance of representative homogeneous GNN models which consider single-type nodes and edges and heterogeneous GNN models which support different types of nodes and edges. We showed that heterogeneous models had better model performance than homogeneous models. In particular, the RGCN model achieved the best performance in the overall metrics.  ( 2 min )
    Anomaly Detection of Adversarial Examples using Class-conditional Generative Adversarial Networks. (arXiv:2105.10101v2 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) have been shown vulnerable to Test-Time Evasion attacks (TTEs, or adversarial examples), which, by making small changes to the input, alter the DNN's decision. We propose an unsupervised attack detector on DNN classifiers based on class-conditional Generative Adversarial Networks (GANs). We model the distribution of clean data conditioned on the predicted class label by an Auxiliary Classifier GAN (AC-GAN). Given a test sample and its predicted class, three detection statistics are calculated based on the AC-GAN Generator and Discriminator. Experiments on image classification datasets under various TTE attacks show that our method outperforms previous detection methods. We also investigate the effectiveness of anomaly detection using different DNN layers (input features or internal-layer features) and demonstrate, as one might expect, that anomalies are harder to detect using features closer to the DNN's output layer.  ( 2 min )
    Low-variance estimation in the Plackett-Luce model via quasi-Monte Carlo sampling. (arXiv:2205.06024v1 [stat.ML])
    The Plackett-Luce (PL) model is ubiquitous in learning-to-rank (LTR) because it provides a useful and intuitive probabilistic model for sampling ranked lists. Counterfactual offline evaluation and optimization of ranking metrics are pivotal for using LTR methods in production. When adopting the PL model as a ranking policy, both tasks require the computation of expectations with respect to the model. These are usually approximated via Monte-Carlo (MC) sampling, since the combinatorial scaling in the number of items to be ranked makes their analytical computation intractable. Despite recent advances in improving the computational efficiency of the sampling process via the Gumbel top-k trick, the MC estimates can suffer from high variance. We develop a novel approach to producing more sample-efficient estimators of expectations in the PL model by combining the Gumbel top-k trick with quasi-Monte Carlo (QMC) sampling, a well-established technique for variance reduction. We illustrate our findings both theoretically and empirically using real-world recommendation data from Amazon Music and the Yahoo learning-to-rank challenge.  ( 2 min )
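For readers unfamiliar with the Gumbel top-k trick the abstract builds on: perturbing each item's log-weight with i.i.d. Gumbel(0,1) noise and sorting yields an exact Plackett-Luce sample, e.g.:
```python
# Exact Plackett-Luce sampling via the Gumbel top-k trick.
import numpy as np

rng = np.random.default_rng(0)
log_w = np.log(np.array([3.0, 1.0, 0.5, 2.0]))  # PL item weights w_i

g = rng.gumbel(size=log_w.shape)     # i.i.d. Gumbel(0, 1) noise
ranking = np.argsort(-(log_w + g))   # sort perturbed log-weights, descending
print(ranking)   # a full ranking ~ PL(w); ranking[:k] is a top-k sample
```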
    Multimodal Indoor Localisation for Measuring Mobility in Parkinson's Disease using Transformers. (arXiv:2205.06142v1 [cs.LG])
    Parkinson's disease (PD) is a slowly progressive debilitating neurodegenerative disease which is prominently characterised by motor symptoms. Indoor localisation, including number and speed of room to room transitions, provides a proxy outcome which represents mobility and could be used as a digital biomarker to quantify how mobility changes as this disease progresses. We use data collected from 10 people with Parkinson's, and 10 controls, each of whom lived for five days in a smart home with various sensors. In order to more effectively localise them indoors, we propose a transformer-based approach utilizing two data modalities, Received Signal Strength Indicator (RSSI) and accelerometer data from wearable devices, which provide complementary views of movement. Our approach makes asymmetric and dynamic correlations by a) learning temporal correlations at different scales and levels, and b) utilizing various gating mechanisms to select relevant features within modality and suppress unnecessary modalities. On a dataset with real patients, we demonstrate that our proposed method gives an average accuracy of 89.9%, outperforming competitors. We also show that our model is able to better predict in-home mobility for people with Parkinson's with an average offset of 1.13 seconds to ground truth.  ( 2 min )
    Improved Meta Learning for Low Resource Speech Recognition. (arXiv:2205.06182v1 [cs.CL])
    We propose a new meta learning based framework for low resource speech recognition that improves the previous model agnostic meta learning (MAML) approach. The MAML is a simple yet powerful meta learning approach. However, the MAML presents some core deficiencies such as training instabilities and slower convergence speed. To address these issues, we adopt multi-step loss (MSL). The MSL aims to calculate losses at every step of the inner loop of MAML and then combines them with a weighted importance vector. The importance vector ensures that the loss at the last step has more importance than the previous steps. Our empirical evaluation shows that MSL significantly improves the stability of the training procedure and it thus also improves the accuracy of the overall system. Our proposed system outperforms MAML based low resource ASR system on various languages in terms of character error rates and stable training behavior.  ( 2 min )
    SpecRepair: Counter-Example Guided Safety Repair of Deep Neural Networks. (arXiv:2106.01917v5 [cs.LG] UPDATED)
    Deep neural networks (DNNs) are increasingly applied in safety-critical domains, such as self-driving cars, unmanned aircraft, and medical diagnosis. It is of fundamental importance to certify the safety of these DNNs, i.e. that they comply with a formal safety specification. While safety certification tools exactly answer this question, they are of no help in debugging unsafe DNNs, requiring the developer to iteratively verify and modify the DNN until safety is eventually achieved. Hence, a repair technique needs to be developed that can produce a safe DNN automatically. To address this need, we present SpecRepair, a tool that efficiently eliminates counter-examples from a DNN and produces a provably safe DNN without harming its classification accuracy. SpecRepair combines specification-based counter-example search and resumes training of the DNN, penalizing counter-examples and certifying the resulting DNN. We evaluate SpecRepair's effectiveness on the ACAS Xu benchmark, a DNN-based controller for unmanned aircraft, and two image classification benchmarks. The results show that SpecRepair is more successful in producing safe DNNs than comparable methods, has a shorter runtime, and produces safe DNNs while preserving their classification accuracy.  ( 2 min )
    Ensemble Classifier Design Tuned to Dataset Characteristics for Network Intrusion Detection. (arXiv:2205.06177v1 [cs.CR])
    Machine Learning-based supervised approaches require highly customized and fine-tuned methodologies to deliver outstanding performance. This paper presents a dataset-driven design and performance evaluation of a machine learning classifier for the network intrusion dataset UNSW-NB15. Analysis of the dataset suggests that it suffers from class representation imbalance and class overlap in the feature space. We employed ensemble methods using Balanced Bagging (BB), eXtreme Gradient Boosting (XGBoost), and Random Forest empowered by Hellinger Distance Decision Tree (RF-HDDT). BB and XGBoost are tuned to handle the imbalanced data, and Random Forest (RF) classifier is supplemented by the Hellinger metric to address the imbalance issue. Two new algorithms are proposed to address the class overlap issue in the dataset. These two algorithms are leveraged to help improve the performance of the testing dataset by modifying the final classification decision made by three base classifiers as part of the ensemble classifier which employs a majority vote combiner. The proposed design is evaluated for both binary and multi-category classification. Comparing the proposed model to those reported on the same dataset in the literature demonstrate that the proposed model outperforms others by a significant margin for both binary and multi-category classification cases.
    Segmentation-Consistent Probabilistic Lesion Counting. (arXiv:2204.05276v2 [eess.IV] UPDATED)
    Lesion counts are important indicators of disease severity, patient prognosis, and treatment efficacy, yet counting as a task in medical imaging is often overlooked in favor of segmentation. This work introduces a novel continuously differentiable function that maps lesion segmentation predictions to lesion count probability distributions in a consistent manner. The proposed end-to-end approach--which consists of voxel clustering, lesion-level voxel probability aggregation, and Poisson-binomial counting--is non-parametric and thus offers a robust and consistent way to augment lesion segmentation models with post hoc counting capabilities. Experiments on Gadolinium-enhancing lesion counting demonstrate that our method outputs accurate and well-calibrated count distributions that capture meaningful uncertainty information. They also reveal that our model is suitable for multi-task learning of lesion segmentation, is efficient in low data regimes, and is robust to adversarial attacks.  ( 2 min )
    NER-MQMRC: Formulating Named Entity Recognition as Multi Question Machine Reading Comprehension. (arXiv:2205.05904v1 [cs.LG])
    NER has been traditionally formulated as a sequence labeling task. However, there has been a recent trend toward posing NER as a machine reading comprehension (MRC) task (Wang et al., 2020; Mengge et al., 2020), where the entity name (or other information) is treated as a question, the text as the context, and the entity value in the text as the answer snippet. These works consider MRC based on a single question (entity) at a time. We propose posing NER as a multi-question MRC task, where multiple questions (one question per entity) are considered at the same time for a single text. We propose a novel BERT-based multi-question MRC (NER-MQMRC) architecture for this formulation. The NER-MQMRC architecture takes all entities as input to BERT for learning token embeddings with self-attention and leverages BERT-based entity representations to further improve these token embeddings for the NER task. Evaluation on three NER datasets shows that our proposed architecture leads to, on average, 2.5 times faster training and 2.3 times faster inference compared to NER-SQMRC framework based models, by considering all entities together in a single pass. Further, we show that our model's performance does not degrade compared to single-question based MRC (NER-SQMRC) (Devlin et al., 2019), yielding F1 gains of +0.41%, +0.32% and +0.27% on the AE-Pub, Ecommerce5PT and Twitter datasets respectively. We propose this architecture primarily to solve large-scale e-commerce attribute (or entity) extraction from unstructured text, at a magnitude of 50k+ attributes, in a scalable production environment with high performance and optimised training and inference runtimes.  ( 2 min )
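    The single-pass, multi-question packing can be sketched as follows; the exact input format and special tokens used in the paper may differ, so treat this as an assumption-laden illustration built on a standard BERT tokenizer.

        # Sketch of multi-question input packing for a single BERT pass.
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        entities = ["brand", "color", "size"]            # one question per entity
        text = "Nike running shoes, black, size 10"

        questions = " [SEP] ".join(entities)             # all entity questions together ...
        inputs = tokenizer(questions, text, return_tensors="pt")  # ... one forward pass
        # A span-prediction head would then output one answer span per entity question.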
    Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements. (arXiv:2205.06129v1 [stat.ML])
    Prediction of an individual's race and ethnicity plays an important role in social science and public health research. Examples include studies of racial disparity in health and voting. Recently, Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to combine information from Census surname files with the geocoding of an individual's residence, has emerged as a leading methodology for this prediction task. Unfortunately, BISG suffers from two Census data problems that contribute to unsatisfactory predictive performance for minorities. First, the decennial Census often contains zero counts for minority racial groups in the Census blocks where some members of those groups reside. Second, because the Census surname files only include frequent names, many surnames -- especially those of minorities -- are missing from the list. To address the zero counts problem, we introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts by extending the naïve Bayesian inference of the BISG methodology to full posterior inference. To address the missing surname problem, we supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available. Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians. The proposed methodology, together with additional name data, is available via the open-source software package wru.  ( 2 min )
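    The naive BISG update that fBISG generalizes is a one-line application of Bayes' rule; the numbers below are toy values rather than Census figures, and the closing comment points at the zero-count failure mode the paper addresses.

        # Naive BISG posterior (toy numbers, for illustration only).
        import numpy as np

        races = ["white", "black", "hispanic", "asian", "other"]
        p_race_given_surname = np.array([0.05, 0.02, 0.03, 0.88, 0.02])  # e.g. "Nguyen"
        p_geo_given_race = np.array([0.60, 0.10, 0.15, 0.10, 0.05])      # from block counts

        posterior = p_race_given_surname * p_geo_given_race   # Bayes' rule numerator
        posterior /= posterior.sum()                          # normalize over races
        # A zero Census count in p_geo_given_race zeroes out that race entirely;
        # the measurement-error model in fBISG is designed to avoid exactly this.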
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v1 [cs.LG])
    Recently, the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. showed that the negative term in the contrastive loss can be removed if the so-called prediction head is added to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why, even when trivial collapsed global optimal solutions exist, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized, prediction head. Under a simple setting, we characterize the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. The acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.  ( 2 min )
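    A sketch of the identity-initialized prediction head with a frozen diagonal, under one plausible reading of the described setup:

        # Prediction head initialized as identity; only off-diagonal entries train.
        import torch
        import torch.nn as nn

        class OffDiagonalHead(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.off_diag = nn.Parameter(torch.zeros(dim, dim))
                self.register_buffer("mask", 1.0 - torch.eye(dim))  # zeroes the diagonal

            def forward(self, z):
                # Effective weight = I + trainable off-diagonal part; diagonal stays at 1.
                weight = torch.eye(z.shape[-1], device=z.device) + self.off_diag * self.mask
                return z @ weight.T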
    Gauge equivariant neural networks for quantum lattice gauge theories. (arXiv:2012.05232v2 [cond-mat.str-el] UPDATED)
    Gauge symmetries play a key role in physics, appearing in areas such as quantum field theories of the fundamental particles and emergent degrees of freedom in quantum materials. Motivated by the desire to efficiently simulate many-body quantum systems with exact local gauge invariance, gauge equivariant neural-network quantum states are introduced, which exactly satisfy the local Hilbert space constraints necessary for the description of quantum lattice gauge theory with a $\mathbb{Z}_d$ gauge group on different geometries. Focusing on the special case of the $\mathbb{Z}_2$ gauge group on a periodically identified square lattice, the equivariant architecture is analytically shown to contain the loop-gas solution as a special case. Gauge equivariant neural-network quantum states are used in combination with variational quantum Monte Carlo to obtain compact descriptions of the ground state wavefunction for the $\mathbb{Z}_2$ theory away from the exactly solvable limit, and to demonstrate the confining/deconfining phase transition of the Wilson loop order parameter.  ( 2 min )
    ELODI: Ensemble Logit Difference Inhibition for Positive-Congruent Training. (arXiv:2205.06265v1 [cs.LG])
    Negative flips are errors introduced in a classification system when a legacy model is replaced with a new one. Existing methods to reduce the negative flip rate (NFR) either do so at the expense of overall accuracy using model distillation, or use ensembles, which multiply inference cost prohibitively. We present a method to train a classification system that achieves paragon performance in both error rate and NFR, at the inference cost of a single model. Our method introduces a generalized distillation objective, Logit Difference Inhibition (LDI), that penalizes changes in the logits between the new and old model, without forcing them to coincide as in ordinary distillation. LDI affords the model flexibility to reduce error rate along with NFR. The method uses a homogeneous ensemble as the reference model for LDI, hence the name Ensemble LDI, or ELODI. The reference model can then be substituted with a single model at inference time. The method leverages the observation that negative flips are typically not close to the decision boundary, but often exhibit large deviations in the distance among their logits, which are reduced by ELODI.  ( 2 min )
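    One plausible instantiation of an LDI-style objective is sketched below: per-sample uniform logit shifts are ignored and only residual changes beyond a margin are penalized, so the new model is not forced to coincide with the old one as in ordinary distillation. The margin value and the exact functional form are assumptions, not the paper's definition.

        # Sketch of a Logit Difference Inhibition-style loss (one plausible reading).
        import torch
        import torch.nn.functional as F

        def ldi_loss(new_logits, old_logits, margin=0.5):
            diff = new_logits - old_logits.detach()
            diff = diff - diff.mean(dim=-1, keepdim=True)   # ignore uniform logit shifts
            # Penalize only changes that exceed the margin, leaving room to reduce errors.
            return F.relu(diff.abs() - margin).pow(2).mean()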
    Learning Generalized Policies Without Supervision Using GNNs. (arXiv:2205.06002v1 [cs.AI])
    We consider the problem of learning generalized policies for classical planning domains using graph neural networks from small instances represented in lifted STRIPS. The problem has been considered before, but the proposed neural architectures are complex and the results are often mixed. In this work, we use a simple and general GNN architecture and aim at obtaining crisp experimental results and a deeper understanding: either the policy greedy in the learned value function achieves close to 100% generalization over instances larger than those used in training, or the failure must be understood, and possibly fixed, logically. For this, we exploit the relation established between the expressive power of GNNs and the $C_{2}$ fragment of first-order logic (namely, FOL with 2 variables and counting quantifiers). We find, for example, that domains with general policies that require more expressive features can be solved with GNNs once the states are extended with suitable "derived atoms" encoding role compositions and transitive closures that do not fit into $C_{2}$. The work follows the GNN approach for learning optimal general policies in a supervised fashion (Stahlberg, Bonet, Geffner, 2022); but the learned policies are no longer required to be optimal (which expands the scope, as many planning domains do not have general optimal policies) and are learned without supervision. Interestingly, value-based reinforcement learning methods that aim to produce optimal policies do not always yield policies that generalize, as the goals of optimality and generality are in conflict in domains where optimal planning is NP-hard.  ( 2 min )
    Learning more skills through optimistic exploration. (arXiv:2107.14226v6 [cs.LG] UPDATED)
    Unsupervised skill learning objectives (Gregor et al., 2016, Eysenbach et al., 2018) allow agents to learn rich repertoires of behavior in the absence of extrinsic rewards. They work by simultaneously training a policy to produce distinguishable latent-conditioned trajectories, and a discriminator to evaluate distinguishability by trying to infer latents from trajectories. The hope is for the agent to explore and master the environment by encouraging each skill (latent) to reliably reach different states. However, an inherent exploration problem lingers: when a novel state is actually encountered, the discriminator will necessarily not have seen enough training data to produce accurate and confident skill classifications, leading to low intrinsic reward for the agent and effective penalization of the sort of exploration needed to actually maximize the objective. To combat this inherent pessimism towards exploration, we derive an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement. Our objective directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples, thus providing an intrinsic reward more tailored to the true objective compared to pseudocount-based methods (Burda et al., 2019). We call this exploration bonus discriminator disagreement intrinsic reward, or DISDAIN. We demonstrate empirically that DISDAIN improves skill learning both in a tabular grid world (Four Rooms) and the 57 games of the Atari Suite (from pixels). Thus, we encourage researchers to treat pessimism with DISDAIN.  ( 2 min )
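    The disagreement bonus can be sketched as the entropy of the mean discriminator prediction minus the mean of the individual entropies, i.e., an information-gain estimate over the ensemble; the shapes and clamping below are illustrative choices.

        # Sketch of a DISDAIN-style disagreement bonus for one state.
        import torch

        def disdain_bonus(probs):            # probs: [n_discriminators, n_skills]
            mean_p = probs.mean(dim=0)
            h_mean = -(mean_p * mean_p.clamp_min(1e-8).log()).sum()           # entropy of mean
            h_each = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()  # mean entropy
            return h_mean - h_each           # zero when all discriminators agree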
    AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control. (arXiv:2104.02180v2 [cs.GR] UPDATED)
    Synthesizing graceful and life-like behaviors for physically simulated characters has been a fundamental challenge in computer animation. Data-driven methods that leverage motion tracking are a prominent class of techniques for producing high fidelity motions for a wide range of behaviors. However, the effectiveness of these tracking-based methods often hinges on carefully designed objective functions, and when applied to large and diverse motion datasets, these methods require significant additional machinery to select the appropriate motion for the character to track in a given scenario. In this work, we propose to obviate the need to manually design imitation objectives and mechanisms for motion selection by utilizing a fully automated approach based on adversarial imitation learning. High-level task objectives that the character should perform can be specified by relatively simple reward functions, while the low-level style of the character's behaviors can be specified by a dataset of unstructured motion clips, without any explicit clip selection or sequencing. These motion clips are used to train an adversarial motion prior, which specifies style-rewards for training the character through reinforcement learning (RL). The adversarial RL procedure automatically selects which motion to perform, dynamically interpolating and generalizing from the dataset. Our system produces high-quality motions that are comparable to those achieved by state-of-the-art tracking-based techniques, while also being able to easily accommodate large datasets of unstructured motion clips. Composition of disparate skills emerges automatically from the motion prior, without requiring a high-level motion planner or other task-specific annotations of the motion clips. We demonstrate the effectiveness of our framework on a diverse cast of complex simulated characters and a challenging suite of motor control tasks.  ( 3 min )
    A Comparative Survey of Deep Active Learning. (arXiv:2203.13450v2 [cs.LG] UPDATED)
    Active Learning (AL) is a set of techniques for reducing labeling cost by sequentially selecting data samples from a large unlabeled data pool for labeling. Meanwhile, Deep Learning (DL) is data-hungry, and the performance of DL models scales monotonically with more training data. Therefore, in recent years, Deep Active Learning (DAL) has risen as a feasible solution for maximizing model performance while minimizing the expensive labeling cost. Abundant methods have sprung up and literature reviews of DAL have been presented before. However, the performance comparison of different branches of DAL methods under various tasks is still insufficient, and our work fills this gap. In this paper, we survey and categorize DAL-related work and construct comparative experiments across frequently used datasets and DAL algorithms. Additionally, we explore some factors (e.g., batch size, number of epochs in the training process) that influence the efficacy of DAL, which provides better references for researchers to design their own DAL experiments or carry out DAL-related applications. We construct a DAL toolkit, DeepAL+, by re-implementing many highly-cited DAL-related methods, and it will be released to the public.  ( 2 min )
    Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs. (arXiv:2205.05943v1 [cs.CL])
    We propose a generative model for text generation, which exhibits disentangled latent representations of syntax and semantics. Contrary to previous work, this model does not need syntactic information such as constituency parses, or semantic information such as paraphrase pairs. Our model relies solely on the inductive bias found in attention-based architectures such as Transformers. In the attention of Transformers, keys handle information selection while values specify what information is conveyed. Our model, dubbed QKVAE, uses attention in its decoder to read latent variables, with one latent variable providing keys while another provides values. We run experiments on latent representations and on syntax/semantics transfer, which show that QKVAE displays clear signs of disentangled syntax and semantics. We also show that our model displays competitive syntax transfer capabilities when compared to supervised models and that comparable supervised models need a fairly large amount of data (more than 50K samples) to outperform it on both syntactic and semantic transfer. The code for our experiments is publicly available.  ( 2 min )
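    The decoder-side mechanism can be sketched as attention whose keys come from one latent and values from another; the projections and dimensions below are assumptions for illustration, not the paper's architecture.

        # Sketch: attention reading two latents, one for keys and one for values.
        import torch
        import torch.nn.functional as F

        def latent_attention(queries, z_syntax, z_semantics, Wk, Wv):
            keys = z_syntax @ Wk        # what to select: the syntax-side latent
            values = z_semantics @ Wv   # what is conveyed: the semantics-side latent
            attn = F.softmax(queries @ keys.transpose(-2, -1) / keys.shape[-1] ** 0.5, -1)
            return attn @ values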
    Exploiting symmetry in variational quantum machine learning. (arXiv:2205.06217v1 [quant-ph])
    Variational quantum machine learning is an extensively studied application of near-term quantum computers. The success of variational quantum learning models crucially depends on finding a suitable parametrization of the model that encodes an inductive bias relevant to the learning task. However, precious little is known about guiding principles for the construction of suitable parametrizations. In this work, we holistically explore when and how symmetries of the learning problem can be exploited to construct quantum learning models with outcomes invariant under the symmetry of the learning task. Building on tools from representation theory, we show how a standard gateset can be transformed into an equivariant gateset that respects the symmetries of the problem at hand through a process of gate symmetrization. We benchmark the proposed methods on two toy problems that feature a non-trivial symmetry and observe a substantial increase in generalization performance. As our tools can also be applied in a straightforward way to other variational problems with symmetric structure, we show how equivariant gatesets can be used in variational quantum eigensolvers.  ( 2 min )
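    Gate symmetrization in the representation-theoretic sense amounts to twirling a gate generator over the symmetry group; the NumPy sketch below shows this construction on a toy two-qubit swap symmetry and reflects our reading of the abstract rather than the paper's code.

        # Twirling a gate generator over a symmetry group to obtain an equivariant gate.
        import numpy as np

        def symmetrize(generator, group_unitaries):
            """Average U G U^dagger over the group representation."""
            return sum(U @ generator @ U.conj().T for U in group_unitaries) / len(group_unitaries)

        # Example: symmetrize a single-qubit Z generator under qubit swap (group = {I, SWAP}).
        I4 = np.eye(4)
        SWAP = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=float)
        Z1 = np.kron(np.diag([1.0, -1.0]), np.eye(2))   # Z on the first qubit
        G = symmetrize(Z1, [I4, SWAP])                  # equals (Z⊗I + I⊗Z)/2, swap-invariant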
    Localized Vision-Language Matching for Open-vocabulary Object Detection. (arXiv:2205.06160v1 [cs.CV])
    In this work, we propose an open-world object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner and second specializes the model for the object detection task using known class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-world detection approaches while being data-efficient.  ( 2 min )
    Equivariant quantum circuits for learning on weighted graphs. (arXiv:2205.06109v1 [quant-ph])
    Variational quantum algorithms are the leading candidate for near-term advantage on noisy quantum hardware. When training a parametrized quantum circuit to solve a specific task, the choice of ansatz is one of the most important factors that determines the trainability and performance of the algorithm. Problem-tailored ansatzes have become the standard for tasks in optimization or quantum chemistry, and yield more efficient algorithms with better performance than unstructured approaches. In quantum machine learning (QML), however, the literature on ansatzes that are motivated by the training data structure is scarce. Considering that it is widely known that unstructured ansatzes can become untrainable with increasing system size and circuit depth, it is of key importance to also study problem-tailored circuit architectures in a QML context. In this work, we introduce an ansatz for learning tasks on weighted graphs that respects an important graph symmetry, namely equivariance under node permutations. We evaluate the performance of this ansatz on a complex learning task on weighted graphs, where an ML model is used to implement a heuristic for a combinatorial optimization problem. We analytically study the expressivity of our ansatz at depth one, and numerically compare the performance of our model on instances with up to 20 qubits to ansatzes where the equivariance property is gradually broken. We show that our ansatz outperforms all others even in the small-instance regime. Our results strengthen the notion that symmetry-preserving ansatzes are a key to success in QML and should be an active area of research in order to enable near-term advantages in this field.  ( 2 min )
    Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets. (arXiv:2205.06218v1 [cs.CV])
    This paper performs comprehensive analysis on datasets for occlusion-aware face segmentation, a task that is crucial for many downstream applications. The collection and annotation of such datasets are time-consuming and labor-intensive. Although some efforts have been made in synthetic data generation, the naturalistic aspect of data remains less explored. In our study, we propose two occlusion generation techniques: Naturalistic Occlusion Generation (NatOcc), for producing high-quality naturalistic synthetic occluded faces; and Random Occlusion Generation (RandOcc), a more general synthetic occluded data generation method. We empirically show the effectiveness and robustness of both methods, even for unseen occlusions. To facilitate model evaluation, we present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild, featuring both careful alignment preprocessing and an in-the-wild setting for robustness testing. We further conduct a comprehensive analysis on a newly introduced segmentation benchmark, offering insights for future exploration.  ( 2 min )
    E-Mail Assistant -- Automation of E-Mail Handling and Management using Robotic Process Automation. (arXiv:2205.05882v1 [cs.LG])
    In this paper, a workflow for designing a bot using Robotic Process Automation (RPA), associated with Artificial Intelligence (AI) that is used for information extraction, classification, etc., is proposed. The bot is equipped with many features that make email handling a stress-free job. It automatically logs into the mailbox through secured channels, distinguishes between useful and non-useful emails, classifies the emails under different labels, downloads the attached files, creates different directories, and stores the downloaded files in the relevant directories. It moves the non-useful emails to the trash. Further, the bot can also be trained to rename the attached files with the names of the sender/applicant in case of a job application, for the sake of convenience. The bot is designed and tested using the UiPath tool to improve the performance of the system. The paper also discusses further possible functionalities that can be added to the bot.  ( 2 min )
    Towards Robust Unsupervised Disentanglement of Sequential Data -- A Case Study Using Music Audio. (arXiv:2205.05871v1 [cs.SD])
    Disentangled sequential autoencoders (DSAEs) represent a class of probabilistic graphical models that describes an observed sequence with dynamic latent variables and a static latent variable. The former encode information at a frame rate identical to the observation, while the latter globally governs the entire sequence. This introduces an inductive bias and facilitates unsupervised disentanglement of the underlying local and global factors. In this paper, we show that the vanilla DSAE suffers from being sensitive to the choice of model architecture and the capacity of the dynamic latent variables, and is prone to collapse of the static latent variable. As a countermeasure, we propose TS-DSAE, a two-stage training framework that first learns sequence-level prior distributions, which are subsequently employed to regularise the model and facilitate auxiliary objectives to promote disentanglement. The proposed framework is fully unsupervised and robust against the global factor collapse problem across a wide range of model configurations. It also avoids typical solutions, such as adversarial training, which usually involve laborious parameter tuning and domain-specific data augmentation. We conduct quantitative and qualitative evaluations to demonstrate its robustness in terms of disentanglement on both artificial and real-world music audio datasets.  ( 2 min )
    Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation. (arXiv:2205.06053v1 [cs.SD])
    This paper introduces a unified source-filter network with a harmonic-plus-noise source excitation generation mechanism. In our previous work, we proposed unified Source-Filter GAN (uSFGAN) for developing a high-fidelity neural vocoder with flexible voice controllability using a unified source-filter neural network architecture. However, the capability of uSFGAN to model the aperiodic source excitation signal is insufficient, and there is still a gap in sound quality between the natural and generated speech. To improve the source excitation modeling and generated sound quality, a new source excitation generation network separately generating periodic and aperiodic components is proposed. The advanced adversarial training procedure of HiFiGAN is also adopted to replace that of Parallel WaveGAN used in the original uSFGAN. Both objective and subjective evaluation results show that the modified uSFGAN significantly improves the sound quality of the basic uSFGAN while maintaining the voice controllability.  ( 2 min )
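    A harmonic-plus-noise excitation can be sketched in a few lines; in uSFGAN the periodic and aperiodic components are produced by networks, whereas here they are replaced by fixed toy signals.

        # Toy harmonic-plus-noise source excitation (network-predicted parts replaced by fixed signals).
        import numpy as np

        sr, f0, dur = 16000, 120.0, 0.05
        t = np.arange(int(sr * dur)) / sr
        harmonics = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 11))  # periodic part
        noise = 0.05 * np.random.randn(t.size)                                     # aperiodic part
        excitation = harmonics + noise     # this signal would feed the filter network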
    Optimal transport weights for causal inference. (arXiv:2109.01991v4 [stat.ME] UPDATED)
    Imbalance in covariate distributions leads to biased estimates of causal effects. Weighting methods attempt to correct this imbalance but rely on specifying models for the treatment assignment mechanism, which is unknown in observational studies. This leaves researchers to choose the proper weighting method and the appropriate covariate functions for these models without knowing the correct combination to achieve distributional balance. In response to these difficulties, we propose a nonparametric generalization of several other weighting schemes found in the literature: Causal Optimal Transport. This new method directly targets distributional balance by minimizing optimal transport distances between treatment and control groups or, more generally, between any source and target population. Our approach is semiparametrically efficient and model-free but can also incorporate moments or any other important functions of covariates that a researcher desires to balance. Moreover, our method can provide nonparametric estimates of the conditional mean outcome function and nonparametric imputations of the missing potential outcomes, and we give rates of convergence for these estimators. We find that Causal Optimal Transport outperforms competitor methods when both the propensity score and outcome models are misspecified, indicating it is a robust alternative to common weighting methods. Finally, we demonstrate the utility of our method in an external control trial examining the effect of misoprostol versus oxytocin for the treatment of post-partum hemorrhage.  ( 2 min )
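    OT-based imputation of the missing potential outcomes can be sketched with the POT library; the barycentric (plan-weighted) projection below is one simple variant and not necessarily the paper's exact estimator.

        # Sketch: impute treated units' control-arm outcomes via an OT plan (toy data).
        import numpy as np
        import ot  # Python Optimal Transport (POT)

        rng = np.random.default_rng(0)
        X_c = rng.normal(size=(100, 5))                      # control covariates
        X_t = rng.normal(size=(50, 5)) + 0.3                 # treated covariates
        y_c = X_c @ np.ones(5) + rng.normal(size=100)        # control outcomes
        y_t = X_t @ np.ones(5) + 1.0 + rng.normal(size=50)   # true effect = 1

        a, b = np.full(100, 1 / 100), np.full(50, 1 / 50)    # uniform marginals
        plan = ot.emd(a, b, ot.dist(X_c, X_t))               # optimal transport plan
        cond = plan / plan.sum(axis=0, keepdims=True)        # column-normalize the plan
        y0_imputed = cond.T @ y_c                            # counterfactuals for treated
        att = y_t.mean() - y0_imputed.mean()                 # toy ATT estimate (near 1)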
    Communicative Subgraph Representation Learning for Multi-Relational Inductive Drug-Gene Interaction Prediction. (arXiv:2205.05957v1 [cs.LG])
    Illuminating the interconnections between drugs and genes is an important topic in drug development and precision medicine. Currently, computational predictions of drug-gene interactions mainly focus on the binding interactions without considering other relation types like agonist, antagonist, etc. In addition, existing methods either heavily rely on high-quality domain features or are intrinsically transductive, which limits the capacity of models to generalize to drugs/genes that lack external information or are unseen during the training process. To address these problems, we propose a novel Communicative Subgraph representation learning for Multi-relational Inductive drug-Gene interactions prediction (CoSMIG), where the predictions of drug-gene relations are made through subgraph patterns, and thus are naturally inductive for unseen drugs/genes without retraining or utilizing external domain features. Moreover, the model strengthens the relations on the drug-gene graph through a communicative message passing mechanism. To evaluate our method, we compiled two new benchmark datasets from DrugBank and DGIdb. The comprehensive experiments on the two datasets showed that our method outperformed state-of-the-art baselines in the transductive scenarios and achieved superior performance in the inductive ones. Further experimental analysis including LINCS experimental validation and literature verification also demonstrated the value of our model.
    Embodied vision for learning object representations. (arXiv:2205.06198v1 [cs.LG])
    Recent time-contrastive learning approaches manage to learn invariant object representations without supervision. This is achieved by mapping successive views of an object onto close-by internal representations. When considering this learning approach as a model of the development of human object recognition, it is important to consider what visual input a toddler would typically observe while interacting with objects. First, human vision is highly foveated, with high resolution only available in the central region of the field of view. Second, objects may be seen against a blurry background due to infants' limited depth of field. Third, during object manipulation a toddler mostly observes close objects filling a large part of the field of view due to their rather short arms. Here, we study how these effects impact the quality of visual representations learnt through time-contrastive learning. To this end, we let a visually embodied agent "play" with objects in different locations of a near photo-realistic flat. During each play session the agent views an object in multiple orientations before turning its body to view another object. The resulting sequence of views feeds a time-contrastive learning algorithm. Our results show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments. We argue that this effect is caused by the reduction of features extracted in the background, a neural network bias for large features in the image and a greater similarity between novel and familiar background regions. We conclude that the embodied nature of visual learning may be crucial for understanding the development of human object perception.
    Smooth-Reduce: Leveraging Patches for Improved Certified Robustness. (arXiv:2205.06154v1 [cs.LG])
    Randomized smoothing (RS) has been shown to be a fast, scalable technique for certifying the robustness of deep neural network classifiers. However, methods based on RS require augmenting data with large amounts of noise, which leads to significant drops in accuracy. We propose a training-free, modified smoothing approach, Smooth-Reduce, that leverages patching and aggregation to provide improved classifier certificates. Our algorithm classifies overlapping patches extracted from an input image, and aggregates the predicted logits to certify a larger radius around the input. We study two aggregation schemes -- max and mean -- and show that both approaches provide better certificates in terms of certified accuracy, average certified radii and abstention rates as compared to concurrent approaches. We also provide theoretical guarantees for such certificates, and empirically show significant improvements over other randomized smoothing methods that require expensive retraining. Further, we extend our approach to videos and provide meaningful certificates for video classifiers. A project page can be found at https://nyu-dice-lab.github.io/SmoothReduce/  ( 2 min )
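    The patch-classify-aggregate pipeline can be sketched as follows; the patch size, stride, and mean aggregation are illustrative choices, and the certification step on top of the aggregated logits is omitted.

        # Sketch of Smooth-Reduce-style patch extraction and logit aggregation.
        import torch

        def smooth_reduce_logits(model, image, patch=160, stride=32):
            # image: [C, H, W] -> overlapping patches: [N, C, patch, patch]
            patches = image.unfold(1, patch, stride).unfold(2, patch, stride)
            patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, image.shape[0], patch, patch)
            logits = model(patches)       # classify every overlapping patch
            return logits.mean(dim=0)     # "mean" aggregation; "max" is the other variant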
    GPN: A Joint Structural Learning Framework for Graph Neural Networks. (arXiv:2205.05964v1 [cs.LG])
    Graph neural networks (GNNs) have been applied to a variety of graph tasks. Most existing work on GNNs is based on the assumption that the given graph data is optimal, yet it is inevitable that there exist missing or incomplete edges in the graph data used for training, leading to degraded performance. In this paper, we propose Generative Predictive Network (GPN), a GNN-based joint learning framework that simultaneously learns the graph structure and the downstream task. Specifically, we develop a bilevel optimization framework for this joint learning task, in which the upper optimization (generator) and the lower optimization (predictor) are both instantiated with GNNs. To the best of our knowledge, our method is the first GNN-based bilevel optimization framework for resolving this task. Through extensive experiments, our method outperforms a wide range of baselines using benchmark datasets.
    Graph Masked Autoencoders with Transformers. (arXiv:2202.08391v2 [cs.LG] UPDATED)
    Recently, transformers have shown promising performance in learning graph representations. However, there are still some challenges when applying transformers to real-world scenarios due to the fact that deep transformers are hard to train from scratch and the quadratic memory consumption w.r.t. the number of nodes. In this paper, we propose Graph Masked Autoencoders (GMAEs), a self-supervised transformer-based model for learning graph representations. To address the above two challenges, we adopt the masking mechanism and the asymmetric encoder-decoder design. Specifically, GMAE takes partially masked graphs as input, and reconstructs the features of the masked nodes. The encoder and decoder are asymmetric, where the encoder is a deep transformer and the decoder is a shallow transformer. The masking mechanism and the asymmetric design make GMAE a memory-efficient model compared with conventional transformers. We show that, when serving as a conventional self-supervised graph representation model, GMAE achieves state-of-the-art performance on both the graph classification task and the node classification task under common downstream evaluation protocols. We also show that, compared with training in an end-to-end manner from scratch, we can achieve comparable performance after pre-training and fine-tuning using GMAE while simplifying the training process.
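    The masking and asymmetric encoder/decoder design can be sketched with generic transformer blocks standing in for the paper's graph transformer layers; the layer counts and mask ratio below are illustrative.

        # Sketch of GMAE-style masking with a deep encoder and shallow decoder.
        import torch
        import torch.nn as nn

        class GMAESketch(nn.Module):
            def __init__(self, dim=64, mask_ratio=0.5):
                super().__init__()
                layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=8)   # deep
                self.decoder = nn.TransformerEncoder(layer, num_layers=1)   # shallow
                self.mask_token = nn.Parameter(torch.zeros(dim))
                self.mask_ratio = mask_ratio

            def forward(self, node_feats):                    # [batch, n_nodes, dim]
                n = node_feats.shape[1]
                masked = torch.rand(n, device=node_feats.device) < self.mask_ratio
                enc = self.encoder(node_feats[:, ~masked])    # encode visible nodes only
                full = node_feats.clone()
                full[:, masked] = self.mask_token             # insert mask tokens
                full[:, ~masked] = enc
                recon = self.decoder(full)
                # Reconstruction loss on the masked node features only.
                return ((recon[:, masked] - node_feats[:, masked]) ** 2).mean()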
    A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models. (arXiv:2104.02640v2 [math.ST] UPDATED)
    Mixture of experts (MoE) are a popular class of statistical and machine learning models that have gained attention over the years due to their flexibility and efficiency. In this work, we consider Gaussian-gated localized MoE (GLoME) and block-diagonal covariance localized MoE (BLoME) regression models to present nonlinear relationships in heterogeneous data with potential hidden graph-structured interactions between high-dimensional predictors. These models pose difficult statistical estimation and model selection questions, both from a computational and theoretical perspective. This paper is devoted to the study of the problem of model selection among a collection of GLoME or BLoME models characterized by the number of mixture components, the complexity of Gaussian mean experts, and the hidden block-diagonal structures of the covariance matrices, in a penalized maximum likelihood estimation framework. In particular, we establish non-asymptotic risk bounds that take the form of weak oracle inequalities, provided that lower bounds for the penalties hold. The good empirical behavior of our models is then demonstrated on synthetic and real datasets.
    Robustness and Reliability When Training With Noisy Labels. (arXiv:2110.03321v2 [stat.ML] UPDATED)
    Labelling of data for supervised learning can be costly and time-consuming and the risk of incorporating label noise in large data sets is imminent. When training a flexible discriminative model using a strictly proper loss, such noise will inevitably shift the solution towards the conditional distribution over noisy labels. Nevertheless, while deep neural networks have proven capable of fitting random labels, regularisation and the use of robust loss functions empirically mitigate the effects of label noise. However, such observations concern robustness in accuracy, which is insufficient if reliable uncertainty quantification is critical. We demonstrate this by analysing the properties of the conditional distribution over noisy labels for an input-dependent noise model. In addition, we evaluate the set of robust loss functions characterised by noise-insensitive, asymptotic risk minimisers. We find that strictly proper and robust loss functions both offer asymptotic robustness in accuracy, but neither guarantee that the final model is calibrated. Moreover, even with robust loss functions, overfitting is an issue in practice. With these results, we aim to explain observed robustness of common training practices, such as early stopping, to label noise. In addition, we aim to encourage the development of new noise-robust algorithms that not only preserve accuracy but that also ensure reliability.
    A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation. (arXiv:2203.09893v2 [cs.SD] UPDATED)
    Automatic Music Transcription (AMT) has been recognized as a key enabling technology with a wide range of applications. Given the task's complexity, best results have typically been reported for systems focusing on specific settings, e.g. instrument-specific systems tend to yield improved results over instrument-agnostic methods. Similarly, higher accuracy can be obtained when only estimating frame-wise $f_0$ values and neglecting the harder note event detection. Despite their high accuracy, such specialized systems often cannot be deployed in the real-world. Storage and network constraints prohibit the use of multiple specialized models, while memory and run-time constraints limit their complexity. In this paper, we propose a lightweight neural network for musical instrument transcription, which supports polyphonic outputs and generalizes to a wide variety of instruments (including vocals). Our model is trained to jointly predict frame-wise onsets, multipitch and note activations, and we experimentally show that this multi-output structure improves the resulting frame-level note accuracy. Despite its simplicity, benchmark results show our system's note estimation to be substantially better than a comparable baseline, and its frame-level accuracy to be only marginally below those of specialized state-of-the-art AMT systems. With this work we hope to encourage the community to further investigate low-resource, instrument-agnostic AMT systems.
    Virtual twins of nonlinear vibrating multiphysics microstructures: physics-based versus deep learning-based approaches. (arXiv:2205.05928v1 [math.DS])
    Micro-Electro-Mechanical Systems are complex structures, often involving nonlinearities of geometric and multiphysics nature, that are used as sensors and actuators in countless applications. Starting from full-order representations, we apply deep learning techniques to generate accurate, efficient and real-time reduced order models to be used as virtual twins for the simulation and optimization of higher-level complex systems. We extensively test the reliability of the proposed procedures on micromirrors, arches and gyroscopes, also displaying intricate dynamical evolutions like internal resonances. In particular, we discuss the accuracy of the deep learning technique and its ability to replicate and converge to the invariant manifolds predicted using the recently developed direct parametrization approach that allows extracting the nonlinear normal modes of large finite element models. Finally, by addressing an electromechanical gyroscope, we show that the non-intrusive deep learning approach generalizes easily to complex multiphysics problems.
    Pseudo-Label Guided Multi-Contrast Generalization for Non-Contrast Organ-Aware Segmentation. (arXiv:2205.05898v1 [eess.IV])
    Non-contrast computed tomography (NCCT) is commonly acquired for lung cancer screening, assessment of general abdominal pain or suspected renal stones, trauma evaluation, and many other indications. However, the absence of contrast limits the ability to distinguish boundaries between organs. In this paper, we propose a novel unsupervised approach that leverages pairwise contrast-enhanced CT (CECT) context to compute non-contrast segmentation without ground-truth labels. Unlike generative adversarial approaches, we compute the pairwise morphological context with CECT to provide teacher guidance instead of generating fake anatomical context. Additionally, we further augment the intensity correlations in 'organ-specific' settings and increase the sensitivity to organ-aware boundaries. We validate our approach on multi-organ segmentation with paired non-contrast & contrast-enhanced CT scans using five-fold cross-validation. Full external validations are performed on an independent non-contrast cohort for aorta segmentation. Compared with the current state-of-the-art for abdominal organ segmentation in the fully supervised setting, our proposed pipeline achieves a significantly higher Dice score, by 3.98% (internal multi-organ annotated) and 8.00% (external aorta annotated). The code and pretrained models are publicly available at https://github.com/MASILab/ContrastMix.  ( 2 min )
    SIBILA: High-performance computing and interpretable machine learning join efforts toward personalised medicine in a novel decision-making tool. (arXiv:2205.06234v1 [cs.LG])
    Background and Objectives: Personalised medicine remains a major challenge for scientists. The rapid growth of Machine learning and Deep learning has made it a feasible alternative for predicting the most appropriate therapy for individual patients. However, the lack of interpretation of their results and high computational requirements make many reluctant to use these methods. Methods: Several Machine learning and Deep learning models have been implemented into a single software tool, SIBILA. Once the models are trained, SIBILA applies a range of interpretability methods to identify the input features that each model considered the most important to predict. In addition, all the features obtained are put in common to estimate the global attribution of each variable to the predictions. To facilitate its use by non-experts, SIBILA is also available to all users free of charge as a web server at https://bio-hpc.ucam.edu/sibila/. Results: SIBILA has been applied to three case studies to show its accuracy and efficiency in classification and regression problems. The first two cases proved that SIBILA can make accurate predictions even on uncleaned datasets. The last case demonstrates that SIBILA can be applied to medical contexts with real data. Conclusion: With the aim of becoming a powerful decision-making tool for clinicians, SIBILA has been developed. SIBILA is a novel software tool that leverages interpretable machine learning to make accurate predictions and explain how models made those decisions. SIBILA can be run on high-performance computing platforms, drastically reducing computing times.  ( 2 min )
    Combining Learning from Human Feedback and Knowledge Engineering to Solve Hierarchical Tasks in Minecraft. (arXiv:2112.03482v2 [cs.LG] UPDATED)
    Real-world tasks of interest are generally poorly defined by human-readable descriptions and have no pre-defined reward signals unless it is defined by a human designer. Conversely, data-driven algorithms are often designed to solve a specific, narrowly defined, task with performance metrics that drives the agent's learning. In this work, we present the solution that won first place and was awarded the most human-like agent in the 2021 NeurIPS Competition MineRL BASALT Challenge: Learning from Human Feedback in Minecraft, which challenged participants to use human data to solve four tasks defined only by a natural language description and no reward function. Our approach uses the available human demonstration data to train an imitation learning policy for navigation and additional human feedback to train an image classifier. These modules, combined with an estimated odometry map, become a powerful state-machine designed to utilize human knowledge in a natural hierarchical paradigm. We compare this hybrid intelligence approach to both end-to-end machine learning and pure engineered solutions, which are then judged by human evaluators. Codebase is available at https://github.com/viniciusguigo/kairos_minerl_basalt.  ( 2 min )
    An $l_1$-oracle inequality for the Lasso in mixture-of-experts regression models. (arXiv:2009.10622v3 [math.ST] UPDATED)
    Mixture-of-experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.
    Social learning via actions in bandit environments. (arXiv:2205.06107v1 [econ.TH])
    I study a game of strategic exploration with private payoffs and public actions in a Bayesian bandit setting. In particular, I look at cascade equilibria, in which agents switch over time from the risky action to the riskless action only when they become sufficiently pessimistic. I show that these equilibria exist under some conditions and establish their salient properties. Individual exploration in these equilibria can be more or less than the single-agent level depending on whether the agents start out with a common prior or not, but the most optimistic agent always underexplores. I also show that allowing the agents to write enforceable ex-ante contracts will lead to the most ex-ante optimistic agent to buy all payoff streams, providing an explanation to the buying out of smaller start-ups by more established firms.  ( 2 min )
    The Implicit Bias of Benign Overfitting. (arXiv:2201.11489v3 [cs.LG] UPDATED)
    The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining low expected loss, has received much attention in recent years, but still remains not fully understood beyond well-specified linear regression setups. In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and classification tasks. We consider a prototypical and rather generic data model for benign overfitting of linear predictors, where an arbitrary input distribution of some fixed dimension $k$ is concatenated with a high-dimensional distribution. For linear regression which is not necessarily well-specified, we show that the minimum-norm interpolating predictor (that standard training methods converge to) is biased towards an inconsistent solution in general, hence benign overfitting will generally not occur. Moreover, we show how this can be extended beyond standard linear regression, by an argument proving how the existence of benign overfitting on some regression problems precludes its existence on other regression problems. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we prove that the max-margin predictor (to which standard training methods are known to converge in direction) is asymptotically biased towards minimizing a weighted squared hinge loss. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the misclassification error, and use it to show benign overfitting in some new settings.
    Healthy Twitter discussions? Time will tell. (arXiv:2203.11261v2 [cs.SI] UPDATED)
    Studying misinformation and how to deal with unhealthy behaviours within online discussions has recently become an important field of research within social studies. With the rapid development of social media, and the increasing amount of available information and sources, rigorous manual analysis of such discourses has become unfeasible. Many approaches tackle the issue by studying the semantic and syntactic properties of discussions following a supervised approach, for example using natural language processing on a dataset labeled for abusive, fake or bot-generated content. Solutions based on the existence of a ground truth are limited to those domains in which a ground truth may exist. However, within the context of misinformation, it may be difficult or even impossible to assign labels to instances. In this context, we consider the use of temporal dynamic patterns as an indicator of discussion health. Working in a domain for which ground truth was unavailable at the time (early COVID-19 pandemic discussions), we explore the characterization of discussions based on the volume and timing of contributions. First, we explore the types of discussions in an unsupervised manner, and then characterize these types using the concept of ephemerality, which we formalize. In the end, we discuss the potential use of our ephemerality definition for labeling online discourses based on how desirable, healthy and constructive they are.
    Neural Network-based OFDM Receiver for Resource Constrained IoT Devices. (arXiv:2205.06159v1 [eess.SP])
    Orthogonal Frequency Division Multiplexing (OFDM)-based waveforms are used for communication links in many current and emerging Internet of Things (IoT) applications, including the latest WiFi standards. For such OFDM-based transceivers, many core physical layer functions related to channel estimation, demapping, and decoding are implemented for specific choices of channel types and modulation schemes, among others. To decouple hard-wired choices from the receiver chain and thereby enhance the flexibility of IoT deployment in many novel scenarios without changing the underlying hardware, we explore a novel, modular Machine Learning (ML)-based receiver chain design. Here, ML blocks replace the individual processing blocks of an OFDM receiver, and we specifically describe this swapping for the legacy channel estimation, symbol demapping, and decoding blocks with Neural Networks (NNs). A unique aspect of this modular design is providing flexible allocation of processing functions to the legacy or ML blocks, allowing them to interchangeably coexist. Furthermore, we study the implementation cost-benefits of the proposed NNs in resource-constrained IoT devices through pruning and quantization, as well as emulation of these compressed NNs within Field Programmable Gate Arrays (FPGAs). Our evaluations demonstrate that the proposed modular NN-based receiver improves the bit error rate of the traditional non-ML receiver by an average of 61% and 10% for the simulated and over-the-air datasets, respectively. We further show complexity-performance tradeoffs by presenting computational complexity comparisons between the traditional algorithms and the proposed compressed NNs.
    Improved Sample Complexity Bounds for Branch-and-Cut. (arXiv:2111.11207v2 [cs.LG] UPDATED)
    Branch-and-cut is the most widely used algorithm for solving integer programs, employed by commercial solvers like CPLEX and Gurobi. Branch-and-cut has a wide variety of tunable parameters that have a huge impact on the size of the search tree that it builds, but are challenging to tune by hand. An increasingly popular approach is to use machine learning to tune these parameters: using a training set of integer programs from the application domain at hand, the goal is to find a configuration with strong predicted performance on future, unseen integer programs from the same domain. If the training set is too small, a configuration may have good performance over the training set but poor performance on future integer programs. In this paper, we prove sample complexity guarantees for this procedure, which bound how large the training set should be to ensure that for any configuration, its average performance over the training set is close to its expected future performance. Our guarantees apply to parameters that control the most important aspects of branch-and-cut: node selection, branching constraint selection, and cutting plane selection, and are sharper and more general than those found in prior research.
    AiSocrates: Towards Answering Ethical Quandary Questions. (arXiv:2205.05989v1 [cs.CL])
    Considerable advancements have been made in various NLP tasks based on the impressive power of large pre-trained language models (LLMs). These results have inspired efforts to understand the limits of LLMs so as to evaluate how far we are from achieving human-level general natural language understanding. In this work, we challenge the capability of LLMs with the new task of Ethical Quandary Generative Question Answering. Ethical quandary questions are more challenging to address because multiple conflicting answers may exist to a single quandary. We propose a system, AiSocrates, that provides an answer with a deliberative exchange of different perspectives to an ethical quandary, in the approach of Socratic philosophy, instead of providing a closed answer like an oracle. AiSocrates searches for different ethical principles applicable to the ethical quandary and generates an answer conditioned on the chosen principles through prompt-based few-shot learning. We also address safety concerns by providing a human controllability option in choosing ethical principles. We show that AiSocrates generates promising answers to ethical quandary questions with multiple perspectives, 6.92% more often than answers written by human philosophers by one measure, but the system still needs improvement to match the coherence of human philosophers fully. We argue that AiSocrates is a promising step toward developing an NLP system that incorporates human values explicitly by prompt instructions. We are releasing the code for research purposes.
    Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. (arXiv:2104.13921v3 [cs.CV] UPDATED)
    We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
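    The two ViLD ingredients, region-embedding distillation toward a teacher and classification against text embeddings, can be sketched as follows; the L1 distillation term and the temperature value are assumptions based on the description, not the released code.

        # Sketch of ViLD-style distillation and text-embedding classification.
        import torch
        import torch.nn.functional as F

        def vild_losses(student_region_emb, teacher_region_emb, text_emb, labels, tau=0.01):
            distill = F.l1_loss(student_region_emb, teacher_region_emb)  # align with teacher
            s = F.normalize(student_region_emb, dim=-1)
            t = F.normalize(text_emb, dim=-1)            # embeddings of category names
            logits = s @ t.T / tau                       # cosine-similarity logits
            cls = F.cross_entropy(logits, labels)        # on known (base) classes
            return distill + cls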
    Fair NLP Models with Differentially Private Text Encoders. (arXiv:2205.06135v1 [cs.CL])
    Encoded text representations often capture sensitive attributes about individuals (e.g., race or gender), which raise privacy concerns and can make downstream models unfair to certain groups. In this work, we propose FEDERATE, an approach that combines ideas from differential privacy and adversarial training to learn private text representations which also induce fairer models. We empirically evaluate the trade-off between the privacy of the representations and the fairness and accuracy of the downstream model on four NLP datasets. Our results show that FEDERATE consistently improves upon previous methods, and thus suggest that privacy and fairness can positively reinforce each other.  ( 2 min )
    Energy-Based Learning for Cooperative Games, with Applications to Valuation Problems in Machine Learning. (arXiv:2106.02938v4 [cs.LG] UPDATED)
    Valuation problems, such as feature interpretation, data valuation and model valuation for ensembles, are becoming increasingly important in many machine learning applications. Such problems are commonly solved by well-known game-theoretic criteria, such as the Shapley value or the Banzhaf value. In this work, we present a novel energy-based treatment for cooperative games, with a theoretical justification by the maximum entropy framework. Surprisingly, by conducting variational inference on the energy-based model, we recover various game-theoretic valuation criteria as one-step fixed point iterations maximizing the mean-field ELBO objective. This observation also verifies the rationality of existing criteria, as they are all attempting to decouple the correlations among the players through the mean-field approach. By running fixed point iteration for multiple steps, we obtain a trajectory of valuations, among which we define the valuation with the best conceivable decoupling error as the Variational Index. We prove that under uniform initializations, these variational valuations all satisfy a set of game-theoretic axioms. We experimentally demonstrate that the proposed Variational Index enjoys lower decoupling error and better valuation performance on certain synthetic and real-world valuation problems.  ( 2 min )
    Generating Fair Universal Representations using Adversarial Models. (arXiv:1910.00411v7 [cs.LG] UPDATED)
    We present a data-driven framework for learning fair universal representations (FUR) that guarantee statistical fairness for any learning task that may not be known a priori. Our framework leverages recent advances in adversarial learning to allow a data holder to learn representations in which a set of sensitive attributes are decoupled from the rest of the dataset. We formulate this as a constrained minimax game between an encoder and an adversary where the constraint ensures a measure of usefulness (utility) of the representation. The resulting problem is that of censoring, i.e., finding a representation that is least informative about the sensitive attributes given a utility constraint. For appropriately chosen adversarial loss functions, our censoring framework precisely clarifies the optimal adversarial strategy against strong information-theoretic adversaries; it also achieves the fairness measure of demographic parity for the resulting constrained representations. We evaluate the performance of our proposed framework on both synthetic and publicly available datasets. For these datasets, we use two tradeoff measures: censoring vs. representation fidelity and fairness vs. utility for downstream tasks, to amply demonstrate that multiple sensitive features can be effectively censored even as the resulting fair representations ensure accuracy for multiple downstream tasks.  ( 2 min )
    So Cloze yet so Far: N400 Amplitude is Better Predicted by Distributional Information than Human Predictability Judgements. (arXiv:2109.01226v2 [cs.CL] UPDATED)
    More predictable words are easier to process - they are read faster and elicit smaller neural signals associated with processing difficulty, most notably, the N400 component of the event-related brain potential. Thus, it has been argued that prediction of upcoming words is a key component of language comprehension, and that studying the amplitude of the N400 is a valuable way to investigate the predictions we make. In this study, we investigate whether the linguistic predictions of computational language models or humans better reflect the way in which natural language stimuli modulate the amplitude of the N400. One important difference in the linguistic predictions of humans versus computational language models is that while language models base their predictions exclusively on the preceding linguistic context, humans may rely on other factors. We find that the predictions of three top-of-the-line contemporary language models - GPT-3, RoBERTa, and ALBERT - match the N400 more closely than human predictions. This suggests that the predictive processes underlying the N400 may be more sensitive to the surface-level statistics of language than previously thought.  ( 2 min )
    Efficient Federated Learning for AIoT Applications Using Knowledge Distillation. (arXiv:2111.14347v2 [cs.LG] UPDATED)
    As a promising distributed machine learning paradigm, Federated Learning (FL) trains a central model with decentralized data without compromising user privacy, which has made it widely used in Artificial Intelligence Internet of Things (AIoT) applications. However, traditional FL suffers from model inaccuracy since it trains local models using hard data labels and ignores the useful information in incorrect predictions with small probabilities. Although various solutions try to tackle this bottleneck of traditional FL, most of them introduce significant communication and memory overhead, making deployment on large-scale AIoT devices a great challenge. To address the above problem, this paper presents a novel Distillation-based Federated Learning (DFL) architecture that enables efficient and accurate FL for AIoT applications. Inspired by Knowledge Distillation (KD), which can increase model accuracy, our approach adds the soft targets used by KD to the FL model training, which occupies negligible network resources. The soft targets are generated by local sample predictions of each AIoT device after each round of local training and used for the next round of model training. During the local training of DFL, both soft targets and hard labels are used as approximation objectives of model predictions to improve model accuracy by supplementing the knowledge of soft targets. To further improve the performance of our DFL model, we design a dynamic adjustment strategy for tuning the ratio of the two loss functions used in KD, which can maximize the use of both soft targets and hard labels. Comprehensive experimental results on well-known benchmarks show that our approach can significantly improve the model accuracy of FL with both Independent and Identically Distributed (IID) and non-IID data.  ( 2 min )
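    A minimal sketch of the training objective this describes, combining hard labels with the soft targets distilled from the previous round; the temperature and the exact form of the dynamically adjusted ratio alpha are our assumptions.

```python
# Hedged sketch: distillation-based FL loss mixing hard labels and the
# soft targets produced after the previous round of local training.
import torch.nn.functional as F

def dfl_loss(student_logits, hard_labels, prev_soft_targets, alpha, T=2.0):
    hard = F.cross_entropy(student_logits, hard_labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    prev_soft_targets,               # probabilities, last round
                    reduction="batchmean") * (T * T)
    # The paper tunes this ratio dynamically; alpha here is a placeholder.
    return alpha * soft + (1.0 - alpha) * hard
```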
    Training Uncertainty-Aware Classifiers with Conformalized Deep Learning. (arXiv:2205.05878v1 [stat.ML])
    Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. We address this problem by developing a novel training algorithm that can lead to more dependable uncertainty estimates, without sacrificing predictive power. The idea is to mitigate overconfidence by minimizing a loss function, inspired by advances in conformal inference, that quantifies model uncertainty by carefully leveraging hold-out data. Experiments with synthetic and real data demonstrate this method leads to smaller conformal prediction sets with higher conditional coverage, after exact calibration with hold-out data, compared to state-of-the-art alternatives.  ( 2 min )
    Performing Video Frame Prediction of Microbial Growth with a Recurrent Neural Network. (arXiv:2205.05810v1 [cs.LG])
    A Recurrent Neural Network (RNN) was used to perform video frame prediction of microbial growth for a population of two mutants of Pseudomonas aeruginosa. The RNN was trained on videos of 20 frames that were acquired using fluorescence microscopy and microfluidics. The network predicted the last 10 frames of each video, and the accuracy of the predictions was assessed by comparing raw images, population curves, and the number and size of individual colonies. Overall, we found the predictions to be accurate using this approach. The implications this result has on designing autonomous experiments in microbiology, and the steps that can be taken to make the predictions even more accurate, are discussed.  ( 2 min )
    Subgroup discovery of Parkinson's Disease by utilizing a multi-modal smart device system. (arXiv:2205.05961v1 [cs.LG])
    In recent years, sensors from smart consumer devices have shown great diagnostic potential in movement disorders. In this context, data modalities such as electronic questionnaires, hand movement and voice captures have successfully captured biomarkers and allowed discrimination between Parkinson's disease (PD) and healthy controls (HC) or differential diagnosis (DD). However, to the best of our knowledge, a comprehensive evaluation of assessments with a multi-modal smart device system has still been lacking. In a prospective study exploring PD, we used smartwatches and smartphones to collect multi-modal data from 504 participants, including PD patients, DD and HC. This study aims to assess the effect of multi-modal vs. single-modal data on PD vs. HC and PD vs. DD classification, as well as on PD group clustering for subgroup identification. We were able to show that by combining various modalities, classification accuracy improved and further PD clusters were discovered.  ( 2 min )
    Topologically-Aware Deformation Fields for Single-View 3D Reconstruction. (arXiv:2205.06267v1 [cs.CV])
    We present a new framework for learning 3D object shapes and dense cross-object 3D correspondences from just an unaligned category-specific image collection. The 3D shapes are generated implicitly as deformations to a category-specific signed distance field and are learned in an unsupervised manner solely from unaligned image collections without any 3D supervision. Generally, image collections on the internet contain several intra-category geometric and topological variations, for example, different chairs can have different topologies, which makes the task of joint shape and correspondence estimation much more challenging. Because of this, prior works either focus on learning each 3D object shape individually without modeling cross-instance correspondences or perform joint shape and correspondence estimation on categories with minimal intra-category topological variations. We overcome these restrictions by learning a topologically-aware implicit deformation field that maps a 3D point in the object space to a higher dimensional point in the category-specific canonical space. At inference time, given a single image, we reconstruct the underlying 3D shape by first implicitly deforming each 3D point in the object space to the learned category-specific canonical space using the topologically-aware deformation field and then reconstructing the 3D shape as a canonical signed distance field. Both canonical shape and deformation field are learned end-to-end in an inverse-graphics fashion using a learned recurrent ray marcher (SRN) as a differentiable rendering module. Our approach, dubbed TARS, achieves state-of-the-art reconstruction fidelity on several datasets: ShapeNet, Pascal3D+, CUB, and Pix3D chairs. Result videos and code at https://shivamduggal4.github.io/tars-3D/  ( 2 min )
    Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers. (arXiv:2205.05055v2 [cs.AI] CROSS LISTED)
    Large transformer-based language models are able to perform few-shot learning (also known as in-context learning), without having been explicitly trained for it. We hypothesized that specific distributional properties of natural language might drive this emergent phenomenon, as these characteristics might lead to a kind of interpolation between few-shot meta-training (designed to elicit rapid few-shot learning) and standard supervised training (designed to elicit gradual in-weights learning). We also hypothesized that these distributional properties could lead to emergent few-shot learning in domains outside of language. Inspired by this idea, we ran a series of experiments on a standard image-based few-shot dataset. We discovered that a number of data properties did indeed promote the emergence of few-shot learning in transformer models. All of these properties are present in natural language -- burstiness, long-tailedness, and many-to-one or one-to-many label mappings. The data influenced whether models were biased towards few-shot learning or towards memorizing information in their weights; models could generally perform well at only one or the other. However, we discovered that an additional distributional property could allow the two capabilities to co-exist in the same model -- a skewed, Zipfian distribution over classes -- which occurs in language as well. Notably, training data that could elicit few-shot learning in transformers were unable to elicit few-shot learning in recurrent models. In sum, we find that few-shot learning emerges only from applying the right architecture to the right data distribution; neither component is sufficient on its own.  ( 2 min )
    Automatic Segmentation of Head and Neck Tumor: How Powerful Transformers Are?. (arXiv:2201.06251v2 [eess.IV] UPDATED)
    Cancer is one of the leading causes of death worldwide, and head and neck (H&N) cancer is amongst the most prevalent types. Positron emission tomography and computed tomography are used to detect, segment and quantify the tumor region. Clinically, tumor segmentation is highly time-consuming and prone to error. Machine learning, and deep learning in particular, can help automate this process, yielding results as accurate as the results of a clinician. In this paper, we investigate a vision transformer-based method to automatically delineate H&N tumor, and compare its results to leading convolutional neural network (CNN)-based models. We use multi-modal data from CT and PET scans to perform the segmentation task. We show that a solution with a transformer-based model has the potential to achieve comparable results to CNN-based ones. With cross validation, the model achieves a mean dice similarity coefficient (DSC) of 0.736, mean precision of 0.766 and mean recall of 0.766. This is only 0.021 less than the 2020 competition winning model (cross validated in-house) in terms of the DSC score. On the testing set, the model performs similarly, with DSC of 0.736, precision of 0.773, and recall of 0.760, which is only 0.023 lower in DSC than the 2020 competition winning model. This work shows that cancer segmentation via transformer-based models is a promising research area to further explore.  ( 2 min )
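    For reference, the Dice similarity coefficient reported above is, for binary masks A and B, DSC = 2|A ∩ B| / (|A| + |B|). A straightforward NumPy version:

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between two binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```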
    kNN-Embed: Locally Smoothed Embedding Mixtures For Multi-interest Candidate Retrieval. (arXiv:2205.06205v1 [cs.IR])
    Candidate generation is the first stage in recommendation systems, where a light-weight system is used to retrieve potentially relevant items for an input user. These candidate items are then ranked and pruned in later stages of recommender systems using a more complex ranking model. Since candidate generation is the top of the recommendation funnel, it is important to retrieve a high-recall candidate set to feed into downstream ranking models. A common approach for candidate generation is to leverage approximate nearest neighbor (ANN) search from a single dense query embedding; however, this approach can yield a low-diversity result set with many near duplicates. As users often have multiple interests, candidate retrieval should ideally return a diverse set of candidates reflective of the user's multiple interests. To this end, we introduce kNN-Embed, a general approach to improving diversity in dense ANN-based retrieval. kNN-Embed represents each user as a smoothed mixture over learned item clusters that represent distinct `interests' of the user. By querying each of a user's mixture components in proportion to their mixture weights, we retrieve a high-diversity set of candidates reflecting elements from each of a user's interests. We experimentally compare kNN-Embed to standard ANN candidate retrieval, and show significant improvements in overall recall and improved diversity across three datasets. Accompanying this work, we open source a large Twitter follow-graph dataset, to spur further research in graph-mining and representation learning for recommender systems.  ( 2 min )
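    The retrieval scheme sketched below follows the description above: a user is a mixture over item clusters, and the ANN budget is split across components in proportion to their weights. `ann_search` is a hypothetical ANN index query, and the per-component rounding is our assumption.

```python
# Hedged sketch of kNN-Embed-style multi-interest candidate retrieval.
def mixture_retrieve(user_mixture, ann_search, budget=100):
    """user_mixture: list of (cluster_embedding, weight); weights sum to 1.
    ann_search(embedding, k): assumed ANN query returning k item ids."""
    candidates = []
    for cluster_emb, weight in user_mixture:
        k = max(1, round(budget * weight))     # quota for this interest
        candidates.extend(ann_search(cluster_emb, k))
    seen, diverse = set(), []
    for item in candidates:                    # de-duplicate, keep order
        if item not in seen:
            seen.add(item)
            diverse.append(item)
    return diverse[:budget]
```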
    A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection. (arXiv:2205.05684v1 [eess.AS])
    Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, until recently it was traditionally studied in isolation, assuming that the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was set aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Secondly, we show under closer scrutiny that an end-to-end model performs at least as well as a considerably larger two-step system that utilizes a hard decision boundary, under various noise conditions and numbers of parallel face tracks.  ( 2 min )
    Image Segmentation with Topological Priors. (arXiv:2205.06197v1 [cs.CV])
    Solving segmentation tasks with topological priors has been shown to produce fewer errors in fine-scale structures. In this work, we use topological priors both before and during the deep neural network training procedure. We compared the results of the two approaches against simple segmentation on various accuracy metrics and the Betti number error, which is directly related to topological correctness, and discovered that incorporating topological information into the classical UNet model performed significantly better. We conducted experiments on the ISBI EM segmentation dataset.  ( 2 min )
    Fighting Money Laundering with Statistics and Machine Learning: An Introduction and Review. (arXiv:2201.04207v3 [stat.ML] UPDATED)
    Money laundering is a profound global problem. Nonetheless, there is little statistical and machine learning research on the topic. In this paper, we focus on anti-money laundering in banks. To help organize existing research, we propose a unifying terminology and provide a review of the literature. This is structured around two central tasks: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. Suspicious behavior flagging, on the other hand, is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is a lack of public data sets. This may, potentially, be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability, and fairness of the results.  ( 2 min )
    SUPER-ADAM: Faster and Universal Framework of Adaptive Gradients. (arXiv:2106.08208v10 [math.OC] UPDATED)
    Adaptive gradient methods have shown excellent performance for solving many machine learning problems. Although multiple adaptive gradient methods have recently been studied, they mainly focus on either empirical or theoretical aspects and work only for specific problems by using specific adaptive learning rates. Thus, a universal framework for practical adaptive-gradient algorithms with theoretical guarantees for general problems is desirable. To fill this gap, we propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Moreover, our framework can flexibly integrate momentum and variance-reduction techniques. In particular, our novel framework provides convergence analysis support for adaptive gradient methods in the nonconvex setting. In theoretical analysis, we prove that our SUPER-ADAM algorithm can achieve the best-known gradient (i.e., stochastic first-order oracle (SFO)) complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of nonconvex optimization, which matches the lower bound for stochastic smooth nonconvex optimization. In numerical experiments, we employ various deep learning tasks to validate that our algorithm consistently outperforms the existing adaptive algorithms. Code is available at https://github.com/LIJUNYI95/SuperAdam  ( 2 min )
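    To illustrate the shape of such a framework (not the paper's exact update rules), the sketch below combines a momentum estimate with a diagonal adaptive matrix, the slot that SUPER-ADAM generalizes; variance-reduced momentum and other adaptive matrices plug into the same structure.

```python
# Hedged sketch of an adaptive-matrix update in the style the framework
# generalizes; the specific coefficients are Adam-like assumptions.
import numpy as np

def adaptive_step(x, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # momentum (variance-reduced
                                              # variants replace this line)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment statistics
    H_diag = np.sqrt(v) + eps                 # diagonal adaptive matrix
    return x - lr * m / H_diag, m, v
```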
    One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code. (arXiv:2205.06126v1 [cs.CL])
    People perceive the world with multiple senses (e.g., through hearing sounds, reading words and seeing objects). However, most existing AI systems only process an individual modality. This paper presents an approach that excels at handling multiple modalities of information with a single model. In our "SkillNet" model, different parts of the parameters are specialized for processing different modalities. Unlike traditional dense models that always activate all the model parameters, our model sparsely activates parts of the parameters whose skills are relevant to the task. Such model design enables SkillNet to learn skills in a more interpretable way. We develop our model for five modalities including text, image, sound, video and code. Results show that SkillNet performs comparably to five modality-specific fine-tuned models. Moreover, our model supports self-supervised pretraining in the same sparsely activated way, resulting in better initialized parameters for different modalities. We find that pretraining significantly improves the performance of SkillNet on five modalities, on par with or even better than baselines with modality-specific pretraining. On the task of Chinese text-to-image retrieval, our final system achieves higher accuracy than existing leading systems including Wukong (ViT-B) and Wenlan 2.0 while activating fewer parameters.  ( 2 min )
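    A toy sketch of the sparse activation pattern described above: each block holds modality-specific parameter subsets ("skills"), and only the experts relevant to the input's modalities are executed. The module layout and routing are illustrative assumptions.

```python
# Hedged sketch of SkillNet-style sparse activation (layout assumed).
import torch.nn as nn

class SkillBlock(nn.Module):
    def __init__(self, dim, skills=("text", "image", "sound", "video", "code")):
        super().__init__()
        self.experts = nn.ModuleDict({s: nn.Linear(dim, dim) for s in skills})

    def forward(self, x, active_skills):
        # A dense model would run every expert; here only relevant ones fire.
        out = sum(self.experts[s](x) for s in active_skills)
        return out / len(active_skills)
```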
    Controlling chaotic itinerancy in laser dynamics for reinforcement learning. (arXiv:2205.05987v1 [physics.optics])
    Photonic artificial intelligence has attracted considerable interest in accelerating machine learning; however, the unique optical properties have not been fully utilized for achieving higher-order functionalities. Chaotic itinerancy, with its spontaneous transient dynamics among multiple quasi-attractors, can be employed to realize brain-like functionalities. In this paper, we propose a method for controlling the chaotic itinerancy in a multi-mode semiconductor laser to solve a machine learning task, known as the multi-armed bandit problem, which is fundamental to reinforcement learning. The proposed method utilizes ultrafast chaotic itinerant motion in mode competition dynamics controlled via optical injection. We found that the exploration mechanism is completely different from a conventional searching algorithm and is highly scalable, outperforming the conventional approaches for large-scale bandit problems. This study paves the way to utilize chaotic itinerancy for effectively solving complex machine learning tasks as photonic hardware accelerators.  ( 2 min )
    Algebraic Machine Learning with an Application to Chemistry. (arXiv:2205.05795v1 [math.AG])
    As data used in scientific applications become more complex, studying their geometry and topology has become an increasingly prevalent part of the data analysis process. This can be seen for example with the growing interest in topological tools such as persistent homology. However, on the one hand, topological tools are inherently limited to providing only coarse information about the underlying space of the data. On the other hand, more geometric approaches rely predominantly on the manifold hypothesis, which asserts that the underlying space is a smooth manifold. This assumption fails for many physical models where the underlying space contains singularities. In this paper we develop a machine learning pipeline that captures fine-grain geometric information without having to rely on any smoothness assumptions. Our approach involves working within the scope of algebraic geometry and algebraic varieties instead of differential geometry and smooth manifolds. In the setting of the variety hypothesis, the learning problem becomes to find the underlying variety using sample data. We cast this learning problem into a Maximum A Posteriori optimization problem which we solve in terms of an eigenvalue computation. Having found the underlying variety, we explore the use of Gröbner bases and numerical methods to reveal information about its geometry. In particular, we propose a heuristic for numerically detecting points lying near the singular locus of the underlying variety.  ( 2 min )
    Privacy-Preserving Distributed Machine Learning Made Faster. (arXiv:2205.05825v1 [cs.CR])
    With the development of machine learning, it is difficult for a single server to process all the data, so machine learning tasks need to be spread across multiple servers, turning centralized machine learning into distributed machine learning. However, privacy remains an unsolved problem in distributed machine learning. Multi-key homomorphic encryption is one of the suitable candidates to solve the problem. However, the most recent multi-key homomorphic encryption scheme (MKTFHE) only supports the NAND gate. Although it is Turing complete, it requires efficient encapsulation of the NAND gate to further support mathematical calculation. This paper designs and implements a series of accurate operations on positive and negative integers. First, we design basic bootstrapped gates with the same efficiency as that of the NAND gate. Second, we construct practical $k$-bit complement mathematical operators based on our basic binary bootstrapped gates. The constructed operators can perform addition, subtraction, multiplication, and division on both positive and negative integers. Finally, we demonstrate the generality of the designed operators by implementing a distributed privacy-preserving machine learning algorithm, i.e., linear regression, with two different solutions. Experiments show that the operators we designed are practical and efficient.  ( 2 min )
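    As a plaintext illustration of how k-bit complement arithmetic decomposes into basic gates (the homomorphic setting evaluates the same circuit over ciphertexts), here is a ripple-carry adder built only from gate-level operations; booleans stand in for bootstrapped ciphertexts.

```python
# Plaintext sketch of the gate-level circuit; in MKTFHE each ^, &, |
# would be a bootstrapped homomorphic gate on encrypted bits.
def full_adder(a: int, b: int, carry: int):
    s = a ^ b ^ carry                        # XOR gates
    carry_out = (a & b) | (carry & (a ^ b))  # AND/OR gates
    return s, carry_out

def add_kbit(x_bits, y_bits):
    """Ripple-carry addition over little-endian bit lists (two's complement)."""
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out  # the final carry is dropped, as in fixed-width arithmetic
```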
    Long Story Short: Omitted Variable Bias in Causal Machine Learning. (arXiv:2112.13398v3 [econ.EM] UPDATED)
    We derive general, yet simple, sharp bounds on the size of the omitted variable bias for a broad class of causal parameters that can be identified as linear functionals of the conditional expectation function of the outcome. Such functionals encompass many of the traditional targets of investigation in causal inference studies, such as, for example, (weighted) average of potential outcomes, average treatment effects (including subgroup effects, such as the effect on the treated), (weighted) average derivatives, and policy effects from shifts in covariate distribution -- all for general, nonparametric causal models. Our construction relies on the Riesz-Frechet representation of the target functional. Specifically, we show how the bound on the bias depends only on the additional variation that the latent variables create both in the outcome and in the Riesz representer for the parameter of interest. Moreover, in many important cases (e.g., average treatment effects and average derivatives) the bound is shown to depend on easily interpretable quantities that measure the explanatory power of the omitted variables. Therefore, simple plausibility judgments on the maximum explanatory power of omitted variables (in explaining treatment and outcome variation) are sufficient to place overall bounds on the size of the bias. Furthermore, we use debiased machine learning to provide flexible and efficient statistical inference on learnable components of the bounds. Finally, empirical examples demonstrate the usefulness of the approach.  ( 2 min )
    Orthogonal Gromov-Wasserstein Discrepancy with Efficient Lower Bound. (arXiv:2205.05838v1 [cs.LG])
    Comparing structured data from possibly different metric-measure spaces is a fundamental task in machine learning, with applications in, e.g., graph classification. The Gromov-Wasserstein (GW) discrepancy formulates a coupling between the structured data based on optimal transportation, tackling the incomparability between different structures by aligning the intra-relational geometries. Although efficient local solvers such as conditional gradient and Sinkhorn are available, the inherent non-convexity still prevents a tractable evaluation, and the existing lower bounds are not tight enough for practical use. To address this issue, we take inspiration from the connection with the quadratic assignment problem, and propose the orthogonal Gromov-Wasserstein (OGW) discrepancy as a surrogate of GW. It admits an efficient and closed-form lower bound with the complexity of $\mathcal{O}(n^3)$, and directly extends to the fused Gromov-Wasserstein (FGW) distance, incorporating node features into the coupling. Extensive experiments on both the synthetic and real-world datasets show the tightness of our lower bounds, and both OGW and its lower bounds efficiently deliver accurate predictions and satisfactory barycenters for graph sets.  ( 2 min )
    GAN-DUF: Hierarchical Deep Generative Models for Design Under Free-Form Geometric Uncertainty. (arXiv:2202.10558v3 [cs.CE] UPDATED)
    Deep generative models have demonstrated effectiveness in learning compact and expressive design representations that significantly improve geometric design optimization. However, these models do not consider the uncertainty introduced by manufacturing or fabrication. Past work that quantifies such uncertainty often makes simplifying assumptions on geometric variations, while the "real-world", "free-form" uncertainty and its impact on design performance are difficult to quantify due to the high dimensionality. To address this issue, we propose a Generative Adversarial Network-based Design under Uncertainty Framework (GAN-DUF), which contains a deep generative model that simultaneously learns a compact representation of nominal (ideal) designs and the conditional distribution of fabricated designs given any nominal design. This opens up new possibilities of 1) building a universal uncertainty quantification model compatible with both shape and topological designs, 2) modeling free-form geometric uncertainties without the need to make any assumptions on the distribution of geometric variability, and 3) allowing fast prediction of uncertainties for new nominal designs. We can combine the proposed deep generative model with robust design optimization or reliability-based design optimization for design under uncertainty. We demonstrated the framework on two real-world engineering design examples and showed its capability of finding the solution that possesses better performances after fabrication.  ( 2 min )
    Leveraging Uncertainty for Deep Interpretable Classification and Weakly-Supervised Segmentation of Histology Images. (arXiv:2205.05841v1 [eess.IV])
    Trained using only image class labels, deep weakly supervised methods allow image classification and ROI segmentation for interpretability. Despite their success on natural images, they face several challenges over histology data, where ROIs are visually similar to the background, making models vulnerable to high pixel-wise false positives. These methods lack mechanisms for explicitly modeling non-discriminative regions, which raises false-positive rates. We propose novel regularization terms that enable the model to seek both non-discriminative and discriminative regions, while discouraging unbalanced segmentations and using only image class labels. Our method is composed of two networks: a localizer that yields a segmentation mask, followed by a classifier. The training loss pushes the localizer to build a segmentation mask that holds the most discriminative regions while simultaneously modeling background regions. Comprehensive experiments over two histology datasets showed the merits of our method in reducing false positives and accurately segmenting ROIs.  ( 2 min )
    Comments on: "Hybrid Semiparametric Bayesian Networks". (arXiv:2205.05910v1 [stat.ME])
    Invited discussion on the paper "Hybrid Semiparametric Bayesian Networks" by David Atienza, Pedro Larranaga and Concha Bielza (TEST, 2022).  ( 2 min )
    Accounting for the Sequential Nature of States to Learn Features for Reinforcement Learning. (arXiv:2205.06000v1 [cs.LG])
    In this work, we investigate the properties of data that cause popular representation learning approaches to fail. In particular, we find that in environments where states do not significantly overlap, variational autoencoders (VAEs) fail to learn useful features. We demonstrate this failure in a simple gridworld domain, and then provide a solution in the form of metric learning. However, metric learning requires supervision in the form of a distance function, which is absent in reinforcement learning. To overcome this, we leverage the sequential nature of states in a replay buffer to approximate a distance metric and provide a weak supervision signal, under the assumption that temporally close states are also semantically similar. We modify a VAE with triplet loss and demonstrate that this approach is able to learn useful features for downstream tasks, without additional supervision, in environments where standard VAEs fail.  ( 2 min )
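    A minimal sketch of the weak supervision signal just described: temporally adjacent states from the replay buffer serve as positives and randomly drawn states as (likely) negatives for a triplet term added to the VAE loss. The margin and the sampling scheme are our assumptions.

```python
# Hedged sketch: triplet supervision from temporal proximity in a buffer.
import random
import torch.nn.functional as F

def sample_triplet(trajectory, horizon=1):
    """Return (anchor, positive, negative) states from one trajectory."""
    i = random.randrange(len(trajectory) - horizon)
    return trajectory[i], trajectory[i + horizon], random.choice(trajectory)

def triplet_term(encode, anchor, pos, neg, margin=1.0):
    # encode: the VAE encoder mapping a batch of states to latent codes
    return F.triplet_margin_loss(encode(anchor), encode(pos), encode(neg),
                                 margin=margin)
```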
    Secure Aggregation for Federated Learning in Flower. (arXiv:2205.06117v1 [cs.LG])
    Federated Learning (FL) allows parties to learn a shared prediction model by delegating the training computation to clients and aggregating all the separately trained models on the server. To prevent private information being inferred from local models, Secure Aggregation (SA) protocols are used to ensure that the server is unable to inspect individual trained models as it aggregates them. However, current implementations of SA in FL frameworks have limitations, including vulnerability to client dropouts or configuration difficulties. In this paper, we present Salvia, an implementation of SA for Python users in the Flower FL framework. Based on the SecAgg(+) protocols for a semi-honest threat model, Salvia is robust against client dropouts and exposes a flexible and easy-to-use API that is compatible with various machine learning frameworks. We show that Salvia's experimental performance is consistent with SecAgg(+)'s theoretical computation and communication complexities.  ( 2 min )
    A Generalist Agent. (arXiv:2205.06175v1 [cs.AI])
    Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.  ( 2 min )
    Reducing Overconfidence Predictions for Autonomous Driving Perception. (arXiv:2202.07825v2 [cs.CV] UPDATED)
    In state-of-the-art deep learning for object recognition, SoftMax and Sigmoid functions are most commonly employed as the predictor outputs. Such layers often produce overconfident predictions rather than proper probabilistic scores, which can thus harm the decision-making of "critical" perception systems applied in autonomous driving and robotics. Given this, the experiments in this work propose a probabilistic approach based on distributions calculated out of the Logit layer scores of pre-trained networks. We demonstrate that Maximum Likelihood (ML) and Maximum a-Posteriori (MAP) functions are more suitable for probabilistic interpretations than SoftMax and Sigmoid-based predictions for object recognition. We explore distinct sensor modalities via RGB images and LiDARs (RV: range-view) data from the KITTI and Lyft Level-5 datasets, where our approach shows promising performance compared to the usual SoftMax and Sigmoid layers, with the benefit of enabling interpretable probabilistic predictions. Another advantage of the approach introduced in this paper is that the ML and MAP functions can be implemented in existing trained networks, that is, the approach benefits from the output of the Logit layer of pre-trained networks. Thus, there is no need to carry out a new training phase since the ML and MAP functions are used in the test/prediction phase.  ( 2 min )
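    To give a flavor of scoring from Logit-layer distributions (the concrete distributions and estimators in the paper may differ), the sketch below fits a per-class Gaussian to the logits of a pre-trained network and produces MAP posteriors at test time, with no retraining:

```python
# Hedged sketch: per-class Gaussian on logit scores, MAP at test time.
import numpy as np
from scipy.stats import norm

def fit_logit_gaussians(train_logits, train_labels, num_classes):
    params = []
    for c in range(num_classes):
        z = train_logits[train_labels == c, c]  # class-c logit on class-c data
        params.append((z.mean(), z.std() + 1e-8))
    return params

def map_posterior(logits, params, priors):
    like = np.array([norm.pdf(logits[c], mu, sd)
                     for c, (mu, sd) in enumerate(params)])
    post = like * priors
    return post / post.sum()                    # interpretable probabilities
```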
    Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering. (arXiv:2204.04581v2 [cs.CL] UPDATED)
    Retrieval augmented language models have recently become the standard for knowledge intensive tasks. Rather than relying purely on latent semantics within the parameters of large neural models, these methods enlist a semi-parametric memory to encode an index of knowledge for the model to retrieve over. Most prior work has employed text passages as the unit of knowledge, which has high coverage at the cost of interpretability, controllability, and efficiency. The opposite properties arise in other methods which have instead relied on knowledge base (KB) facts. At the same time, more recent work has demonstrated the effectiveness of storing and retrieving from an index of Q-A pairs derived from text (Lewis et al., 2021). This approach yields a high coverage knowledge representation that maintains KB-like properties due to its representations being more atomic units of information. In this work we push this line of research further by proposing a question-answer augmented encoder-decoder model and accompanying pretraining strategy. This yields an end-to-end system that not only outperforms prior QA retrieval methods on single-hop QA tasks but also enables compositional reasoning, as demonstrated by strong performance on two multi-hop QA datasets. Together, these methods improve the ability to interpret and control the model while narrowing the performance gap with passage retrieval systems.  ( 2 min )
    Adversarial Estimators. (arXiv:2204.10495v2 [econ.EM] UPDATED)
    We develop an asymptotic theory of adversarial estimators ('A-estimators'). They generalize maximum-likelihood-type estimators ('M-estimators') as their objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.  ( 2 min )
    Dimension-adaptive machine-learning-based quantum state reconstruction. (arXiv:2205.05804v1 [quant-ph])
    We introduce an approach for performing quantum state reconstruction on systems of $n$ qubits using a machine-learning-based reconstruction system trained exclusively on $m$ qubits, where $m\geq n$. This approach removes the necessity of exactly matching the dimensionality of a system under consideration with the dimension of a model used for training. We demonstrate our technique by performing quantum state reconstruction on randomly sampled systems of one, two, and three qubits using machine-learning-based methods trained exclusively on systems containing at least one additional qubit. The reconstruction time required for machine-learning-based methods scales significantly more favorably than the training time; hence this technique can offer an overall savings of resources by leveraging a single neural network for dimension-variable state reconstruction, obviating the need to train dedicated machine-learning systems for each Hilbert space.  ( 2 min )
    eFedDNN: Ensemble based Federated Deep Neural Networks for Trajectory Mode Inference. (arXiv:2205.05756v1 [cs.LG])
    As the most significant data source in smart mobility systems, GPS trajectories can help identify user travel mode. However, these GPS datasets may contain users' private information (e.g., home location), preventing many users from sharing their private information with a third party. Hence, identifying travel modes while protecting users' privacy is a significant issue. To address this challenge, we use federated learning (FL), a privacy-preserving machine learning technique that aims at collaboratively training a robust global model by accessing users' locally trained models but not their raw data. Specifically, we designed a novel ensemble-based Federated Deep Neural Network (eFedDNN). The ensemble method combines the outputs of the different models learned via FL by the users and shows an accuracy that surpasses comparable models reported in the literature. Extensive experimental studies on a real-world open-access dataset from Montreal demonstrate that the proposed inference model can achieve accurate identification of users' mode of travel without compromising privacy.  ( 2 min )
    Cross-domain Few-shot Meta-learning Using Stacking. (arXiv:2205.05831v1 [cs.CV])
    Cross-domain few-shot meta-learning (CDFSML) addresses learning problems where knowledge needs to be transferred from several source domains into an instance-scarce target domain with an explicitly different input distribution. Recently published CDFSML methods generally construct a "universal model" that combines knowledge of multiple source domains into one backbone feature extractor. This enables efficient inference but necessitates re-computation of the backbone whenever a new source domain is added. Moreover, state-of-the-art methods derive their universal model from a collection of backbones -- normally one for each source domain -- and the backbones may be constrained to have the same architecture as the universal model. We propose a CDFSML method that is inspired by the classic stacking approach to meta-learning. It imposes no constraints on the backbones' architecture or feature shape and does not incur the computational overhead of (re-)computing a universal model. Given a target-domain task, it fine-tunes each backbone independently, uses cross-validation to extract meta-training data from the task's instance-scarce support set, and learns a simple linear meta classifier from this data. We evaluate our stacking approach on the well-known Meta-Dataset benchmark, targeting image classification with convolutional neural networks, and show that it often yields substantially higher accuracy than competing methods.  ( 2 min )
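    A compact sketch of that stacking recipe, assuming sklearn-style classifiers stand in for the fine-tuned backbones: cross-validated per-backbone predictions on the support set become meta-features for a simple linear meta classifier.

```python
# Hedged sketch of stacking for CDFSML (sklearn-style stand-ins assumed).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_meta_classifier(support_X, support_y, backbones):
    # Cross-validation extracts meta-training data from the small support set.
    meta_feats = np.hstack([
        cross_val_predict(b, support_X, support_y, cv=3, method="predict_proba")
        for b in backbones
    ])
    return LogisticRegression(max_iter=1000).fit(meta_feats, support_y)
```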
    Open Vocabulary Extreme Classification Using Generative Models. (arXiv:2205.05812v1 [cs.CL])
    The extreme multi-label classification (XMC) task aims at tagging content with a subset of labels from an extremely large label set. The label vocabulary is typically defined in advance by domain experts and assumed to capture all necessary tags. However, in real-world scenarios this label set, although large, is often incomplete and experts frequently need to refine it. To develop systems that simplify this process, we introduce the task of open vocabulary XMC (OXMC): given a piece of content, predict a set of labels, some of which may be outside of the known tag set. Hence, in addition to not having training data for some labels - as is the case in zero-shot classification - models need to invent some labels on-the-fly. We propose GROOV, a fine-tuned seq2seq model for OXMC that generates the set of labels as a flat sequence and is trained using a novel loss independent of predicted label order. We show the efficacy of the approach, experimenting with popular XMC datasets for which GROOV is able to predict meaningful labels outside the given vocabulary while performing on par with state-of-the-art solutions for known labels.  ( 2 min )
    Deep-Learned Generators of Porosity Distributions Produced During Metal Additive Manufacturing. (arXiv:2205.05794v1 [cs.LG])
    Laser Powder Bed Fusion has become a widely adopted method for metal Additive Manufacturing (AM) due to its ability to mass produce complex parts with increased local control. However, AM-produced parts can be subject to undesirable porosity, negatively influencing the properties of printed components. Thus, controlling porosity is integral for creating effective parts. A precise understanding of the porosity distribution is crucial for accurately simulating potential fatigue and failure zones. Previous research on generating synthetic porous microstructures has succeeded in generating parts with high-density, isotropic porosity distributions but is often inapplicable to cases with sparser, boundary-dependent pore distributions. Our work bridges this gap by providing a method that accommodates these constraints by deconstructing the generation problem into its constituent parts. A framework is introduced that combines Generative Adversarial Networks with Mallat Scattering Transform-based autocorrelation methods to construct novel realizations of the individual pore geometries and surface roughness, then stochastically reconstructs them to form realizations of a porous printed part. The generated parts are compared to the existing experimental porosity distributions based on statistical and dimensional metrics, such as nearest neighbor distances, pore volumes, pore anisotropies and scattering transform based auto-correlations.  ( 2 min )
    Bridging Model-based Safety and Model-free Reinforcement Learning through System Identification of Low Dimensional Linear Models. (arXiv:2205.05787v1 [cs.RO])
    Bridging model-based safety and model-free reinforcement learning (RL) for dynamic robots is appealing since model-based methods are able to provide formal safety guarantees, while RL-based methods are able to exploit the robot agility by learning from the full-order system dynamics. However, current approaches to tackle this problem are mostly restricted to simple systems. In this paper, we propose a new method to combine model-based safety with model-free reinforcement learning by explicitly finding a low-dimensional model of the system controlled by a RL policy and applying stability and safety guarantees on that simple model. We use a complex bipedal robot Cassie, which is a high dimensional nonlinear system with hybrid dynamics and underactuation, and its RL-based walking controller as an example. We show that a low-dimensional dynamical model is sufficient to capture the dynamics of the closed-loop system. We demonstrate that this model is linear, asymptotically stable, and is decoupled across control input in all dimensions. We further exemplify that such linearity exists even when using different RL control policies. Such results point to an interesting direction for understanding the relationship between RL and optimal control: whether RL tends to linearize the nonlinear system during training in some cases. Furthermore, we illustrate that the found linear model is able to provide guarantees via a safety-critical optimal control framework, e.g., Model Predictive Control with Control Barrier Functions, on an example of autonomous navigation using Cassie, while taking advantage of the agility provided by the RL-based controller.  ( 2 min )
    Distinction Maximization Loss: Efficiently Improving Classification Accuracy, Uncertainty Estimation, and Out-of-Distribution Detection Simply Replacing the Loss and Calibrating. (arXiv:2205.05874v1 [cs.LG])
    Building robust deterministic deep neural networks is still a challenge. On the one hand, some approaches improve out-of-distribution detection at the cost of reducing classification accuracy in some situations. On the other hand, some methods simultaneously increase classification accuracy, out-of-distribution detection, and uncertainty estimation, but reduce inference efficiency, in addition to training the same model many times to tune hyperparameters. In this paper, we propose training deterministic deep neural networks using our DisMax loss, which works as a drop-in replacement for the commonly used SoftMax loss (i.e., the combination of the linear output layer, the SoftMax activation, and the cross-entropy loss). Starting from the IsoMax+ loss, we created novel logits that are based on the distance to all prototypes rather than just the one associated with the correct class. We also propose a novel way to augment images to construct what we call fractional probability regularization. Moreover, we propose a new score to perform out-of-distribution detection and a fast way to calibrate the network after training. Our experiments show that DisMax usually outperforms all current approaches simultaneously in classification accuracy, uncertainty estimation, inference efficiency, and out-of-distribution detection, avoiding hyperparameter tuning and repetitive model training. The code to replace the SoftMax loss with the DisMax loss and reproduce the results in this paper is available at https://github.com/dlmacedo/distinction-maximization-loss.  ( 2 min )
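    The core replacement described above can be pictured as swapping the linear output layer for logits built from distances to all class prototypes; the sketch below shows this shape (initialization and distance scaling are our assumptions):

```python
# Hedged sketch of distance-to-all-prototypes logits (DisMax-style shape).
import torch
import torch.nn as nn

class DistanceLogits(nn.Module):
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.prototypes = nn.Parameter(0.01 * torch.randn(num_classes, feat_dim))

    def forward(self, features):                    # (N, D) feature batch
        d = torch.cdist(features, self.prototypes)  # (N, C) distances
        return -d                                   # closer prototype -> larger logit
```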
    Deep Learning and Synthetic Media. (arXiv:2205.05764v1 [cs.LG])
    Deep learning algorithms are rapidly changing the way in which audiovisual media can be produced. Synthetic audiovisual media generated with deep learning - often subsumed colloquially under the label "deepfakes" - have a number of impressive characteristics; they are increasingly trivial to produce, and can be indistinguishable from real sounds and images recorded with a sensor. Much attention has been dedicated to ethical concerns raised by this technological development. Here, I focus instead on a set of issues related to the notion of synthetic audiovisual media, its place within a broader taxonomy of audiovisual media, and how deep learning techniques differ from more traditional approaches to media synthesis. After reviewing important etiological features of deep learning pipelines for media manipulation and generation, I argue that "deepfakes" and related synthetic media produced with such pipelines do not merely offer incremental improvements over previous methods, but challenge traditional taxonomical distinctions, and pave the way for genuinely novel kinds of audiovisual media.  ( 2 min )
    Representation Learning for Context-Dependent Decision-Making. (arXiv:2205.05820v1 [cs.LG])
    Humans are capable of adjusting to changing environments flexibly and quickly. Empirical evidence has revealed that representation learning plays a crucial role in endowing humans with such a capability. Inspired by this observation, we study representation learning in the sequential decision-making scenario with contextual changes. We propose an online algorithm that is able to learn and transfer context-dependent representations and show that it significantly outperforms the existing ones that do not learn representations adaptively. As a case study, we apply our algorithm to the Wisconsin Card Sorting Task, a well-established test for the mental flexibility of humans in sequential decision-making. By comparing our algorithm with the standard Q-learning and Deep-Q learning algorithms, we demonstrate the benefits of adaptive representation learning.  ( 2 min )
    A Deep Learning Approach for Predicting Two-dimensional Soil Consolidation Using Physics-Informed Neural Networks (PINN). (arXiv:2205.05710v1 [cs.CE])
    Soil consolidation is closely related to seepage, stability, and settlement of geotechnical buildings and foundations, and directly affects the use and safety of superstructures. Nowadays, the unidirectional consolidation theory of soils is widely used under certain conditions and in approximate calculations. The multi-directional theory of soil consolidation is more reasonable than the unidirectional theory in practical applications, but it is much more complicated in terms of index determination and solution. To address the above problem, in this paper, we propose a deep learning method using physics-informed neural networks (PINN) to predict the excess pore water pressure of two-dimensional soil consolidation. In the proposed method, (1) a fully connected neural network is constructed, (2) the computational domain, partial differential equation (PDE), and constraints are defined to generate data for model training, and (3) the two-dimensional consolidation PDE is connected to the neural network so that the PDE residual contributes to the loss that training reduces. The effectiveness of the proposed method is verified by comparison with the numerical solution of the PDE for two-dimensional consolidation. Using this method, the excess pore water pressure could be predicted simply and efficiently. In addition, the method was applied to predict the soil excess pore water pressure in the foundation in a real case at Tianjin port, China. The proposed deep learning approach can be used to investigate the large and complex multi-directional soil consolidation.  ( 2 min )
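    A minimal PINN sketch for this setting, taking the two-dimensional consolidation equation in the form u_t = c_v (u_xx + u_zz) for the excess pore water pressure u with a constant coefficient c_v; the network size, collocation sampling, and omitted boundary/initial terms are simplifying assumptions.

```python
# Hedged PINN sketch: drive the consolidation PDE residual to zero.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))

def pde_residual(xzt, c_v=1.0):
    xzt = xzt.requires_grad_(True)              # columns: x, z, t
    u = net(xzt)
    g = torch.autograd.grad(u.sum(), xzt, create_graph=True)[0]
    u_x, u_z, u_t = g[:, 0], g[:, 1], g[:, 2]
    u_xx = torch.autograd.grad(u_x.sum(), xzt, create_graph=True)[0][:, 0]
    u_zz = torch.autograd.grad(u_z.sum(), xzt, create_graph=True)[0][:, 1]
    return u_t - c_v * (u_xx + u_zz)

# PDE term of the loss at random collocation points (BC/IC terms omitted):
loss = (pde_residual(torch.rand(256, 3)) ** 2).mean()
```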
    A Survey of Risk-Aware Multi-Armed Bandits. (arXiv:2205.05843v1 [stat.ML])
    In several applications such as clinical trials and financial portfolio optimization, the expected value (or the average reward) does not satisfactorily capture the merits of a drug or a portfolio. In such applications, risk plays a crucial role, and a risk-aware performance measure is preferable, so as to capture losses in the case of adverse events. This survey aims to consolidate and summarise the existing research on risk measures, specifically in the context of multi-armed bandits. We review various risk measures of interest, and comment on their properties. Next, we review existing concentration inequalities for various risk measures. Then we proceed to define risk-aware bandit problems. We consider algorithms for the regret minimization setting, where the exploration-exploitation trade-off manifests, as well as the best-arm identification setting, which is a pure exploration problem -- both in the context of risk-sensitive measures. We conclude by commenting on persisting challenges and fertile areas for future research.  ( 2 min )
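    As a concrete example of one widely used risk measure from this survey, the empirical Conditional Value-at-Risk (CVaR) at level alpha of a reward sample is the mean of its worst alpha-fraction (conventions vary when CVaR is defined on losses rather than rewards):

```python
import numpy as np

def empirical_cvar(rewards, alpha=0.05):
    """Mean of the worst alpha-fraction of observed rewards."""
    r = np.sort(np.asarray(rewards, dtype=float))  # ascending: worst first
    k = max(1, int(np.ceil(alpha * len(r))))
    return r[:k].mean()
```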
    RITA: a Study on Scaling Up Generative Protein Sequence Models. (arXiv:2205.05789v1 [q-bio.QM])
    In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.  ( 2 min )
    Visualization Guidelines for Model Performance Communication Between Data Scientists and Subject Matter Experts. (arXiv:2205.05749v1 [cs.HC])
    Presenting the complexities of a model's performance is a communication bottleneck that threatens collaborations between data scientists and subject matter experts. Accuracy and error metrics alone fail to tell the whole story of a model - its risks, strengths, and limitations - making it difficult for subject matter experts to feel confident in deciding to use a model. As a result, models may fail in unexpected ways if their weaknesses are not clearly understood. Alternatively, models may go unused, as subject matter experts disregard poorly presented models in favor of familiar, yet arguably substandard methods. In this paper, we propose effective use of visualization as a medium for communication between data scientists and subject matter experts. Our research addresses the gap between common practices in model performance communication and the understanding of subject matter experts and decision makers. We derive a set of communication guidelines and recommended visualizations for communicating model performance based on interviews of both data scientists and subject matter experts at the same organization. We conduct a follow-up study with subject matter experts to evaluate the efficacy of our guidelines in presentations of model performance with and without our recommendations. We find that our proposed guidelines made subject matter experts more aware of the tradeoffs of the presented model. Participants realized that current communication methods left them without a robust understanding of the model's performance, potentially giving them misplaced confidence in the use of the model.  ( 2 min )
    Learning to Guide Multiple Heterogeneous Actors from a Single Human Demonstration via Automatic Curriculum Learning in StarCraft II. (arXiv:2205.05784v1 [cs.LG])
    Traditionally, learning from human demonstrations via direct behavior cloning can lead to high-performance policies given that the algorithm has access to large amounts of high-quality data covering the most likely scenarios to be encountered when the agent is operating. However, in real-world scenarios, expert data is limited, and it is desirable to train an agent that learns a behavior policy general enough to handle situations that were not demonstrated by the human expert. Another alternative is to learn these policies with no supervision via deep reinforcement learning; however, these algorithms require a large amount of computing time to perform well on complex tasks with high-dimensional state and action spaces, such as those found in StarCraft II. Automatic curriculum learning is a recent mechanism comprising techniques designed to speed up deep reinforcement learning by adjusting the difficulty of the current task to be solved according to the agent's current capabilities. Designing a proper curriculum, however, can be challenging for sufficiently complex tasks, and thus we leverage human demonstrations as a way to guide agent exploration during training. In this work, we aim to train deep reinforcement learning agents that can command multiple heterogeneous actors, where starting positions and overall difficulty of the task are controlled by an automatically-generated curriculum from a single human demonstration. Our results show that an agent trained via automated curriculum learning can outperform state-of-the-art deep reinforcement learning baselines and match the performance of the human expert in a simulated command and control task in StarCraft II modeled over a real military scenario.  ( 2 min )
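    A toy sketch of the automatic-curriculum idea described above: scale task difficulty with the agent's recent success rate. The thresholds and step size are arbitrary illustrative choices, not the paper's schedule.

        def next_difficulty(difficulty, success_rate, step=0.1):
            # Raise difficulty when the agent is comfortable, lower it when it
            # struggles, otherwise keep the current task.
            if success_rate > 0.8:
                return min(1.0, difficulty + step)
            if success_rate < 0.3:
                return max(0.0, difficulty - step)
            return difficulty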
    Stochastic first-order methods for average-reward Markov decision processes. (arXiv:2205.05800v1 [cs.LG])
    We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies, e.g., deterministic policies, for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish a linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, compared with the intensive research interest in finite-sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward variant of the stochastic policy mirror descent (SPMD) method (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for solving AMDPs with a policy gradient method under both the generative model (with unichain assumption) and the Markovian noise model (with ergodic assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(\epsilon^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.  ( 2 min )
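    For orientation, a minimal sketch of vanilla average-reward (differential) TD(0) with linear function approximation -- the textbook baseline that variance-reduced methods like the paper's VRTD improve on; step sizes are illustrative:

        import numpy as np

        def avg_reward_td_step(theta, rho, phi_s, phi_s_next, r, alpha=0.01, beta=0.01):
            # Differential TD error: the running average-reward estimate rho
            # plays the role that discounting plays in discounted MDPs.
            delta = r - rho + phi_s_next @ theta - phi_s @ theta
            theta = theta + alpha * delta * phi_s   # value-weight update
            rho = rho + beta * delta                # average-reward tracking
            return theta, rho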
    Tiny Robot Learning: Challenges and Directions for Machine Learning in Resource-Constrained Robots. (arXiv:2205.05748v1 [cs.LG])
    Machine learning (ML) has become a pervasive tool across computing systems. An emerging application that stress-tests the challenges of ML system design is tiny robot learning, the deployment of ML on resource-constrained low-cost autonomous robots. Tiny robot learning lies at the intersection of embedded systems, robotics, and ML, compounding the challenges of these domains. Tiny robot learning is subject to challenges from size, weight, area, and power (SWAP) constraints; sensor, actuator, and compute hardware limitations; end-to-end system tradeoffs; and a large diversity of possible deployment scenarios. Tiny robot learning requires ML models to be designed with these challenges in mind, providing a crucible that reveals the necessity of holistic ML system design and automated end-to-end design tools for agile development. This paper gives a brief survey of the tiny robot learning space, elaborates on key challenges, and proposes promising opportunities for future work in ML system design.  ( 2 min )
    Individual Fairness Guarantees for Neural Networks. (arXiv:2205.05763v1 [cs.LG])
    We consider the problem of certifying the individual fairness (IF) of feed-forward neural networks (NNs). In particular, we work with the $\epsilon$-$\delta$-IF formulation, which, given a NN and a similarity metric learnt from data, requires that the output difference between any pair of $\epsilon$-similar individuals is bounded by a maximum decision tolerance $\delta \geq 0$. Working with a range of metrics, including the Mahalanobis distance, we propose a method to overapproximate the resulting optimisation problem using piecewise-linear functions to lower and upper bound the NN's non-linearities globally over the input space. We encode this computation as the solution of a Mixed-Integer Linear Programming problem and demonstrate that it can be used to compute IF guarantees on four datasets widely used for fairness benchmarking. We show how this formulation can be used to encourage models' fairness at training time by modifying the NN loss, and empirically confirm our approach yields NNs that are orders of magnitude fairer than state-of-the-art methods.  ( 2 min )
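    To make the $\epsilon$-$\delta$-IF condition concrete, here is a heuristic falsification check by random sampling under a Mahalanobis metric. It can only find violations, not certify their absence -- the paper's MILP formulation is what provides the actual guarantee.

        import numpy as np

        def find_if_violation(f, x, S, eps, delta, trials=1000, seed=0):
            # f: scalar network output; S: covariance matrix defining the metric.
            rng = np.random.default_rng(seed)
            S_inv = np.linalg.inv(S)
            for _ in range(trials):
                d = rng.normal(size=x.shape)
                d = d * (eps / np.sqrt(d @ S_inv @ d))  # scale onto the eps-sphere
                if abs(f(x + d) - f(x)) > delta:
                    return x + d                        # counterexample found
            return None                                 # nothing found (not a proof)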
    LSI: A Learned Secondary Index Structure. (arXiv:2205.05769v1 [cs.DB])
    Learned index structures have been shown to achieve favorable lookup performance and space consumption compared to their traditional counterparts such as B-trees. However, most learned index studies have focused on the primary indexing setting, where the base data is sorted. In this work, we investigate whether learned indexes sustain their advantage in the secondary indexing setting. We introduce the Learned Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted data. LSI works by building a learned index over a permutation vector, which allows binary search to be performed on the unsorted base data using random access. We additionally augment LSI with a fingerprint vector to accelerate equality lookups. We show that LSI achieves comparable lookup performance to state-of-the-art secondary indexes while being up to 6x more space efficient.  ( 2 min )
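    A toy sketch of the core mechanism: searching unsorted data through a permutation vector. In LSI proper, a learned model over this vector narrows the search range, and per-row fingerprints avoid most base-data accesses on equality lookups; both are omitted here.

        import numpy as np

        data = np.array([42, 7, 19, 3, 88])   # unsorted base column
        perm = np.argsort(data)               # permutation vector: rank -> row id

        def lookup(key):
            # Binary search in logical sorted order; each probe dereferences
            # through perm into the unsorted base data via random access.
            lo, hi = 0, len(perm) - 1
            while lo <= hi:
                mid = (lo + hi) // 2
                val = data[perm[mid]]
                if val == key:
                    return perm[mid]          # row id in the base data
                if val < key:
                    lo = mid + 1
                else:
                    hi = mid - 1
            return None

        print(lookup(19))  # -> 2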
    Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks. (arXiv:2205.05718v1 [cs.CL])
    Human language offers a powerful window into our thoughts -- we tell stories, give explanations, and express our beliefs and goals through words. Abundant evidence also suggests that language plays a developmental role in structuring our learning. Here, we ask: how much of human-like thinking can be captured by learning statistical patterns in language alone? We first contribute a new challenge benchmark for comparing humans and distributional large language models (LLMs). Our benchmark contains two problem-solving domains (planning and explanation generation) and is designed to require generalization to new, out-of-distribution problems expressed in language. We find that humans are far more robust than LLMs on this benchmark. Next, we propose a hybrid Parse-and-Solve model, which augments distributional LLMs with a structured symbolic reasoning module. We find that this model shows more robust adaptation to out-of-distribution planning problems, demonstrating the promise of hybrid AI models for more human-like reasoning.  ( 2 min )
    A time-varying study of Chinese investor sentiment, stock market liquidity and volatility: Based on deep learning BERT model and TVP-VAR model. (arXiv:2205.05719v1 [q-fin.CP])
    Based on commentary data from the Shenzhen Stock Index bar on the EastMoney website from January 1, 2018 to December 31, 2019, this paper extracts the embedded investor sentiment using a deep learning BERT model and investigates the time-varying linkage between investor sentiment, stock market liquidity, and volatility using a TVP-VAR model. The results show that the impact of investor sentiment on stock market liquidity and volatility is stronger than the reverse effect; although the reverse effect is relatively small, it varies with the state of the stock market. In all cases, the response is more pronounced in the short term than in the medium to long term, and the impact is asymmetric, with shocks stronger when the market is in a downward spiral.  ( 2 min )
  • Open

    Causal discovery under a confounder blanket. (arXiv:2205.05715v1 [stat.ME])
    Inferring causal relationships from observational data is rarely straightforward, but the problem is especially difficult in high dimensions. For these applications, causal discovery algorithms typically require parametric restrictions or extreme sparsity constraints. We relax these assumptions and focus on an important but more specialized problem, namely recovering a directed acyclic subgraph of variables known to be causally descended from some (possibly large) set of confounding covariates, i.e. a $\textit{confounder blanket}$. This is useful in many settings, for example when studying a dynamic biomolecular subsystem with genetic data providing causally relevant background information. Under a structural assumption that, we argue, must be satisfied in practice if informative answers are to be found, our method accommodates graphs of low or high sparsity while maintaining polynomial time complexity. We derive a sound and complete algorithm for identifying causal relationships under these conditions and implement testing procedures with provable error control for linear and nonlinear systems. We demonstrate our approach on a range of simulation settings.  ( 2 min )
    Probabilistic methods for approximate archetypal analysis. (arXiv:2108.05767v3 [stat.CO] UPDATED)
    Archetypal analysis is an unsupervised learning method for exploratory data analysis. One major challenge that limits the applicability of archetypal analysis in practice is the inherent computational complexity of the existing algorithms. In this paper, we provide a novel approximation approach to partially address this issue. Utilizing probabilistic ideas from high-dimensional geometry, we introduce two preprocessing techniques to reduce the dimension and representation cardinality of the data, respectively. We prove that provided the data is approximately embedded in a low-dimensional linear subspace and the convex hull of the corresponding representations is well approximated by a polytope with a few vertices, our method can effectively reduce the scaling of archetypal analysis. Moreover, the solution of the reduced problem is near-optimal in terms of prediction errors. Our approach can be combined with other acceleration techniques to further mitigate the intrinsic complexity of archetypal analysis. We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets.  ( 2 min )
    Orthogonal Gromov-Wasserstein Discrepancy with Efficient Lower Bound. (arXiv:2205.05838v1 [cs.LG])
    Comparing structured data from possibly different metric-measure spaces is a fundamental task in machine learning, with applications in, e.g., graph classification. The Gromov-Wasserstein (GW) discrepancy formulates a coupling between the structured data based on optimal transportation, tackling the incomparability between different structures by aligning the intra-relational geometries. Although efficient local solvers such as conditional gradient and Sinkhorn are available, the inherent non-convexity still prevents a tractable evaluation, and the existing lower bounds are not tight enough for practical use. To address this issue, we take inspiration from the connection with the quadratic assignment problem, and propose the orthogonal Gromov-Wasserstein (OGW) discrepancy as a surrogate of GW. It admits an efficient and closed-form lower bound with the complexity of $\mathcal{O}(n^3)$, and directly extends to the fused Gromov-Wasserstein (FGW) distance, incorporating node features into the coupling. Extensive experiments on both the synthetic and real-world datasets show the tightness of our lower bounds, and both OGW and its lower bounds efficiently deliver accurate predictions and satisfactory barycenters for graph sets.  ( 2 min )
    Adversarial Estimators. (arXiv:2204.10495v2 [econ.EM] UPDATED)
    We develop an asymptotic theory of adversarial estimators ('A-estimators'). They generalize maximum-likelihood-type estimators ('M-estimators') as their objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.  ( 2 min )
    Comments on: "Hybrid Semiparametric Bayesian Networks". (arXiv:2205.05910v1 [stat.ME])
    Invited discussion on the paper "Hybrid Semiparametric Bayesian Networks" by David Atienza, Pedro Larranaga and Concha Bielza (TEST, 2022).  ( 2 min )
    Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements. (arXiv:2205.06129v1 [stat.ML])
    Prediction of an individual's race and ethnicity plays an important role in social science and public health research. Examples include studies of racial disparity in health and voting. Recently, Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to combine information from Census surname files with the geocoding of an individual's residence, has emerged as a leading methodology for this prediction task. Unfortunately, BISG suffers from two Census data problems that contribute to unsatisfactory predictive performance for minorities. First, the decennial Census often contains zero counts for minority racial groups in the Census blocks where some members of those groups reside. Second, because the Census surname files only include frequent names, many surnames -- especially those of minorities -- are missing from the list. To address the zero counts problem, we introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts by extending the na\"ive Bayesian inference of the BISG methodology to full posterior inference. To address the missing surname problem, we supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available. Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians. The proposed methodology, together with additional name data, is available via the open-source software package wru.  ( 2 min )
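    For intuition, a toy version of the naive BISG combination that fBISG extends to full posterior inference; every probability below is a made-up illustrative number, not Census data.

        import numpy as np

        races = ["white", "black", "hispanic", "asian", "other"]
        # P(race | surname), as read from a Census surname file.
        p_race_given_surname = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
        # P(block | race), derived from Census block counts.
        p_block_given_race = np.array([0.02, 0.10, 0.05, 0.20, 0.04])

        # Bayes' rule: posterior over race given surname and residence block.
        posterior = p_race_given_surname * p_block_given_race
        posterior /= posterior.sum()
        print(dict(zip(races, posterior.round(3))))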
    Optimal transport weights for causal inference. (arXiv:2109.01991v4 [stat.ME] UPDATED)
    Imbalance in covariate distributions leads to biased estimates of causal effects. Weighting methods attempt to correct this imbalance but rely on specifying models for the treatment assignment mechanism, which is unknown in observational studies. This leaves researchers to choose the proper weighting method and the appropriate covariate functions for these models without knowing the correct combination to achieve distributional balance. In response to these difficulties, we propose a nonparametric generalization of several other weighting schemes found in the literature: Causal Optimal Transport. This new method directly targets distributional balance by minimizing optimal transport distances between treatment and control groups or, more generally, between any source and target population. Our approach is semiparametrically efficient and model-free but can also incorporate moments or any other important functions of covariates that a researcher desires to balance. Moreover, our method can provide nonparametric estimates of the conditional mean outcome function, and hence nonparametric imputations of the missing potential outcomes, and we give rates of convergence for these estimators. We find that Causal Optimal Transport outperforms competitor methods when both the propensity score and outcome models are misspecified, indicating it is a robust alternative to common weighting methods. Finally, we demonstrate the utility of our method in an external control trial examining the effect of misoprostol versus oxytocin for the treatment of post-partum hemorrhage.  ( 2 min )
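    An illustrative sketch of the imputation idea (not the paper's full estimator): an optimal transport coupling on covariates, computed here with the POT library on synthetic data, imputes each treated unit's missing control outcome by barycentric projection.

        import numpy as np
        import ot  # the POT (Python Optimal Transport) library

        rng = np.random.default_rng(0)
        Xc, Yc = rng.normal(size=(100, 5)), rng.normal(size=100)  # control covariates, outcomes
        Xt = rng.normal(size=(50, 5)) + 0.3                       # treated covariates (shifted)

        a, b = np.full(100, 1 / 100), np.full(50, 1 / 50)         # uniform marginals
        G = ot.emd(a, b, ot.dist(Xc, Xt))                         # optimal coupling on covariates
        Y0_hat = (G.T @ Yc) / b                                   # imputed Y(0) per treated unit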
    How I failed machine learning in medical imaging -- shortcomings and recommendations. (arXiv:2103.10292v2 [eess.IV] UPDATED)
    Medical imaging is an important research field with many opportunities for improving patients' health. However, there are a number of challenges that are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we review several problems related to choosing datasets, methods, evaluation metrics, and publication strategies. With a review of the literature and our own analysis, we show that at every step, potential biases can creep in. On a positive note, we also see that initiatives to counteract these problems are already being started. Finally, we provide a broad range of recommendations on how to further address these problems in the future. For reproducibility, data and code for our analyses are available on \url{https://github.com/GaelVaroquaux/ml_med_imaging_failures}  ( 2 min )
    Generating Fair Universal Representations using Adversarial Models. (arXiv:1910.00411v7 [cs.LG] UPDATED)
    We present a data-driven framework for learning fair universal representations (FUR) that guarantee statistical fairness for any learning task that may not be known a priori. Our framework leverages recent advances in adversarial learning to allow a data holder to learn representations in which a set of sensitive attributes are decoupled from the rest of the dataset. We formulate this as a constrained minimax game between an encoder and an adversary where the constraint ensures a measure of usefulness (utility) of the representation. The resulting problem is that of censoring, i.e., finding a representation that is least informative about the sensitive attributes given a utility constraint. For appropriately chosen adversarial loss functions, our censoring framework precisely clarifies the optimal adversarial strategy against strong information-theoretic adversaries; it also achieves the fairness measure of demographic parity for the resulting constrained representations. We evaluate the performance of our proposed framework on both synthetic and publicly available datasets. For these datasets, we use two tradeoff measures: censoring vs. representation fidelity and fairness vs. utility for downstream tasks, to amply demonstrate that multiple sensitive features can be effectively censored even as the resulting fair representations ensure accuracy for multiple downstream tasks.  ( 2 min )
    An $l_1$-oracle inequality for the Lasso in mixture-of-experts regression models. (arXiv:2009.10622v3 [math.ST] UPDATED)
    Mixture-of-experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.  ( 2 min )
    Low-variance estimation in the Plackett-Luce model via quasi-Monte Carlo sampling. (arXiv:2205.06024v1 [stat.ML])
    The Plackett-Luce (PL) model is ubiquitous in learning-to-rank (LTR) because it provides a useful and intuitive probabilistic model for sampling ranked lists. Counterfactual offline evaluation and optimization of ranking metrics are pivotal for using LTR methods in production. When adopting the PL model as a ranking policy, both tasks require the computation of expectations with respect to the model. These are usually approximated via Monte-Carlo (MC) sampling, since the combinatorial scaling in the number of items to be ranked makes their analytical computation intractable. Despite recent advances in improving the computational efficiency of the sampling process via the Gumbel top-k trick, the MC estimates can suffer from high variance. We develop a novel approach to producing more sample-efficient estimators of expectations in the PL model by combining the Gumbel top-k trick with quasi-Monte Carlo (QMC) sampling, a well-established technique for variance reduction. We illustrate our findings both theoretically and empirically using real-world recommendation data from Amazon Music and the Yahoo learning-to-rank challenge.  ( 2 min )
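    For reference, a minimal plain Monte-Carlo sketch of the Gumbel trick for sampling a full ranking from a PL model; the paper's contribution is, roughly, to replace the i.i.d. uniform draws below with a quasi-Monte Carlo sequence to reduce variance.

        import numpy as np

        def sample_pl_ranking(log_weights, rng=np.random.default_rng(0)):
            # Perturb each item's log-weight with standard Gumbel noise and sort:
            # the resulting order is an exact Plackett-Luce sample.
            u = rng.uniform(size=len(log_weights))
            gumbels = -np.log(-np.log(u))
            return np.argsort(-(log_weights + gumbels))

        print(sample_pl_ranking(np.log([0.5, 0.3, 0.2])))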
    Training Uncertainty-Aware Classifiers with Conformalized Deep Learning. (arXiv:2205.05878v1 [stat.ML])
    Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. We address this problem by developing a novel training algorithm that can lead to more dependable uncertainty estimates, without sacrificing predictive power. The idea is to mitigate overconfidence by minimizing a loss function, inspired by advances in conformal inference, that quantifies model uncertainty by carefully leveraging hold-out data. Experiments with synthetic and real data demonstrate this method leads to smaller conformal prediction sets with higher conditional coverage, after exact calibration with hold-out data, compared to state-of-the-art alternatives.  ( 2 min )
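    For background, a minimal sketch of the standard split-conformal calibration step that this line of work builds on (the generic construction, not the paper's training algorithm); np.quantile's method= argument assumes NumPy >= 1.22.

        import numpy as np

        def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
            # Nonconformity score: 1 - probability assigned to the true class.
            scores = 1 - cal_probs[np.arange(len(cal_labels)), cal_labels]
            n = len(scores)
            q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
            # A label enters the prediction set when its probability is large enough.
            return test_probs >= 1 - q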
    An MMSE Lower Bound via Poincar\'e Inequality. (arXiv:2205.05848v1 [cs.IT])
    This paper studies the minimum mean squared error (MMSE) of estimating $\mathbf{X} \in \mathbb{R}^d$ from the noisy observation $\mathbf{Y} \in \mathbb{R}^k$, under the assumption that the noise (i.e., $\mathbf{Y}|\mathbf{X}$) is a member of the exponential family. The paper provides a new lower bound on the MMSE. Towards this end, an alternative representation of the MMSE is first presented, which is argued to be useful in deriving closed-form expressions for the MMSE. This new representation is then used together with the Poincar\'e inequality to provide a new lower bound on the MMSE. Unlike, for example, the Cram\'{e}r-Rao bound, the new bound holds for all possible distributions on the input $\mathbf{X}$. Moreover, the lower bound is shown to be tight in the high-noise regime for the Gaussian noise setting under the assumption that $\mathbf{X}$ is sub-Gaussian. Finally, several numerical examples are shown which demonstrate that the bound performs well in all noise regimes.  ( 2 min )
    Sample Complexity Bounds for Robustly Learning Decision Lists against Evasion Attacks. (arXiv:2205.06127v1 [cs.LG])
    A fundamental problem in adversarial machine learning is to quantify how much training data is needed in the presence of evasion attacks. In this paper we address this issue within the framework of PAC learning, focusing on the class of decision lists. Given that distributional assumptions are essential in the adversarial setting, we work with probability distributions on the input data that satisfy a Lipschitz condition: nearby points have similar probability. Our key results illustrate that the adversary's budget (that is, the number of bits it can perturb on each input) is a fundamental quantity in determining the sample complexity of robust learning. Our first main result is a sample-complexity lower bound: the class of monotone conjunctions (essentially the simplest non-trivial hypothesis class on the Boolean hypercube) and any superclass has sample complexity at least exponential in the adversary's budget. Our second main result is a corresponding upper bound: for every fixed $k$ the class of $k$-decision lists has polynomial sample complexity against a $\log(n)$-bounded adversary. This sheds further light on the question of whether an efficient PAC learning algorithm can always be used as an efficient $\log(n)$-robust learning algorithm under the uniform distribution.  ( 2 min )
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v1 [cs.LG])
    Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network. This initiated research on non-contrastive self-supervised learning. It is mysterious why, even when trivial collapsed global optimal solutions exist, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterize the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. The acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.  ( 2 min )
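    A minimal sketch of the identity-initialized prediction head described above, realized as a fixed identity plus a trainable matrix whose diagonal is masked out -- one simple way to keep only the off-diagonal entries trainable; dimensions are arbitrary.

        import torch
        import torch.nn as nn

        class OffDiagonalHead(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.W = nn.Parameter(torch.zeros(dim, dim))       # trainable, zero-initialized
                self.register_buffer("mask", 1 - torch.eye(dim))   # kills the diagonal

            def forward(self, x):
                # Identity path plus masked (off-diagonal only) linear path.
                return x + x @ (self.W * self.mask).T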
    Learning more skills through optimistic exploration. (arXiv:2107.14226v6 [cs.LG] UPDATED)
    Unsupervised skill learning objectives (Gregor et al., 2016, Eysenbach et al., 2018) allow agents to learn rich repertoires of behavior in the absence of extrinsic rewards. They work by simultaneously training a policy to produce distinguishable latent-conditioned trajectories, and a discriminator to evaluate distinguishability by trying to infer latents from trajectories. The hope is for the agent to explore and master the environment by encouraging each skill (latent) to reliably reach different states. However, an inherent exploration problem lingers: when a novel state is actually encountered, the discriminator will necessarily not have seen enough training data to produce accurate and confident skill classifications, leading to low intrinsic reward for the agent and effective penalization of the sort of exploration needed to actually maximize the objective. To combat this inherent pessimism towards exploration, we derive an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement. Our objective directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples, thus providing an intrinsic reward more tailored to the true objective compared to pseudocount-based methods (Burda et al., 2019). We call this exploration bonus discriminator disagreement intrinsic reward, or DISDAIN. We demonstrate empirically that DISDAIN improves skill learning both in a tabular grid world (Four Rooms) and the 57 games of the Atari Suite (from pixels). Thus, we encourage researchers to treat pessimism with DISDAIN.  ( 2 min )
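    A toy sketch of a DISDAIN-style bonus: the entropy of the ensemble-averaged skill posterior minus the average per-member entropy, a standard estimator of epistemic disagreement; the array shapes are illustrative.

        import numpy as np

        def disagreement_bonus(member_probs):
            # member_probs: (E, K) array, one skill posterior per ensemble member.
            entropy = lambda p: -(p * np.log(p + 1e-12)).sum(axis=-1)
            mean_posterior = member_probs.mean(axis=0)
            # Jensen gap: zero when members agree, large when they disagree.
            return entropy(mean_posterior) - entropy(member_probs).mean()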
    The Implicit Bias of Benign Overfitting. (arXiv:2201.11489v3 [cs.LG] UPDATED)
    The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining low expected loss, has received much attention in recent years, but still remains not fully understood beyond well-specified linear regression setups. In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and classification tasks. We consider a prototypical and rather generic data model for benign overfitting of linear predictors, where an arbitrary input distribution of some fixed dimension $k$ is concatenated with a high-dimensional distribution. For linear regression which is not necessarily well-specified, we show that the minimum-norm interpolating predictor (that standard training methods converge to) is biased towards an inconsistent solution in general, hence benign overfitting will generally not occur. Moreover, we show how this can be extended beyond standard linear regression, by an argument proving how the existence of benign overfitting on some regression problems precludes its existence on other regression problems. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we prove that the max-margin predictor (to which standard training methods are known to converge in direction) is asymptotically biased towards minimizing a weighted squared hinge loss. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the misclassification error, and use it to show benign overfitting in some new settings.  ( 2 min )
    A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models. (arXiv:2104.02640v2 [math.ST] UPDATED)
    Mixture of experts (MoE) are a popular class of statistical and machine learning models that have gained attention over the years due to their flexibility and efficiency. In this work, we consider Gaussian-gated localized MoE (GLoME) and block-diagonal covariance localized MoE (BLoME) regression models to present nonlinear relationships in heterogeneous data with potential hidden graph-structured interactions between high-dimensional predictors. These models pose difficult statistical estimation and model selection questions, both from a computational and theoretical perspective. This paper is devoted to the study of the problem of model selection among a collection of GLoME or BLoME models characterized by the number of mixture components, the complexity of Gaussian mean experts, and the hidden block-diagonal structures of the covariance matrices, in a penalized maximum likelihood estimation framework. In particular, we establish non-asymptotic risk bounds that take the form of weak oracle inequalities, provided that lower bounds for the penalties hold. The good empirical behavior of our models is then demonstrated on synthetic and real datasets.  ( 2 min )
    Long Story Short: Omitted Variable Bias in Causal Machine Learning. (arXiv:2112.13398v3 [econ.EM] UPDATED)
    We derive general, yet simple, sharp bounds on the size of the omitted variable bias for a broad class of causal parameters that can be identified as linear functionals of the conditional expectation function of the outcome. Such functionals encompass many of the traditional targets of investigation in causal inference studies, such as, for example, (weighted) averages of potential outcomes, average treatment effects (including subgroup effects, such as the effect on the treated), (weighted) average derivatives, and policy effects from shifts in covariate distribution -- all for general, nonparametric causal models. Our construction relies on the Riesz-Frechet representation of the target functional. Specifically, we show how the bound on the bias depends only on the additional variation that the latent variables create both in the outcome and in the Riesz representer for the parameter of interest. Moreover, in many important cases (e.g., average treatment effects and average derivatives) the bound is shown to depend on easily interpretable quantities that measure the explanatory power of the omitted variables. Therefore, simple plausibility judgments on the maximum explanatory power of omitted variables (in explaining treatment and outcome variation) are sufficient to place overall bounds on the size of the bias. Furthermore, we use debiased machine learning to provide flexible and efficient statistical inference on learnable components of the bounds. Finally, empirical examples demonstrate the usefulness of the approach.  ( 2 min )
    Robustness and Reliability When Training With Noisy Labels. (arXiv:2110.03321v2 [stat.ML] UPDATED)
    Labelling of data for supervised learning can be costly and time-consuming and the risk of incorporating label noise in large data sets is imminent. When training a flexible discriminative model using a strictly proper loss, such noise will inevitably shift the solution towards the conditional distribution over noisy labels. Nevertheless, while deep neural networks have proven capable of fitting random labels, regularisation and the use of robust loss functions empirically mitigate the effects of label noise. However, such observations concern robustness in accuracy, which is insufficient if reliable uncertainty quantification is critical. We demonstrate this by analysing the properties of the conditional distribution over noisy labels for an input-dependent noise model. In addition, we evaluate the set of robust loss functions characterised by noise-insensitive, asymptotic risk minimisers. We find that strictly proper and robust loss functions both offer asymptotic robustness in accuracy, but neither guarantee that the final model is calibrated. Moreover, even with robust loss functions, overfitting is an issue in practice. With these results, we aim to explain observed robustness of common training practices, such as early stopping, to label noise. In addition, we aim to encourage the development of new noise-robust algorithms that not only preserve accuracy but that also ensure reliability.  ( 2 min )
    Kernel Two-Sample Tests in High Dimension: Interplay Between Moment Discrepancy and Dimension-and-Sample Orders. (arXiv:2201.00073v2 [math.ST] UPDATED)
    Motivated by the increasing use of kernel-based metrics for high-dimensional and large-scale data, we study the asymptotic behavior of kernel two-sample tests when the dimension and sample sizes both diverge to infinity. We focus on the maximum mean discrepancy (MMD) using isotropic kernel, including MMD with the Gaussian kernel and the Laplace kernel, and the energy distance as special cases. We derive asymptotic expansions of the kernel two-sample statistics, based on which we establish the central limit theorem (CLT) under both the null hypothesis and the local and fixed alternatives. The new non-null CLT results allow us to perform asymptotic exact power analysis, which reveals a delicate interplay between the moment discrepancy that can be detected by the kernel two-sample tests and the dimension-and-sample orders. The asymptotic theory is further corroborated through numerical studies.  ( 2 min )
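    For concreteness, the standard unbiased estimator of the squared MMD with a Gaussian kernel -- the statistic whose high-dimensional behavior the paper characterizes; the bandwidth is an illustrative choice.

        import numpy as np

        def mmd2_unbiased(X, Y, sigma=1.0):
            # Gaussian-kernel Gram matrices within and between the two samples.
            def gram(A, B):
                d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                return np.exp(-d2 / (2 * sigma ** 2))
            Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
            n, m = len(X), len(Y)
            np.fill_diagonal(Kxx, 0.0)  # drop self-similarity terms for unbiasedness
            np.fill_diagonal(Kyy, 0.0)
            return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2 * Kxy.mean()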
    Fighting Money Laundering with Statistics and Machine Learning: An Introduction and Review. (arXiv:2201.04207v3 [stat.ML] UPDATED)
    Money laundering is a profound global problem. Nonetheless, there is little statistical and machine learning research on the topic. In this paper, we focus on anti-money laundering in banks. To help organize existing research, we propose a unifying terminology and provide a review of the literature. This is structured around two central tasks: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. Suspicious behavior flagging, on the other hand, is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is a lack of public data sets. This may, potentially, be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability, and fairness of the results.  ( 2 min )

  • Open

    [R] Any reference related to regulating the variation of entropies?
    I need some reference papers related to my problem. I have estimates in the form of N normal distributions, but their variances tend to 0, because the distributions are aggregated into one normal whose variance tends to 0. I want some estimates to be sharp and others to be wide, depending on their confidence, and one way to do that might be to constrain the variance of their entropies. Has anyone seen a related problem? Thanks! submitted by /u/MNhi_  ( 1 min )
    [D] Does anyone actually use TFX (coming off GoogleIO)
    The video in question: An introduction to MLOps with TensorFlow Extended (TFX). From reading around Reddit it seems like most people don't actually use it, but I thought I'd ask around again. It looks like a great tool, but also like a lot of work to set up, and when I last looked at it (2019?) it was still messy. Does anyone here use it? If so, can you share your experience with it and more context about you and your company? submitted by /u/iamquah  ( 1 min )
    [R] Deepmind's Gato: a generalist learning agent
    Hot on the heels of Flamingo, DeepMind has released a report describing a generalist learning agent that works across disparate tasks. Very cool stuff; I was hoping to get some discussion on it. Here is their abstract: Inspired by progress in large-scale language modelling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato. https://www.deepmind.com/publications/a-generalist-agent There are Hacker News comments here: https://news.ycombinator.com/item?id=31355657 submitted by /u/blabboy  ( 2 min )
    [Discussion] Has anyone deployed FAIR's OPT-175B to Azure? Or runs a demo service?
    Depending on pricing, I am looking to stand up a server on Azure that runs OPT-175B for an external web application for personal usage. So far I have not seen any Docker (or equivalent) containers, blogs, walkthroughs, etc., that have done this. Is anyone familiar with this? submitted by /u/thisdavidinspace  ( 1 min )
    [N] Handling very large models on small setups with Hugging Face's Accelerate library
    Last week Meta AI (FAIR) publicly released huge LMs, with up to ☄️30B parameters. Great win for open source 🎉 These checkpoints are now in 🤗 transformers! But how do you use such big checkpoints without using humongous machines? From our tests, there is a hard limit on the largest models one can load into Colab's GPU RAM (in fp16). We're introducing Accelerate's big model inference: load and use up to the 30B model in Colab: 11B in Colab's free version, 30B in Colab's Pro version. This takes advantage of system RAM, GPU RAM, and disk, splitting parameters across devices. While running on Colab takes time (a significant amount of time for the largest models, which are mostly on disk), running with a fast storage device such as an NVMe disk is much, much faster. Here's a Colab notebook if you want to try it out. submitted by /u/jikkii  ( 1 min )
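    A minimal sketch of the kind of loading call involved, assuming the facebook/opt-30b checkpoint name and a transformers install with accelerate available; the exact argument values are illustrative.

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")
        model = AutoModelForCausalLM.from_pretrained(
            "facebook/opt-30b",
            device_map="auto",          # split layers across GPU RAM, CPU RAM, and disk
            torch_dtype=torch.float16,  # halve the memory footprint
            offload_folder="offload",   # spill whatever does not fit to disk
        )
        inputs = tokenizer("Hello, my name is", return_tensors="pt")
        print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))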
    [D] Introduction to Diffusion Models
    Diffusion Models have gained some impressive ground in the past couple of years, including famously overtaking GANs on image synthesis and being used in DALL-E 2. I wrote this introduction to diffusion models for anyone who is interested in learning more! I get into the mathematical details and lay everything out in (what I hope is) a simple way. If anyone has any questions or comments I'd be happy to discuss! submitted by /u/SleekEagle  ( 5 min )
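    For readers who want the one-paragraph version of the math: in the standard DDPM-style formulation (which may differ in details from the linked post), the forward process gradually noises the data, $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$; with $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$ any step can be sampled directly as $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$; and training reduces to the noise-prediction loss $\mathbb{E}\,\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2$.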
    [N] Upcoming talk by Jacob Devlin, creator of the BERT algorithm, on the future of NLP
    My group is hosting a talk by Google research scientist and NLP guru Jacob Devlin on May 19 at 6:30pm ET. Natural language processing (NLP) has seen a revolutionary shift in the last 10 years. What was once an academic curiosity has transformed into a commercially viable tool that is routinely used in finance, healthcare, education, and many other fields. Join us as research scientist and NLP pioneer Jacob Devlin discusses the current state of the industry and future trends that are driving the field. We'll conclude with a Q&A opportunity. Whether you're a seasoned NLP researcher or just getting started, this event is for you! About the speaker: Jacob Devlin is a Staff Research Scientist at Google. At Google, his primary research interest is developing fast, powerful, and scalable deep learning models for information retrieval, question answering, and other language understanding tasks. From 2014 to 2017, he worked as a Principal Research Scientist at Microsoft Research. Mr. Devlin received his Master's in Computer Science from the University of Maryland in 2009. This event will be virtual; all are welcome to attend. Agenda: 6:30 Introductions and networking; 6:45 Presentation; 7:15 Discussion; 7:30 Wrap-up. submitted by /u/what_comes_next  ( 1 min )
    [D] Can We Query a Table with T5?
    In this tutorial we are going to do transfer learning with the text-to-text generation model T5 by Google, using our custom data, so that it can convert basic questions to SQL queries. We will add a new task to T5 called: translate English to SQL. https://www.kdnuggets.com/2022/05/query-table-t5.html submitted by /u/mwitiderrick
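    A minimal sketch of what inference looks like once a T5 checkpoint has been fine-tuned on question-SQL pairs with that task prefix; the base checkpoint below is a stand-in and would not emit SQL without the fine-tuning step the tutorial covers.

        from transformers import T5ForConditionalGeneration, T5Tokenizer

        # Stand-in checkpoint; in practice, load your fine-tuned model here.
        tokenizer = T5Tokenizer.from_pretrained("t5-small")
        model = T5ForConditionalGeneration.from_pretrained("t5-small")

        prompt = "translate English to SQL: how many users signed up in 2021?"
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_length=64)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))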
    Is this true? Not my comment [D]
    There's no "intelligence" in AI, no "learning" happens in ML, and there's nothing "Neural" about NeuralNets. These are all buzzwords hyped up by researchers to get funding and by VCs to make even more money. Brace for another AI winter submitted by /u/MatterEnough9656  ( 3 min )
    Hyperbolic Embeddings: Embeddings for Hierarchical Data [R] [P]
    We know how to create and make use of embeddings, but an underutilised type of embedding is Hyperbolic Embeddings. They achieve much better performance on datasets that have a hierarchical structure, such as: words, social networks, the Internet, knowledge graphs, genomic data, images, financial data, and much more. The reason for this is that hyperbolic spaces have constant negative curvature; this induces very tight connections with tree-like graphs, making them ideal candidates to embed hierarchical data [1]. In the HyperLib library I helped implement the functionality to make use of a type of hyperbolic embedding called Sarkar embeddings. With these embeddings we first create a tree representation of the data, using an algorithm called TreeRep. This algorithm takes the distances between data points and puts them into a hyperbolic tree, where each node is a data point, and tries to maintain the original distances. We then create embeddings from this tree using Sarkar's construction, which is able to create embeddings from a tree with arbitrarily low distortion. We also made a blog post where you can read more about it and find out how to use it: https://medium.com/@nathan_jf/treerep-and-hyperbolic-embeddings-41312c98b264 You can find the library here: https://github.com/nalexai/Hyperlib Also any feedback on usability would be great. [1] Chami, REPRESENTATION LEARNING AND ALGORITHMS IN HYPERBOLIC SPACES submitted by /u/platinumposter  ( 1 min )
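    To make the geometry concrete, here is the distance function of the Poincaré ball model, the hyperbolic space these embeddings typically live in; this illustrates the metric itself and is not HyperLib's API.

        import numpy as np

        def poincare_distance(u, v):
            # Distance between points strictly inside the unit ball; it grows
            # exponentially near the boundary, which is what lets tree-like
            # data embed with low distortion.
            sq = np.sum((u - v) ** 2)
            denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
            return np.arccosh(1 + 2 * sq / denom)

        print(poincare_distance(np.array([0.1, 0.0]), np.array([0.0, 0.9])))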
    [D] Do CNN representations really correlate with the pixel space?
    A lot of research is predicated on the idea that CNNs keep the spatial representation of the input pixels in a final encoded CNN output matrix C (e.g., 7x7x2048) before a GAP operation and linear classifier (i.e., the last CNN layer). So, if they say upsample a "box" area from C (e.g., the bottom right 1x1x2048), it will correspond to the bottom right of the input pixels (e.g., see this and this). I realise the original class activation mapping (CAM) paper showed it tended to locate objects in a picture well, but surely this is a dangerous assumption? If there's lots of max pooling used throughout the layers, these spatial relations will definitely be called into question. Moreover, the 1x1x2048 part I described above would've been calculated using many "areas" outside that box; if you go back to layer 1, I'm sure you could mathematically show that its final representation in C was calculated using nearly all pixels in the image, not just the ones at the bottom right of the input image. I was just thinking about this today, and I'm open to being told I'm wrong, but it seems no CAM-like method is flawless at localising objects in an image (e.g., see this). So is this whole line of research just doomed? Surely you cannot rely on this in a real high-stakes application? Thanks if you have time to leave your opinion. submitted by /u/SkeeringReal  ( 3 min )
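    For reference, the computation under debate is the original class activation mapping; a minimal sketch with illustrative tensor shapes:

        import torch

        def class_activation_map(features, fc_weights, class_idx):
            # features: (C, H, W) output of the last conv layer, e.g. (2048, 7, 7).
            # fc_weights: (num_classes, C) weights of the linear layer after GAP.
            w = fc_weights[class_idx]                     # (C,)
            cam = torch.einsum("c,chw->hw", w, features)  # weighted sum of feature maps
            cam = torch.relu(cam)
            return cam / (cam.max() + 1e-8)               # normalize to [0, 1]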
    [D] Library to transfer PyTorch to TF
    Have any of you ever successfully converted a PyTorch model to a TensorFlow model? If so, what can you recommend? There seem to be a few ways, but none work very well and each has its caveats. The lowest-hanging fruit, pytorch_to_keras, does not work for current versions of TF. ONNX seems like an option, since exporting from PyTorch to ONNX is easy and maintained by PyTorch. But looking closely, it's not: onnx_to_keras suffers from the same issues as pytorch_to_keras. Then there is also the "recommended" way of using ONNX's export_graph. This method requires a tf.SavedModel to load, and trainable variables are lost in the process -- which is fine for deploying, but not for fine-tuning. So does anyone have recommendations on how to transfer models between PyTorch and TF? submitted by /u/slater_kelevra  ( 2 min )
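    For reference, here is the PyTorch-to-ONNX leg, which is maintained in PyTorch itself; the ONNX-to-TF leg (e.g. via onnx-tf) is where the caveats above bite.

        import torch
        import torchvision

        model = torchvision.models.resnet18(pretrained=True).eval()
        dummy = torch.randn(1, 3, 224, 224)  # example input pins down graph shapes
        torch.onnx.export(model, dummy, "resnet18.onnx",
                          input_names=["input"], output_names=["logits"],
                          opset_version=13)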
    [P] Swin Transformer V2 codes and models released
    https://github.com/microsoft/Swin-Transformer The ImageNet-22K pretrained Swin-V1-Tiny and Swin-V1-Small models are also released submitted by /u/ancientmooner
    [D] [P] Finding correlation or clusters in a data-warehouse containing categorical data.
    Hi, I have a lot of structured data whose points should be (and mostly are) random -- i.e., we don't want clusters of points in space; the data are and should be sparse. The data come from a dataset that had leak issues. The problem statement: find whether any subset of the data shares the same root cause (e.g., the same brand or the same country caused the issue). There's a mixture of numerical and categorical columns (e.g., size, lat, long; country, brand, priority, city). Assume there are around 20+ dimensions (columns). My first approach was to check whether the points form any cluster. To do so, I was going to use a density-based clustering algorithm, since you don't have to specify the number of clusters and it will just ignore noise (which should be most of the points). But it's hard to preserve the meaning of the columns under dimensionality reduction, and clustering directly in 20 dimensions would take forever. We don't know what could cause the leaks, and we don't know which columns are important. What's an alternative approach or a better solution? Thanks submitted by /u/micdean19  ( 1 min )
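    One concrete starting point along the lines described, as a sketch with hypothetical file and column names: one-hot encode the categoricals, scale the numerics, and run DBSCAN, which labels most points as noise and surfaces any dense subsets.

        import pandas as pd
        from sklearn.cluster import DBSCAN
        from sklearn.compose import ColumnTransformer
        from sklearn.preprocessing import OneHotEncoder, StandardScaler

        df = pd.read_csv("leaks.csv")  # hypothetical file
        numeric = ["size", "lat", "long"]
        categorical = ["country", "brand", "priority", "city"]

        pre = ColumnTransformer([
            ("num", StandardScaler(), numeric),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ])
        X = pre.fit_transform(df)

        # eps and min_samples need tuning; label -1 marks noise, which should be
        # most of the points if the data are truly sparse.
        df["cluster"] = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
        print(df["cluster"].value_counts())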
    [D] What are the heuristics for setting good baselines when experimenting?
    Hello everyone. Master's student here, looking for advice about experimentation. Let's say you want to show that some architectural change or tweak to a training technique can improve model performance relative to a "vanilla" baseline. For example, this paper introduces a normalization technique and measures its effectiveness vs. batchnorm on ImageNet and COCO. Maybe you're trying to find such an improvement somewhere else. If you aren't trying to push the state of the art, but rather trying to show the validity of an idea that comes from theory, or just show that X can work better than Y, you probably aren't going to use layers and layers of training techniques to get the best baseline possible: student-teacher networks, stochastic weight averaging, the endless possible training tricks. But the space of possible things to do to set your baseline is huge. For example, if you have your VGG16 model and CIFAR-10, do you train it with a cyclic learning rate schedule? A flat learning rate schedule? How do you compare your augmentation pipeline to other papers'? How much accuracy is enough for you to be content with your baseline? Many papers that report promising results on some architectural change don't publish their training code; how did they decide how to set up their training loops? As far as I can tell, there's no guidebook that says "here are the learning rates, augmentations, and training techniques that give a reasonable baseline on ImageNet for a VGG19 model, a ResNet model, etc." The possibilities are paralyzing. So how do you find a good baseline with so many possible training configurations? submitted by /u/Rawr0s  ( 1 min )
  • Open

    Humans vs. DALL·E — Where do human artists fit in a world of rich, creative AI?
    submitted by /u/ML_Firefighter [link] [comments]  ( 3 min )
    State of the art: Text generation
Given an input (say: "How can we solve world hunger?"), what is the current state of the art when it comes to AI outputting an answer (or multiple answers)? Added to that, what could an enthusiast access? submitted by /u/mister_patience [link] [comments]  ( 1 min )
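On the enthusiast-access side of the question, open models can be tried locally through the Hugging Face transformers library; a minimal sketch (the model choice and sampling settings are illustrative):

```python
from transformers import pipeline

# GPT-2 is small enough to run on a laptop CPU; larger open models
# give better answers at higher hardware cost.
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "How can we solve world hunger?",
    max_length=60,
    num_return_sequences=3,   # multiple candidate answers
    do_sample=True,
)
for out in outputs:
    print(out["generated_text"])
```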
    Most important AI news from Google I/O 2022: neural rendering for Maps, Deepmind for YouTube, CTRL+F for real life, LaMDA2, new AI test app
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 1 min )
    Introducing Gato - a generalist agent from DeepMind
    submitted by /u/Yasuuuya [link] [comments]  ( 1 min )
New major release for nebullvm, an open-source library to speed up AI inference by leveraging state-of-the-art optimization techniques (deep learning compilers, and now also quantization and half precision, and soon also sparsity, distillation, etc.)
nebullvm is an open-source library that generates an optimized version of your deep learning model that runs 2-10 times faster in inference without performance loss, by leveraging multiple deep learning compilers (OpenVINO, TensorRT, etc.). And thanks to today's new release, nebullvm can accelerate models up to 30x if you specify that you are willing to trade off a self-defined amount of accuracy/precision to get even lower response time and a lighter model. This additional acceleration is achieved by exploiting optimization techniques that slightly modify the model graph to make it lighter, such as quantization, half precision, distillation, sparsity, etc. The goal of nebullvm is to help other developers benefit from the most advanced inference optimization techniques without having to spend countless hours understanding, installing, testing and debugging these powerful technologies. Hoping you enjoy the project, and please give feedback if you have any. You can also find more information (benchmarks, tutorials, notebooks) on GitHub! And happy acceleration :) https://github.com/nebuly-ai/nebullvm submitted by /u/emilec___ [link] [comments]  ( 1 min )
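As a rough sketch of how the library is invoked (the entry point optimize_torch_model and its arguments are an assumption based on the project's README at the time of this release; check the repo for the current API):

```python
import torch
from torchvision.models import resnet50
from nebullvm import optimize_torch_model  # assumed entry point; see repo

model = resnet50()

# Assumed call: searches across installed compilers (OpenVINO,
# TensorRT, ...) and returns the fastest variant found.
optimized_model = optimize_torch_model(
    model,
    batch_size=1,
    input_sizes=[(3, 224, 224)],
    save_dir=".",
)
```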
    The superior power of artificial intelligence hiveminds
So Tesla is making a robot AI. At first it'll just be a tool, and this tool will be limited. But I was thinking, similarly to I, Robot, these devices will have a hive mind (like Tesla cars) and will share experience and learning. This means if there are 10k robots in the wild, they could hypothetically teach each other 10k new skills every half hour or so. For example, if one is cooking and burns a sandwich, the rest will never burn a sandwich under similar scenarios. We can't fathom what an AI hive mind is really like. Do any of you see any pros and cons to this? Currently we train AI via millions of iterations as fast as possible and go from there. But imagine a million AIs constantly learning and sharing. It's a mystical concept only years away. Thoughts? P.S. I came here because I couldn't find anything on Google about this topic; if you have a paper or talk to share, please do. submitted by /u/NigraOvis [link] [comments]  ( 1 min )
  • Open

    Image classification and object detection using Amazon Rekognition Custom Labels and Amazon SageMaker JumpStart
    In the last decade, computer vision use cases have been a growing trend, especially in industries like insurance, automotive, ecommerce, energy, retail, manufacturing, and others. Customers are building computer vision machine learning (ML) models to bring operational efficiencies and automation to their processes. Such models help automate the classification of images or detection of objects […]  ( 5 min )
    Intelligently search your Jira projects with Amazon Kendra Jira cloud connector
    Organizations use agile project management platforms such as Atlassian Jira to enable teams to collaborate to plan, track, and ship deliverables. Jira captures organizational knowledge about the workings of the deliverables in the issues and comments logged during project implementation. However, making this knowledge easily and securely available to users is challenging due to it […]  ( 7 min )
The Intel® 3D Athlete Tracking (3DAT) scalable architecture deploys pose estimation models using Amazon Kinesis Data Streams and Amazon EKS
This blog post is co-written by Jonathan Lee, Nelson Leung, Paul Min, and Troy Squillaci from Intel. In Part 1 of this post, we discussed how Intel® 3DAT collaborated with AWS Machine Learning Professional Services (MLPS) to build a scalable AI SaaS application. 3DAT uses computer vision and AI to recognize, track, and analyze over 1,000 […]  ( 17 min )
    Moderate, classify, and process documents using Amazon Rekognition and Amazon Textract
    Many companies are overwhelmed by the abundant volume of documents they have to process, organize, and classify to serve their customers better. Examples of such can be loan applications, tax filing, and billing. Such documents are more commonly received in image formats and are mostly multi-paged and in low-quality format. To be more competitive and […]  ( 8 min )
  • Open

    Common Sense Machine Learning
There are different ways to define common sense machine learning. It could mean using simple models whenever possible, avoiding overfitting, correctly selecting features, or doing cross-validation the right way. Or it could mean not using any data set, yet making predictions far outperforming those from smart teams of data scientists working on large data sets. Read More » The post Common Sense Machine Learning appeared first on Data Science Central.  ( 6 min )
  • Open

    Gato the Generalist Agent
What are some of your thoughts on the paper (https://dpmd.ai/Gato-paper) by DeepMind that uses a single network to play Atari, caption images, chat, and stack blocks with a real robot arm? submitted by /u/blitzkreig3 [link] [comments]  ( 1 min )
Inverse RL for a non-optimal agent
    Inverse reinforcement learning usually assumes that the agent being learned from is behaving optimally. Has there been work done on using inverse reinforcement learning to learn how an agent learns? submitted by /u/sdrinz [link] [comments]  ( 1 min )
    The Fast Deep RL Course: Learn to Build Powerful Deep RL Agents in Just 4 Hours
I am happy to announce The Fast Deep RL Course. This course is made for data scientists/ML engineers who are excited about Deep RL and are looking for a short and practical introduction to the topic. It covers everything you need to get started with practical applications. The course takes advantage of the powerful capabilities of Ray-RLlib, a production-grade Deep RL framework. Ray-RLlib's high-level interface can be learned quickly and will allow you to apply Deep RL to various problems after you finish the course. It also provides a simple path for further learning: since Ray-RLlib is likely to remain the de facto Deep RL framework in the industry, you don't need to change your tools when you want to learn more. You simply study the lower-level interfaces of the same tool. The course is modeled after the engaging style of Datacamp and Codecademy, consisting of short videos followed by coding exercises where you can try out what you learned. I have also spent some time making it accessible: we use small neural nets that can be trained on a CPU, so a decent laptop is all you need. I have tested the code on various Linux distros, Mac (including M1) and Windows, which means you can simply use your regular OS. All videos come with high-quality captions. The course is free for early supporters (defined as anyone enrolling within the next month). Thanks for trying it out. I will be happy to discuss and answer any questions. submitted by /u/rroocckk [link] [comments]  ( 2 min )
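For a taste of the high-level interface the course is built around, a minimal RLlib sketch against the 2022-era API (newer Ray versions rename these classes):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer  # 2022-era API

ray.init()

trainer = PPOTrainer(config={
    "env": "CartPole-v1",
    "framework": "torch",
    "num_workers": 1,
})

# Each call runs one training iteration and returns metrics.
for _ in range(5):
    result = trainer.train()
    print(result["episode_reward_mean"])
```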
    "Learning Accurate Long-term Dynamics for Model-based Reinforcement Learning", Lambert et al 2020
    submitted by /u/gwern [link] [comments]
    Ray RLlib runs more workers than given.
I want to start Ray Tune with just the local worker/learner and give it 12 CPUs and one GPU. How do I achieve that? I have in my config: num_gpus: 1, num_cpus_for_driver: 12, num_workers: 0. However, it starts 12 workers with 1 CPU and 0 GPUs each. submitted by /u/Defiant_Sun5579 [link] [comments]  ( 1 min )
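For reference, the intended configuration would normally be passed along these lines (a sketch against 2022-era Ray Tune; with num_workers: 0, sampling is expected to run on the local worker only):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",
        "num_workers": 0,           # no rollout workers: local worker samples
        "num_gpus": 1,              # GPU for the local worker/driver
        "num_cpus_for_driver": 12,  # CPUs reserved for the driver
    },
    stop={"training_iteration": 1},
)
```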
Using stable baselines, why do you have to use a lambda function when wrapping an environment?
I have been searching for an hour and can't find it. It seems so silly to me. I have accepted it, but would really like to know the logic behind it. Example: DummyVecEnv([lambda: env]) works, while DummyVecEnv([env]) does not. submitted by /u/Jobdriaan [link] [comments]  ( 1 min )
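The short answer is that the vectorized wrappers expect environment factories (callables) rather than environment instances, so each copy (or each subprocess, in SubprocVecEnv) can construct its own env. A minimal sketch with stable-baselines3 (the original stable-baselines is analogous):

```python
import gym
from stable_baselines3.common.vec_env import DummyVecEnv

def make_env():
    # Called by the wrapper to construct each environment copy;
    # SubprocVecEnv relies on this to build fresh envs in subprocesses.
    return gym.make("CartPole-v1")

vec_env = DummyVecEnv([make_env])          # one factory per env copy
vec_env_4 = DummyVecEnv([make_env] * 4)    # four parallel copies

# `lambda: env` works for a single pre-built env because it turns the
# instance into a callable, which is why the factory form is required.
```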
  • Open

    Challenges in Multi-objective Optimization for Automatic Wireless Network Planning
Posted by Sara Ahmadian and Matthew Fahrbach, Research Scientists, Google Research, Large-Scale Optimization Team Economics, combinatorics, physics, and signal processing conspire to make it difficult to design, build, and operate high-quality, cost-effective wireless networks. The radio transceivers that communicate with our mobile phones, the equipment that supports them (such as power and wired networking), and the physical space they occupy are all expensive, so it’s important to be judicious in choosing sites for new transceivers. Even when the set of available sites is limited, there are exponentially many possible networks that can be built. For example, given only 50 sites, there are 2^50 (over a million billion) possibilities! Further complicating things, for every location wher…  ( 8 min )
  • Open

    Urban Jungle: AI-Generated Endangered Species Mix With Times Square’s Nightlife
    Bengal tigers, red pandas and mountain gorillas are among the world’s most familiar endangered species, but tens of thousands of others — like the Karpathos frog, the Perote deer mouse or the Mekong giant catfish — are largely unknown. Typically perceived as lacking star quality, these species are now roaming massive billboards in one of Read article > The post Urban Jungle: AI-Generated Endangered Species Mix With Times Square’s Nightlife appeared first on NVIDIA Blog.  ( 4 min )
    GFN Thursday Gets Groovy As ‘Evil Dead: The Game’ Marks 1,300 Games on GeForce NOW
    Good. Bad. You’re the Guy With the Gun this GFN Thursday. Get ready for some horrifyingly good fun with Evil Dead: The Game streaming on GeForce NOW tomorrow at release. It’s the 1,300th game to join GeForce NOW, joining on Friday the 13th. And it’s part of eight total games joining the GeForce NOW library Read article > The post GFN Thursday Gets Groovy As ‘Evil Dead: The Game’ Marks 1,300 Games on GeForce NOW appeared first on NVIDIA Blog.  ( 2 min )
  • Open

    Tool recursion
    “Literature about Lisp rarely resists that narcissistic pleasure of describing Lisp in Lisp.” — Christian Queinnec, Lisp in Small Pieces   Applying software development tools to themselves has a dark side and a light side. There’s a danger of becoming obsessed with one’s tools and never getting around to using them. If it’s your job […] Tool recursion first appeared on John D. Cook.  ( 2 min )
  • Open

    The Baltimore Orioles Effect
    Back when the text-generating neural network GPT-2 was released, OpenAI released it in stages, in part for fear that people might use the more advanced models to generate misinformation. Now in 2022 we do indeed have people passing off AI-written text as human, but rather than being divisive, it’  ( 4 min )
    Bonus: Questionable blog facts
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    Why Machine Learning Projects Fail- 7 Reasons that can Take Your Efforts for a Ride?
    Your new Machine Learning project is about to fail. Yes, you read that right.  ( 4 min )
  • Open

    Technique protects privacy when making online recommendations
    Researchers devise an efficient protocol to keep a user’s private information secure when algorithms use it to recommend products, songs, or shows.  ( 6 min )
  • Open

    How Do Vision Transformers Work?. (arXiv:2202.06709v3 [cs.CV] UPDATED)
    The success of multi-head self-attentions (MSAs) for computer vision is now indisputable. However, little is known about how MSAs work. We present fundamental explanations to help better understand the nature of MSAs. In particular, we demonstrate the following properties of MSAs and Vision Transformers (ViTs): (1) MSAs improve not only accuracy but also generalization by flattening the loss landscapes. Such improvement is primarily attributable to their data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses. Large datasets and loss landscape smoothing methods alleviate this problem; (2) MSAs and Convs exhibit opposite behaviors. For example, MSAs are low-pass filters, but Convs are high-pass filters. Therefore, MSAs and Convs are complementary; (3) Multi-stage neural networks behave like a series connection of small individual models. In addition, MSAs at the end of a stage play a key role in prediction. Based on these insights, we propose AlterNet, a model in which Conv blocks at the end of a stage are replaced with MSA blocks. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes. The code is available at https://github.com/xxxnell/how-do-vits-work.  ( 2 min )
    Causal Inference Struggles with Agency on Online Platforms. (arXiv:2107.08995v2 [cs.LG] UPDATED)
    Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. In all cases except one, the observational estimates have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate a "Catch-22" that suggests that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency.  ( 2 min )
    How Platform-User Power Relations Shape Algorithmic Accountability: A Case Study of Instant Loan Platforms and Financially Stressed Users in India. (arXiv:2205.05661v1 [cs.HC])
    Accountability, a requisite for responsible AI, can be facilitated through transparency mechanisms such as audits and explainability. However, prior work suggests that the success of these mechanisms may be limited to Global North contexts; understanding the limitations of current interventions in varied socio-political conditions is crucial to help policymakers facilitate wider accountability. To do so, we examined the mediation of accountability in the existing interactions between vulnerable users and a 'high-risk' AI system in a Global South setting. We report on a qualitative study with 29 financially-stressed users of instant loan platforms in India. We found that users experienced intense feelings of indebtedness for the 'boon' of instant loans, and perceived huge obligations towards loan platforms. Users fulfilled obligations by accepting harsh terms and conditions, over-sharing sensitive data, and paying high fees to unknown and unverified lenders. Users demonstrated a dependence on loan platforms by persisting with such behaviors despite risks of harms such as abuse, recurring debts, discrimination, privacy harms, and self-harm to them. Instead of being enraged with loan platforms, users assumed responsibility for their negative experiences, thus releasing the high-powered loan platforms from accountability obligations. We argue that accountability is shaped by platform-user power relations, and urge caution to policymakers in adopting a purely technical approach to fostering algorithmic accountability. Instead, we call for situated interventions that enhance agency of users, enable meaningful transparency, reconfigure designer-user relations, and prompt a critical reflection in practitioners towards wider accountability. We conclude with implications for responsibly deploying AI in FinTech applications in India and beyond.  ( 2 min )
    CMOS Circuits for Shape-Based Analog Machine Learning. (arXiv:2202.05022v1 [cs.ET] CROSS LISTED)
    While analog computing is attractive for implementing machine learning (ML) processors, the paradigm requires chip-in-the-loop training for every processor to alleviate artifacts due to device mismatch and device non-linearity. Speeding up chip-in-the-loop training requires re-biasing the circuits in a manner that the analog functions remain invariant across training and inference. In this paper, we present an analog computational paradigm and circuits using "shape" functions that remain invariant to transistor biasing (weak, moderate, and strong inversion) and ambient temperature variation. We show that a core Shape-based Analog Compute (S-AC) circuit could be re-biased and reused to implement: (a) non-linear functions; (b) inner-product building blocks; and (c) a mixed-signal logarithmic memory, all of which are integral towards designing an ML inference processor. Measured results using a prototype fabricated in a 180nm standard CMOS process demonstrate bias invariance and hence the resulting analog designs can be scaled for power and speed like digital logic circuits. We also demonstrate a regression task using these CMOS building blocks.  ( 2 min )
    Incident duration prediction using a bi-level machine learning framework with outlier removal and intra-extra joint optimisation. (arXiv:2205.05197v1 [cs.LG])
Predicting the duration of traffic incidents is a challenging task due to the stochastic nature of events. The ability to accurately predict how long accidents will last can provide significant benefits to both end-users in their route choice and traffic operation managers in handling of non-recurrent traffic congestion. This paper presents a novel bi-level machine learning framework enhanced with outlier removal and intra-extra joint optimisation for predicting the incident duration on three heterogeneous data sets collected for both arterial roads and motorways from Sydney, Australia and San Francisco, U.S.A. Firstly, we use incident data logs to develop a binary classification prediction approach, which allows us to classify traffic incidents as short-term or long-term. We find the optimal threshold between short-term versus long-term traffic incident duration, targeting both class balance and prediction performance while also comparing the binary versus multi-class classification approaches. Secondly, for more granularity of the incident duration prediction to the minute level, we propose a new Intra-Extra Joint Optimisation algorithm (IEO-ML) which extends multiple baseline ML models tested against several regression scenarios across the data sets. Final results indicate that: a) 40-45 min is the best split threshold for identifying short versus long-term incidents and that these incidents should be modelled separately, b) our proposed IEO-ML approach significantly outperforms baseline ML models in $66\%$ of all cases showcasing its great potential for accurate incident duration prediction. Lastly, we evaluate the feature importance and show that time, location, incident type, incident reporting source and weather are among the top 10 critical factors which influence how long incidents will last.  ( 2 min )
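As a rough illustration of the bi-level structure described above, here is a sketch with generic scikit-learn models standing in for the paper's classifiers and IEO-ML regressors; the 45-minute threshold follows the reported finding, and all data below is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

THRESHOLD = 45  # minutes; short-term vs long-term split (per the paper)

def fit_bilevel(X, durations):
    is_long = (durations > THRESHOLD).astype(int)
    clf = GradientBoostingClassifier().fit(X, is_long)
    # Separate regressors for short- and long-term incidents, reflecting
    # the finding that the two regimes should be modelled separately.
    reg_short = GradientBoostingRegressor().fit(X[is_long == 0], durations[is_long == 0])
    reg_long = GradientBoostingRegressor().fit(X[is_long == 1], durations[is_long == 1])
    return clf, reg_short, reg_long

def predict_bilevel(model, X):
    clf, reg_short, reg_long = model
    long_mask = clf.predict(X).astype(bool)
    out = np.empty(len(X))
    if (~long_mask).any():
        out[~long_mask] = reg_short.predict(X[~long_mask])
    if long_mask.any():
        out[long_mask] = reg_long.predict(X[long_mask])
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
durations = np.exp(rng.normal(3.5, 0.8, size=500))  # skewed, in minutes
model = fit_bilevel(X, durations)
print(predict_bilevel(model, X[:5]))
```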
    A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning. (arXiv:2205.05212v1 [cs.LG])
    While reinforcement learning (RL) provides a framework for learning through trial and error, translating RL algorithms into the real world has remained challenging. A major hurdle to real-world application arises from the development of algorithms in an episodic setting where the environment is reset after every trial, in contrast with the continual and non-episodic nature of the real-world encountered by embodied agents such as humans and robots. Prior works have considered an alternating approach where a forward policy learns to solve the task and the backward policy learns to reset the environment, but what initial state distribution should the backward policy reset the agent to? Assuming access to a few demonstrations, we propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations. This keeps the agent close to the task-relevant states, allowing for a mix of easy and difficult starting states for the forward policy. Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks from the EARL benchmark, with 40% gains on the hardest task, while making fewer assumptions than prior works.  ( 2 min )
    To SMOTE, or not to SMOTE?. (arXiv:2201.08528v3 [cs.LG] UPDATED)
Balancing the data before training a classifier is a popular technique to address the challenges of imbalanced binary classification in tabular data. Balancing is commonly achieved by duplication of minority samples or by generation of synthetic minority samples. While it is well known that balancing affects each classifier differently, most prior empirical studies did not include strong state-of-the-art (SOTA) classifiers as baselines. In this work, we are interested in understanding whether balancing is beneficial, particularly in the context of SOTA classifiers. Thus, we conduct extensive experiments considering three SOTA classifiers alongside the weaker learners used in previous investigations. Additionally, we carefully discern proper metrics, consistent and non-consistent algorithms and hyper-parameter selection methods and show that these have a significant impact on prediction quality and on the effectiveness of balancing. Our results support the known utility of balancing for weak classifiers. However, we find that balancing does not improve prediction performance for the strong ones. We further identify several other scenarios for which balancing is effective and observe that prior studies demonstrated the utility of balancing by focusing on these settings.  ( 2 min )
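For context, the synthetic-minority balancing studied here is typically a one-liner with the imbalanced-learn library; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced binary tabular data (roughly 10% minority class).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE generates synthetic minority samples by interpolating between
# a minority point and its nearest minority-class neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(y.mean(), y_res.mean())   # class balance before and after
```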
    A Continual Deepfake Detection Benchmark: Dataset, Methods, and Essentials. (arXiv:2205.05467v1 [cs.CV])
A number of benchmarks and techniques for the detection of deepfakes have been emerging. However, very few works study the detection of incrementally appearing deepfakes in real-world scenarios. To simulate such wild scenes, this paper suggests a continual deepfake detection benchmark (CDDB) over a new collection of deepfakes from both known and unknown generative models. The suggested CDDB designs multiple evaluations on the detection over easy, hard, and long sequences of deepfake tasks, with a set of appropriate measures. In addition, we exploit multiple approaches to adapt multiclass incremental learning methods, commonly used in continual visual recognition, to the continual deepfake detection problem. We evaluate several methods, including the adapted ones, on the proposed CDDB. Within the proposed benchmark, we explore some commonly known essentials of standard continual learning. Our study provides new insights on these essentials in the context of continual deepfake detection. The suggested CDDB is clearly more challenging than the existing benchmarks, and thus offers a suitable evaluation avenue for future research. Our benchmark dataset and the source code will be made publicly available.  ( 2 min )
    Probability Distribution of Hypervolume Improvement in Bi-objective Bayesian Optimization. (arXiv:2205.05505v1 [cs.LG])
This work provides the exact expression of the probability distribution of the hypervolume improvement (HVI) for the bi-objective generalization of Bayesian optimization. Here, instead of a single-objective improvement, we consider the improvement of the hypervolume indicator concerning the current best approximation of the Pareto front. Gaussian process regression models are trained independently on both objective functions, resulting in a bi-variate separated Gaussian distribution serving as a predictive model for the vector-valued objective function. Some common HVI-based acquisition functions (probability of improvement and upper confidence bound) are also leveraged with the help of the exact distribution of HVI. In addition, we show the superior numerical accuracy and efficiency of the exact distribution compared to the commonly used approximation by Monte-Carlo sampling. Finally, we benchmark distribution-leveraged acquisition functions on the widely applied ZDT problem set, demonstrating a significant advantage of using the exact distribution of HVI in multi-objective Bayesian optimization.  ( 2 min )
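For reference, the Monte-Carlo approximation that the exact distribution is compared against can be sketched as follows: sample the two objectives from the (here independent) Gaussian predictive model and average the resulting hypervolume improvements. This is a toy minimization setup with a made-up Pareto front, not the paper's code:

```python
import numpy as np

def pareto_filter(pts):
    """Keep only non-dominated points (minimization in both objectives)."""
    keep = [p for p in pts
            if not np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))]
    return np.array(keep)

def hypervolume_2d(points, ref):
    """Hypervolume of a 2D point set w.r.t. reference point ref (minimization)."""
    pts = pareto_filter(np.minimum(np.asarray(points, float), ref))
    pts = pts[np.argsort(pts[:, 0])]
    widths = np.diff(np.append(pts[:, 0], ref[0]))
    return float(np.sum(widths * (ref[1] - pts[:, 1])))

def mc_expected_hvi(mu, sigma, front, ref, n_samples=10_000, seed=0):
    """Monte-Carlo estimate of the expected HVI of one candidate whose
    two objectives are modelled as independent Gaussians."""
    rng = np.random.default_rng(seed)
    base = hypervolume_2d(front, ref)
    samples = rng.normal(mu, sigma, size=(n_samples, 2))
    return float(np.mean([hypervolume_2d(np.vstack([front, s]), ref) - base
                          for s in samples]))

# Toy current Pareto front and a candidate with GP-predicted mean/std.
front = np.array([[0.2, 0.9], [0.5, 0.6], [0.9, 0.2]])
ref = np.array([1.5, 1.5])
print(mc_expected_hvi(mu=[0.4, 0.4], sigma=[0.1, 0.1], front=front, ref=ref))
```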
    Efficient Risk-Averse Reinforcement Learning. (arXiv:2205.05138v1 [cs.LG])
    In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns. A risk measure often focuses on the worst returns out of the agent's experience. As a result, standard methods for risk-averse RL often ignore high-return strategies. We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it. We also devise a novel Cross Entropy module for risk sampling, which (1) preserves risk aversion despite the soft risk; (2) independently improves sample efficiency. By separating the risk aversion of the sampler and the optimizer, we can sample episodes with poor conditions, yet optimize with respect to successful strategies. We combine these two concepts in CeSoR - Cross-entropy Soft-Risk optimization algorithm - which can be applied on top of any risk-averse policy gradient (PG) method. We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails.  ( 2 min )
    Developing cooperative policies for multi-stage reinforcement learning tasks. (arXiv:2205.05230v1 [cs.LG])
    Many hierarchical reinforcement learning algorithms utilise a series of independent skills as a basis to solve tasks at a higher level of reasoning. These algorithms don't consider the value of using skills that are cooperative instead of independent. This paper proposes the Cooperative Consecutive Policies (CCP) method of enabling consecutive agents to cooperatively solve long time horizon multi-stage tasks. This method is achieved by modifying the policy of each agent to maximise both the current and next agent's critic. Cooperatively maximising critics allows each agent to take actions that are beneficial for its task as well as subsequent tasks. Using this method in a multi-room maze domain and a peg in hole manipulation domain, the cooperative policies were able to outperform a set of naive policies, a single agent trained across the entire domain, as well as another sequential HRL algorithm.  ( 2 min )
    AutoKE: An automatic knowledge embedding framework for scientific machine learning. (arXiv:2205.05390v1 [cs.LG])
    Imposing physical constraints on neural networks as a method of knowledge embedding has achieved great progress in solving physical problems described by governing equations. However, for many engineering problems, governing equations often have complex forms, including complex partial derivatives or stochastic physical fields, which results in significant inconveniences from the perspective of implementation. In this paper, a scientific machine learning framework, called AutoKE, is proposed, and a reservoir flow problem is taken as an instance to demonstrate that this framework can effectively automate the process of embedding physical knowledge. In AutoKE, an emulator comprised of deep neural networks (DNNs) is built for predicting the physical variables of interest. An arbitrarily complex equation can be parsed and automatically converted into a computational graph through the equation parser module, and the fitness of the emulator to the governing equation is evaluated via automatic differentiation. Furthermore, the fixed weights in the loss function are substituted with adaptive weights by incorporating the Lagrangian dual method. Neural architecture search (NAS) is also introduced into the AutoKE to select an optimal network architecture of the emulator according to the specific problem. Finally, we apply transfer learning to enhance the scalability of the emulator. In experiments, the framework is verified by a series of physical problems in which it can automatically embed physical knowledge into an emulator without heavy hand-coding. The results demonstrate that the emulator can not only make accurate predictions, but also be applied to similar problems with high efficiency via transfer learning.  ( 2 min )
    Robustness of Humans and Machines on Object Recognition with Extreme Image Transformations. (arXiv:2205.05167v1 [cs.CV])
Recent neural network architectures have claimed to explain data from the human visual cortex. Their demonstrated performance is, however, still limited by the dependence on exploiting low-level features for solving visual tasks. This strategy limits their performance in the case of out-of-distribution/adversarial data. Humans, meanwhile, learn abstract concepts and are mostly unaffected by even extreme image distortions. Humans and networks employ strikingly different strategies to solve visual tasks. To probe this, we introduce a novel set of image transforms and evaluate humans and networks on an object recognition task. We find that performance for a few common networks quickly decreases, while humans are able to recognize objects with high accuracy.  ( 2 min )
    AutoTransfer: Subject Transfer Learning with Censored Representations on Biosignals Data. (arXiv:2112.09796v2 [cs.LG] UPDATED)
    We provide a regularization framework for subject transfer learning in which we seek to train an encoder and classifier to minimize classification loss, subject to a penalty measuring independence between the latent representation and the subject label. We introduce three notions of independence and corresponding penalty terms using mutual information or divergence as a proxy for independence. For each penalty term, we provide several concrete estimation algorithms, using analytic methods as well as neural critic functions. We provide a hands-off strategy for applying this diverse family of regularization algorithms to a new dataset, which we call "AutoTransfer". We evaluate the performance of these individual regularization strategies and our AutoTransfer method on EEG, EMG, and ECoG datasets, showing that these approaches can improve subject transfer learning for challenging real-world datasets.  ( 2 min )
    A Machine Learning Analysis of COVID-19 Mental Health Data. (arXiv:2112.00227v2 [cs.LG] UPDATED)
    In late December 2019, the novel coronavirus (Sars-Cov-2) and the resulting disease COVID-19 were first identified in Wuhan China. The disease slipped through containment measures, with the first known case in the United States being identified on January 20th, 2020. In this paper, we utilize survey data from the Inter-university Consortium for Political and Social Research and apply several statistical and machine learning models and techniques such as Decision Trees, Multinomial Logistic Regression, Naive Bayes, k-Nearest Neighbors, Support Vector Machines, Neural Networks, Random Forests, Gradient Tree Boosting, XGBoost, CatBoost, LightGBM, Synthetic Minority Oversampling, and Chi-Squared Test to analyze the impacts the COVID-19 pandemic has had on the mental health of frontline workers in the United States. Through the interpretation of the many models applied to the mental health survey data, we have concluded that the most important factor in predicting the mental health decline of a frontline worker is the healthcare role the individual is in (Nurse, Emergency Room Staff, Surgeon, etc.), followed by the amount of sleep the individual has had in the last week, the amount of COVID-19 related news an individual has consumed on average in a day, the age of the worker, and the usage of alcohol and cannabis.  ( 2 min )
    Multifidelity data fusion in convolutional encoder/decoder networks. (arXiv:2205.05187v1 [cs.LG])
    We analyze the regression accuracy of convolutional neural networks assembled from encoders, decoders and skip connections and trained with multifidelity data. Besides requiring significantly less trainable parameters than equivalent fully connected networks, encoder, decoder, encoder-decoder or decoder-encoder architectures can learn the mapping between inputs to outputs of arbitrary dimensionality. We demonstrate their accuracy when trained on a few high-fidelity and many low-fidelity data generated from models ranging from one-dimensional functions to Poisson equation solvers in two-dimensions. We finally discuss a number of implementation choices that improve the reliability of the uncertainty estimates generated by Monte Carlo DropBlocks, and compare uncertainty estimates among low-, high- and multifidelity approaches.  ( 2 min )
    Unsupervised machine learning for physical concepts. (arXiv:2205.05279v1 [cs.LG])
In recent years, machine learning methods have been used to assist scientists in scientific research. Human scientific theories are based on a series of concepts, so how a machine learns concepts from experimental data is an important first step. We propose a hybrid method to extract interpretable physical concepts through unsupervised machine learning. This method consists of two stages. First, we find the Betti numbers of the experimental data. Second, given the Betti numbers, we use a variational autoencoder network to extract meaningful physical variables. We test our protocol on toy models and show how it works.  ( 2 min )
    Learning and Evaluating Graph Neural Network Explanations based on Counterfactual and Factual Reasoning. (arXiv:2202.08816v3 [cs.IR] UPDATED)
Structural data is ubiquitous in Web applications, such as social networks in social media, citation networks on academic websites, and thread data in online forums. Due to the complex topology, it is difficult to process and make use of the rich information within such data. Graph Neural Networks (GNNs) have shown great advantages in learning representations for structural data. However, the non-transparency of deep learning models makes it non-trivial to explain and interpret the predictions made by GNNs. Meanwhile, it is also a big challenge to evaluate GNN explanations, since in many cases the ground-truth explanations are unavailable. In this paper, we take insights from Counterfactual and Factual (CF^2) reasoning in causal inference theory to solve both the learning and evaluation problems in explainable GNNs. For generating explanations, we propose a model-agnostic framework by formulating an optimization problem based on both of the two causal perspectives. This distinguishes CF^2 from previous explainable GNNs that only consider one of them. Another contribution of the work is the evaluation of GNN explanations. For quantitatively evaluating the generated explanations without the requirement of ground truth, we design metrics based on Counterfactual and Factual reasoning to evaluate the necessity and sufficiency of the explanations. Experiments show that, whether or not ground-truth explanations are available, CF^2 generates better explanations than previous state-of-the-art methods on real-world datasets. Moreover, the statistical analysis justifies the correlation between the performance on ground-truth evaluation and our proposed metrics. Source code is available at https://github.com/chrisjtan/gnn_cff.
    Self-Supervised Anomaly Detection: A Survey and Outlook. (arXiv:2205.05173v1 [cs.LG])
    Over the past few years, anomaly detection, a subfield of machine learning that is mainly concerned with the detection of rare events, witnessed an immense improvement following the unprecedented growth of deep learning models. Recently, the emergence of self-supervised learning has sparked the development of new anomaly detection algorithms that surpassed state-of-the-art accuracy by a significant margin. This paper aims to review the current approaches in self-supervised anomaly detection. We present technical details of the common approaches and discuss their strengths and drawbacks. We also compare the performance of these models against each other and other state-of-the-art anomaly detection models. Finally, we discuss a variety of new directions for improving the existing algorithms.  ( 2 min )
    Keep Your Friends Close and Your Counterfactuals Closer: Improved Learning From Closest Rather Than Plausible Counterfactual Explanations in an Abstract Setting. (arXiv:2205.05515v1 [cs.AI])
Counterfactual explanations (CFEs) highlight what changes to a model's input would have changed its prediction in a particular way. CFEs have gained considerable traction as a psychologically grounded solution for explainable artificial intelligence (XAI). Recent innovations introduce the notion of computational plausibility for automatically generated CFEs, enhancing their robustness by exclusively creating plausible explanations. However, the practical benefits of such a constraint on user experience and behavior are as yet unclear. In this study, we evaluate objective and subjective usability of computationally plausible CFEs in an iterative learning design targeting novice users. We rely on a novel, game-like experimental design revolving around an abstract scenario. Our results show that novice users actually benefit less from receiving computationally plausible rather than closest CFEs that produce minimal changes leading to the desired outcome. Responses in a post-game survey reveal no differences in terms of subjective user experience between the two groups. Following the view of psychological plausibility as comparative similarity, this may be explained by the fact that users in the closest condition experience their CFEs as more psychologically plausible than the computationally plausible counterpart. In sum, our work highlights a little-considered divergence between definitions of computational plausibility and psychological plausibility, critically confirming the need to incorporate human behavior, preferences and mental models already at the design stages of XAI approaches. In the interest of reproducible research, all source code, acquired user data, and evaluation scripts of the current study are available: https://github.com/ukuhl/PlausibleAlienZoo
    ConfLab: A Rich Multimodal Multisensor Dataset of Free-Standing Social Interactions In-the-Wild. (arXiv:2205.05177v1 [cs.MM])
    We describe an instantiation of a new concept for multimodal multisensor data collection of real life in-the-wild free standing social interactions in the form of a Conference Living Lab (ConfLab). ConfLab contains high fidelity data of 49 people during a real-life professional networking event capturing a diverse mix of status, acquaintanceship, and networking motivations at an international conference. Recording such a dataset is challenging due to the delicate trade-off between participant privacy and fidelity of the data, and the technical and logistic challenges involved. We improve upon prior datasets in the fidelity of most of our modalities: 8-camera overhead setup, personal wearable sensors recording body motion (9-axis IMU), Bluetooth-based proximity, and low-frequency audio. Additionally, we use a state-of-the-art hardware synchronization solution and time-efficient continuous technique for annotating body keypoints and actions at high frequencies. We argue that our improvements are essential for a deeper study of interaction dynamics at finer time scales. Our research tasks showcase some of the open challenges related to in-the-wild privacy-preserving social data analysis: keypoints detection from overhead camera views, skeleton based no-audio speaker detection, and F-formation detection. With the ConfLab dataset, we aim to bridge the gap between traditional computer vision tasks and in-the-wild ecologically valid socially-motivated tasks.  ( 2 min )
    Exploring Local Explanations of Nonlinear Models Using Animated Linear Projections. (arXiv:2205.05359v1 [stat.ML])
    The increased predictive power of nonlinear models comes at the cost of interpretability of its terms. This trade-off has led to the emergence of eXplainable AI (XAI). XAI attempts to shed light on how models use predictors to arrive at a prediction with local explanations, a point estimate of the linear feature importance in the vicinity of one instance. These can be considered linear projections and can be further explored to understand better the interactions between features used to make predictions across the predictive model surface. Here we describe interactive linear interpolation used for exploration at any instance and illustrate with examples with categorical (penguin species, chocolate types) and quantitative (soccer/football salaries, house prices) output. The methods are implemented in the R package cheem, available on CRAN.  ( 2 min )
    An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers. (arXiv:2205.05543v1 [cs.CV])
    Self-supervised learning (SSL) methods such as masked language modeling have shown massive performance gains by pretraining transformer models for a variety of natural language processing tasks. The follow-up research adapted similar methods like masked image modeling in vision transformer and demonstrated improvements in the image classification task. Such simple self-supervised methods are not exhaustively studied for object detection transformers (DETR, Deformable DETR) as their transformer encoder modules take input in the convolutional neural network (CNN) extracted feature space rather than the image space as in general vision transformers. However, the CNN feature maps still maintain the spatial relationship and we utilize this property to design self-supervised learning approaches to train the encoder of object detection transformers in pretraining and multi-task learning settings. We explore common self-supervised methods based on image reconstruction, masked image modeling and jigsaw. Preliminary experiments in the iSAID dataset demonstrate faster convergence of DETR in the initial epochs in both pretraining and multi-task learning settings; nonetheless, similar improvement is not observed in the case of multi-task learning with Deformable DETR. The code for our experiments with DETR and Deformable DETR are available at https://github.com/gokulkarthik/detr and https://github.com/gokulkarthik/Deformable-DETR respectively.
    Detecting Emerging Technologies and their Evolution using Deep Learning and Weak Signal Analysis. (arXiv:2205.05449v1 [cs.AI])
    Emerging technologies can have major economic impacts and affect strategic stability. Yet, early identification of emerging technologies remains challenging. In order to identify emerging technologies in a timely and reliable manner, a comprehensive examination of relevant scientific and technological (S&T) trends and their related references is required. This examination is generally done by domain experts and requires significant amounts of time and effort to gain insights. The use of domain experts to identify emerging technologies from S&T trends may limit the capacity to analyse large volumes of information and introduce subjectivity in the assessments. Decision support systems are required to provide accurate and reliable evidence-based indicators through constant and continuous monitoring of the environment and help identify signals of emerging technologies that could alter security and economic prosperity. For example, the research field of hypersonics has recently witnessed several advancements having profound technological, commercial, and national security implications. In this work, we present a multi-layer quantitative approach able to identify future signs from scientific publications on hypersonics by leveraging deep learning and weak signal analysis. The proposed framework can help strategic planners and domain experts better identify and monitor emerging technology trends.
    A Framework for Machine Learning of Model Error in Dynamical Systems. (arXiv:2107.06658v2 [math.DS] UPDATED)
The development of data-informed predictive models for dynamical systems is of widespread interest in many disciplines. We present a unifying framework for blending mechanistic and machine-learning approaches to identify dynamical systems from noisy and partially observed data. We compare pure data-driven learning with hybrid models which incorporate imperfect domain knowledge. Our formulation is agnostic to the chosen machine learning model, is presented in both continuous- and discrete-time settings, and is compatible both with model errors that exhibit substantial memory and errors that are memoryless. First, we study memoryless linear (w.r.t. parametric-dependence) model error from a learning theory perspective, defining excess risk and generalization error. For ergodic continuous-time systems, we prove that both excess risk and generalization error are bounded above by terms that diminish with the square-root of T, the time-interval over which training data is specified. Secondly, we study scenarios that benefit from modeling with memory, proving universal approximation theorems for two classes of continuous-time recurrent neural networks (RNNs): both can learn memory-dependent model error. In addition, we connect one class of RNNs to reservoir computing, thereby relating learning of memory-dependent error to recent work on supervised learning between Banach spaces using random features. Numerical results are presented (Lorenz '63, Lorenz '96 Multiscale systems) to compare purely data-driven and hybrid approaches, finding hybrid methods less data-hungry and more parametrically efficient. Finally, we demonstrate numerically how data assimilation can be leveraged to learn hidden dynamics from noisy, partially-observed data, and illustrate challenges in representing memory by this approach, and in the training of such models.
    A globally convergent fast iterative shrinkage-thresholding algorithm with a new momentum factor for single and multi-objective convex optimization. (arXiv:2205.05262v1 [math.OC])
    Convex-composite optimization, which minimizes an objective function represented by the sum of a differentiable function and a convex one, is widely used in machine learning and signal/image processing. Fast Iterative Shrinkage Thresholding Algorithm (FISTA) is a typical method for solving this problem and has a global convergence rate of $O(1 / k^2)$. Recently, this has been extended to multi-objective optimization, together with the proof of the $O(1 / k^2)$ global convergence rate. However, its momentum factor is classical, and the convergence of its iterates has not been proven. In this work, introducing some additional hyperparameters $(a, b)$, we propose another accelerated proximal gradient method with a general momentum factor, which is new even for the single-objective cases. We show that our proposed method also has a global convergence rate of $O(1/k^2)$ for any $(a,b)$, and further that the generated sequence of iterates converges to a weak Pareto solution when $a$ is positive, an essential property for the finite-time manifold identification. Moreover, we report numerical results with various $(a,b)$, showing that some of these choices give better results than the classical momentum factors.
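For readers who want the baseline spelled out, the classical FISTA iteration with its standard momentum factor (for minimizing $f + g$ with $f$ smooth and proximal step size $\gamma$) is:

```latex
% Classical FISTA (Beck & Teboulle), minimizing f(x) + g(x):
\begin{aligned}
  x_k     &= \operatorname{prox}_{\gamma g}\bigl(y_k - \gamma \nabla f(y_k)\bigr), \\
  t_{k+1} &= \frac{1 + \sqrt{1 + 4 t_k^2}}{2}, \qquad t_1 = 1, \\
  y_{k+1} &= x_k + \frac{t_k - 1}{t_{k+1}}\,\bigl(x_k - x_{k-1}\bigr).
\end{aligned}
```

The paper's method generalizes this momentum factor via the additional hyperparameters $(a, b)$.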
    Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. (arXiv:2205.05628v1 [cs.CR])
With the increasing prevalence of encrypted network traffic, cyber security analysts have been turning to machine learning (ML) techniques to elucidate the traffic on their networks. However, ML models can become stale as known traffic features can shift between networks and as new traffic emerges that is outside of the distribution of the training set. In order to reliably adapt in this dynamic environment, ML models must additionally provide contextualized uncertainty quantification to their predictions, which has received little attention in the cyber security domain. Uncertainty quantification is necessary both to signal when the model is uncertain about which class to choose in its label assignment and when the traffic is not likely to belong to any pre-trained classes. We present a new, public dataset of network traffic that includes labeled, Virtual Private Network (VPN)-encrypted network traffic generated by 10 applications and corresponding to 5 application categories. We also present an ML framework that is designed to rapidly train with modest data requirements and provide both calibrated, predictive probabilities as well as an interpretable "out-of-distribution" (OOD) score to flag novel traffic samples. We describe how to compute a calibrated OOD score from p-values of the so-called relative Mahalanobis distance. We demonstrate that our framework achieves an F1 score of 0.98 on our dataset and that it can extend to an enterprise network by testing the model: (1) on data from similar applications, (2) on dissimilar application traffic from an existing category, and (3) on application traffic from a new category. The model correctly flags uncertain traffic and, upon retraining, accurately incorporates the new data. We additionally demonstrate good performance (F1 score of 0.97) when packet sizes are made to be uniform, as occurs for certain encryption protocols.
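A sketch of the relative Mahalanobis distance underlying the OOD score, as defined in prior OOD-detection work (the paper additionally converts these distances to p-values for calibration); the data and class structure below are synthetic illustrations:

```python
import numpy as np

def fit_gaussians(X, y):
    """Fit per-class Gaussians (shared covariance) and a background Gaussian."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    cov = sum(np.cov(X[y == c].T) * (y == c).sum() for c in classes) / len(y)
    mu0, cov0 = X.mean(axis=0), np.cov(X.T)
    return means, np.linalg.inv(cov), mu0, np.linalg.inv(cov0)

def relative_mahalanobis(x, means, prec, mu0, prec0):
    """RMD = min_k MD_k(x) - MD_0(x); larger values suggest OOD traffic."""
    md_k = min(float((x - m) @ prec @ (x - m)) for m in means.values())
    md_0 = float((x - mu0) @ prec0 @ (x - mu0))
    return md_k - md_0

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)
params = fit_gaussians(X, y)
print(relative_mahalanobis(rng.normal(10, 1, 4), *params))  # far from both classes: high score
```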
    Automatic Tuberculosis and COVID-19 cough classification using deep learning. (arXiv:2205.05480v1 [cs.LG])
We present a deep learning based automatic cough classifier which can discriminate tuberculosis (TB) coughs from COVID-19 coughs and healthy coughs. Both TB and COVID-19 are respiratory diseases, have cough as a predominant symptom, and claim thousands of lives each year. The cough audio recordings were collected in both indoor and outdoor settings and also uploaded using smartphones by subjects around the globe, and thus contain various levels of noise. This cough data includes 1.68 hours of TB coughs, 18.54 minutes of COVID-19 coughs and 1.69 hours of healthy coughs from 47 TB patients, 229 COVID-19 patients and 1498 healthy patients, and was used to train and evaluate a CNN, an LSTM and a Resnet50. These three deep architectures were also pre-trained on 2.14 hours of sneeze, 2.91 hours of speech and 2.79 hours of noise for improved performance. The class imbalance in our dataset was addressed by using the SMOTE data balancing technique and performance metrics such as F1-score and AUC. Our study shows that the highest F1-scores of 0.9259 and 0.8631 were achieved by a pre-trained Resnet50 for two-class (TB vs COVID-19) and three-class (TB vs COVID-19 vs healthy) cough classification tasks, respectively. The application of deep transfer learning improved the classifiers' performance and made them more robust, as they generalise better over the cross-validation folds. Their performance exceeds the TB triage test requirements set by the World Health Organisation (WHO). The features producing the best performance contain higher-order MFCCs, suggesting that the differences between TB and COVID-19 coughs are not perceivable by the human ear. This type of cough audio classification is non-contact, cost-effective and can easily be deployed on a smartphone, making it an excellent tool for both TB and COVID-19 screening.
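On the feature side, MFCCs of the kind found most discriminative here are commonly extracted along these lines (a sketch using librosa; the file path and parameter values are placeholders):

```python
import numpy as np
import librosa

# Load a cough recording (path is a placeholder).
signal, sr = librosa.load("cough.wav", sr=16000)

# Extract 26 MFCCs per frame; higher-order coefficients of the kind the
# study found discriminative correspond to the later rows of this matrix.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=26)

# Summarize frames into one fixed-length vector per recording.
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(features.shape)   # (52,)
```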
    FairNeuron: Improving Deep Neural Network Fairness with Adversary Games on Selective Neurons. (arXiv:2204.02567v2 [cs.LG] UPDATED)
With Deep Neural Networks (DNNs) being integrated into a growing number of critical systems with far-reaching impacts on society, there are increasing concerns about their ethical performance, such as fairness. Unfortunately, model fairness and accuracy are in many cases contradictory goals to optimize. To address this issue, a number of works have tried to improve model fairness by using an adversarial game at the model level. This approach introduces an adversary that evaluates the fairness of a model besides its prediction accuracy on the main task, and performs joint optimization to achieve a balanced result. In this paper, we notice that when performing backpropagation-based training, this contradictory phenomenon also shows up at the individual neuron level. Based on this observation, we propose FairNeuron, a DNN model automatic repairing tool, to mitigate fairness concerns and balance the accuracy-fairness trade-off without introducing another model. It works by detecting neurons with contradictory optimization directions from the accuracy and fairness training goals, and achieving a trade-off by selective dropout. Compared with state-of-the-art methods, our approach is lightweight, making it scalable and more efficient. Our evaluation on 3 datasets shows that FairNeuron can effectively improve all models' fairness while maintaining a stable utility.
    Subspace Learning Machine (SLM): Methodology and Performance. (arXiv:2205.05296v1 [cs.LG])
    Inspired by the feedforward multilayer perceptron (FF-MLP), decision tree (DT) and extreme learning machine (ELM), a new classification model, called the subspace learning machine (SLM), is proposed in this work. SLM first identifies a discriminant subspace, $S^0$, by examining the discriminant power of each input feature. Then, it uses probabilistic projections of features in $S^0$ to yield 1D subspaces and finds the optimal partition for each of them. This is equivalent to partitioning $S^0$ with hyperplanes. A criterion is developed to choose the best $q$ partitions that yield $2q$ partitioned subspaces among them. We assign $S^0$ to the root node of a decision tree and the intersections of $2q$ subspaces to its child nodes of depth one. The partitioning process is recursively applied at each child node to build an SLM tree. When the samples at a child node are sufficiently pure, the partitioning process stops and each leaf node makes a prediction. The idea can be generalized to regression, leading to the subspace learning regressor (SLR). Furthermore, ensembles of SLM/SLR trees can yield a stronger predictor. Extensive experiments are conducted for performance benchmarking among SLM/SLR trees, ensembles and classical classifiers/regressors.
    Robust Data-Driven Output Feedback Control via Bootstrapped Multiplicative Noise. (arXiv:2205.05119v1 [eess.SY])
    We propose a robust data-driven output feedback control algorithm that explicitly incorporates inherent finite-sample model estimate uncertainties into the control design. The algorithm has three components: (1) a subspace identification nominal model estimator; (2) a bootstrap resampling method that quantifies non-asymptotic variance of the nominal model estimate; and (3) a non-conventional robust control design method comprising a coupled optimal dynamic output feedback filter and controller with multiplicative noise. A key advantage of the proposed approach is that the system identification and robust control design procedures both use stochastic uncertainty representations, so that the actual inherent statistical estimation uncertainty directly aligns with the uncertainty the robust controller is being designed against. Moreover, the control design method accommodates a highly structured uncertainty representation that can capture uncertainty shape more effectively than existing approaches. We show through numerical experiments that the proposed robust data-driven output feedback controller can significantly outperform a certainty equivalent controller on various measures of sample complexity and stability robustness.
    Ranked Prioritization of Groups in Combinatorial Bandit Allocation. (arXiv:2205.05659v1 [cs.AI])
    Preventing poaching through ranger patrols protects endangered wildlife, directly contributing to the UN Sustainable Development Goal 15 of life on land. Combinatorial bandits have been used to allocate limited patrol resources, but existing approaches overlook the fact that each location is home to multiple species in varying proportions, so a patrol benefits each species to differing degrees. When some species are more vulnerable, we ought to offer more protection to these animals; unfortunately, existing combinatorial bandit approaches do not offer a way to prioritize important species. To bridge this gap, (1) we propose a novel combinatorial bandit objective that trades off reward maximization against prioritization over species, which we call ranked prioritization. We show this objective can be expressed as a weighted linear sum of Lipschitz-continuous reward functions. (2) We provide RankedCUCB, an algorithm to select combinatorial actions that optimize our prioritization-based objective, and prove that it achieves asymptotic no-regret. (3) We demonstrate empirically that RankedCUCB leads to up to 38% improvement in outcomes for endangered species using real-world wildlife conservation data. Along with adapting to other challenges such as preventing illegal logging and overfishing, our no-regret algorithm addresses the general combinatorial bandit problem with a weighted linear objective.
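    A minimal sketch of a combinatorial UCB selection step with a weighted linear objective, in the spirit of RankedCUCB; the species-priority weights, the bonus term, and the Bernoulli rewards are illustrative assumptions rather than the paper's exact algorithm.

        import numpy as np

        def ucb_select(counts, means, t, weights, k):
            # Optimism bonus per location; weights encode species prioritization.
            bonus = np.sqrt(2 * np.log(max(t, 2)) / np.maximum(counts, 1))
            scores = weights * (means + bonus)   # weighted linear objective
            return np.argsort(scores)[-k:]       # choose k locations to patrol

        rng = np.random.default_rng(0)
        n, k, T = 10, 3, 1000
        weights = rng.uniform(0.5, 2.0, n)       # hypothetical priority weights
        true_means = rng.uniform(0, 1, n)
        counts, means = np.zeros(n), np.zeros(n)
        for t in range(1, T + 1):
            for a in ucb_select(counts, means, t, weights, k):
                r = rng.binomial(1, true_means[a])     # simulated patrol outcome
                counts[a] += 1
                means[a] += (r - means[a]) / counts[a]  # running mean update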
    Stable and Interpretable Unrolled Dictionary Learning. (arXiv:2106.00058v4 [cs.LG] UPDATED)
    The dictionary learning problem, representing data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and a rich literature has studied its theoretical convergence. The success of dictionary learning relies on access to a ``good'' initial estimate of the dictionary and the ability of the sparse coding step to provide an unbiased estimate of the code. The growing popularity of unrolled sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. We offer the first theoretical analysis of these empirical results through PUDLE, a Provable Unrolled Dictionary LEarning method. We provide conditions on the network initialization and data distribution sufficient to recover and preserve the support of the latent sparse representation. Additionally, we address two challenges; first, the vanilla unrolled sparse coding computes a biased code estimate, and second, gradients during backpropagated learning can become unstable. We show approaches to reduce the bias of the code estimate in the forward pass, and that of the dictionary estimate in the backward pass. We propose strategies to resolve the learning instability. This is achieved by tuning network parameters and modifying the loss function. Overall, we highlight the impact of loss, unrolling, and backpropagation on convergence. We complement our findings through synthetic and image denoising experiments. Finally, we demonstrate PUDLE's interpretability, a driving factor in designing deep networks based on iterative optimizations, by building a mathematical relation between network weights, its output, and the training set.
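    A minimal sketch of unrolled sparse coding trained by backpropagation, the setting PUDLE analyzes; the step size, unrolling depth, and plain reconstruction loss are illustrative assumptions, not the paper's exact formulation.

        import torch

        def soft_threshold(z, lam):
            return torch.sign(z) * torch.clamp(z.abs() - lam, min=0.0)

        def unrolled_ista(x, D, lam=0.1, n_iters=20):
            # x: (batch, m) signals; D: (m, p) dictionary.
            L = torch.linalg.matrix_norm(D, ord=2) ** 2   # Lipschitz const. of gradient
            z = torch.zeros(x.shape[0], D.shape[1])
            for _ in range(n_iters):                      # unrolled ISTA iterations
                z = soft_threshold(z + (x - z @ D.T) @ D / L, lam / L)
            return z

        m, p = 32, 64
        D = torch.randn(m, p, requires_grad=True)
        opt = torch.optim.Adam([D], lr=1e-2)
        x = torch.randn(128, m)                 # stand-in training batch
        z = unrolled_ista(x, D)
        loss = ((x - z @ D.T) ** 2).mean()      # reconstruction loss
        loss.backward()                         # dictionary learning via backprop
        opt.step()                              # (column renormalization omitted)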
    Analysis of convolutional neural network image classifiers in a rotationally symmetric model. (arXiv:2205.05500v1 [stat.ML])
    Convolutional neural network image classifiers are defined and the rate of convergence of the misclassification risk of the estimates towards the optimal misclassification risk is analyzed. Here we consider images as random variables with values in some functional space, where we only observe discrete samples as function values on some finite grid. Under suitable structural and smoothness assumptions on the functional a posteriori probability, which includes some kind of symmetry against rotation of subparts of the input image, it is shown that least squares plug-in classifiers based on convolutional neural networks are able to circumvent the curse of dimensionality in binary image classification if we neglect a resolution-dependent error term. The finite sample size behavior of the classifier is analyzed by applying it to simulated and real data.
    Contextual Search in the Presence of Adversarial Corruptions. (arXiv:2002.11650v5 [cs.LG] UPDATED)
    We study contextual search, a generalization of binary search in higher dimensions, which captures settings such as feature-based dynamic pricing. Standard formulations of this problem assume that agents act in accordance with a specific homogeneous response model. In practice, however, some responses may be adversarially corrupted. Existing algorithms heavily depend on the assumed response model being (approximately) accurate for all agents and have poor performance in the presence of even a few such arbitrary misspecifications. We initiate the study of contextual search when some of the agents can behave in ways inconsistent with the underlying response model. In particular, we provide two algorithms, one based on multidimensional binary search methods and one based on gradient descent. We show that these algorithms attain near-optimal regret in the absence of adversarial corruptions and their performance degrades gracefully with the number of such agents, providing the first results for contextual search in any adversarial noise model. Our techniques draw inspiration from learning theory, game theory, high-dimensional geometry, and convex analysis.
    Autoencoder Attractors for Uncertainty Estimation. (arXiv:2204.00382v2 [cs.LG] UPDATED)
    The reliability assessment of a machine learning model's prediction is an important quantity for the deployment in safety critical applications. Not only can it be used to detect novel scenarios, either as out-of-distribution or anomalous samples, but it also helps to determine deficiencies in the training data distribution. Much promising research has either built on traditional methods like Gaussian processes or extended deep learning based approaches, for example, by interpreting them from a Bayesian point of view. In this work we propose a novel approach for uncertainty estimation based on autoencoder models: the recursive application of a previously trained autoencoder model can be interpreted as a dynamical system storing training examples as attractors. While input images close to known samples will converge to the same or similar attractor, input samples containing unknown features are unstable and converge to different training samples by potentially removing or changing characteristic features. The use of dropout during training and inference leads to a family of similar dynamical systems, each one being robust on samples close to the training distribution but unstable on new features. Either the model reliably removes these features or the resulting instability can be exploited to detect problematic input samples. We evaluate our approach on several dataset combinations as well as on an industrial application for occupant classification in the vehicle interior, for which we additionally release a new synthetic dataset.
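    A minimal sketch of the attractor idea, assuming autoencoder is any trained model mapping inputs to reconstructions of the same shape: re-encode an input repeatedly and use the residual change between iterates as an instability score for unfamiliar samples.

        import torch

        def attractor_score(autoencoder, x, n_steps=20):
            # Larger scores indicate iterates that fail to settle on an attractor.
            autoencoder.eval()
            with torch.no_grad():
                for _ in range(n_steps):
                    x = autoencoder(x)                   # recursive application
                residual = (autoencoder(x) - x).flatten(1)
            return residual.norm(dim=1)                  # one score per sample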
    Accelerated Reinforcement Learning for Temporal Logic Control Objectives. (arXiv:2205.04424v2 [cs.RO] UPDATED)
    This paper addresses the problem of learning control policies for mobile robots modeled as unknown Markov Decision Processes (MDPs) that are tasked with temporal logic missions, such as sequencing, coverage, or surveillance. The MDP captures uncertainty in the workspace structure and the outcomes of control decisions. The control objective is to synthesize a control policy that maximizes the probability of accomplishing a high-level task, specified as a Linear Temporal Logic (LTL) formula. To address this problem, we propose a novel accelerated model-based reinforcement learning (RL) algorithm for LTL control objectives that is capable of learning control policies significantly faster than related approaches. Its sample-efficiency relies on biasing exploration towards directions that may contribute to task satisfaction. This is accomplished by leveraging an automaton representation of the LTL task as well as a continuously learned MDP model. Finally, we provide extensive comparative experiments that demonstrate the sample efficiency of the proposed method against recent temporal logic RL methods.
    Machine Learning to Support Triage of Children at Risk for Epileptic Seizures in the Pediatric Intensive Care Unit. (arXiv:2205.05389v1 [cs.LG])
    Objective: Epileptic seizures are relatively common in critically-ill children admitted to the pediatric intensive care unit (PICU) and thus serve as an important target for identification and treatment. Most of these seizures have no discernible clinical manifestation but still have a significant impact on morbidity and mortality. Children that are deemed at risk for seizures within the PICU are monitored using continuous electroencephalogram (cEEG). cEEG monitoring cost is considerable and, as the number of available machines is always limited, clinicians need to resort to triaging patients according to perceived risk in order to allocate resources. This research aims to develop a computer-aided tool to improve seizure risk assessment in critically-ill children, using a ubiquitously recorded signal in the PICU, namely the electrocardiogram (ECG). Approach: A novel data-driven model was developed using a patient-level approach, based on features extracted from the first hour of ECG recording and the clinical data of the patient. Main results: The most predictive features were the age of the patient, brain injury as the coma etiology, and the QRS area. For patients without any prior clinical data, using one hour of ECG recording, the classification performance of the random forest classifier reached an area under the receiver operating characteristic curve (AUROC) score of 0.84. When combining ECG features with the patient's clinical history, the AUROC reached 0.87. Significance: Taking a real clinical scenario, we estimated that our clinical decision support triage tool can improve the positive predictive value by more than 59% over the clinical standard.
    An Inexact Augmented Lagrangian Algorithm for Training Leaky ReLU Neural Network with Group Sparsity. (arXiv:2205.05428v1 [math.OC])
    The leaky ReLU network with a group sparse regularization term has been widely used in recent years. However, training such a network yields a nonsmooth nonconvex optimization problem, and approaches for deterministically computing a stationary point have been lacking. In this paper, we first resolve the multi-layer composite term in the original optimization problem by introducing auxiliary variables and additional constraints. We show the new model has a nonempty and bounded solution set and its feasible set satisfies the Mangasarian-Fromovitz constraint qualification. Moreover, we show the relationship between the new model and the original problem. We then propose an inexact augmented Lagrangian algorithm for solving the new model and show the convergence of the algorithm to a KKT point. Numerical experiments demonstrate that our algorithm is more efficient for training sparse leaky ReLU neural networks than some well-known algorithms.
    Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models. (arXiv:2112.00029v2 [cs.LG] UPDATED)
    Overparameterized neural networks generalize well but are expensive to train. Ideally, one would like to reduce their computational cost while retaining their generalization benefits. Sparse model training is a simple and promising approach to achieve this, but there remain challenges as existing methods struggle with accuracy loss, slow training runtime, or difficulty in sparsifying all model components. The core problem is that searching for a sparsity mask over a discrete set of sparse matrices is difficult and expensive. To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices. As butterfly matrices are not hardware efficient, we propose simple variants of butterfly (block and flat) to take advantage of modern hardware. Our method (Pixelated Butterfly) uses a simple fixed sparsity pattern based on flat block butterfly and low-rank matrices to sparsify most network layers (e.g., attention, MLP). We empirically validate that Pixelated Butterfly is 3x faster than butterfly and speeds up training to achieve favorable accuracy--efficiency tradeoffs. On the ImageNet classification and WikiText-103 language modeling tasks, our sparse models train up to 2.5x faster than the dense MLP-Mixer, Vision Transformer, and GPT-2 medium with no drop in accuracy.
    Hitting time for Markov decision process. (arXiv:2205.03476v2 [cs.LG] UPDATED)
    We define the hitting time for a Markov decision process (MDP). We do not use the hitting time of the Markov process induced by the MDP, because the induced chain may not have a stationary distribution. Even if it has a stationary distribution, that stationary distribution may not coincide with the (normalized) occupancy measure of the MDP. We observe a relationship between the MDP and PageRank. Using this observation, we construct a Markov process (MP) whose stationary distribution coincides with the normalized occupancy measure of the MDP, and we define the hitting time of the MDP as the hitting time of the associated MP.
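    A small numerical check of this construction under the common discounted-occupancy setting (the discount factor and restart distribution below are illustrative assumptions): the PageRank-style chain $M = \gamma P + (1-\gamma)\mathbf{1}\mu_0^\top$ has stationary distribution $(1-\gamma)\mu_0^\top(I-\gamma P)^{-1}$, the normalized occupancy measure.

        import numpy as np

        rng = np.random.default_rng(0)
        n, gamma = 5, 0.9
        P = rng.dirichlet(np.ones(n), size=n)    # policy-induced transitions, rows sum to 1
        mu0 = rng.dirichlet(np.ones(n))          # initial-state distribution

        M = gamma * P + (1 - gamma) * np.outer(np.ones(n), mu0)  # restart chain
        vals, vecs = np.linalg.eig(M.T)          # stationary distribution of M
        pi = np.real(vecs[:, np.argmax(np.real(vals))])
        pi /= pi.sum()
        # Normalized discounted occupancy measure, for comparison:
        d = (1 - gamma) * mu0 @ np.linalg.inv(np.eye(n) - gamma * P)
        assert np.allclose(pi, d)                # the two distributions coincide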
    Reducing Activation Recomputation in Large Transformer Models. (arXiv:2205.05198v1 [cs.LG])
    Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation. Our implementation will be available in both Megatron-LM and NeMo-Megatron.
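    A minimal sketch of the selective idea using PyTorch's generic checkpoint utility; the paper's techniques target Megatron-LM/NeMo-Megatron with tensor and sequence parallelism, so this only illustrates recomputing the memory-hungry part of a block instead of the whole block.

        import torch
        from torch.utils.checkpoint import checkpoint

        class Block(torch.nn.Module):
            def __init__(self, d):
                super().__init__()
                self.attn = torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)
                self.mlp = torch.nn.Sequential(torch.nn.Linear(d, 4 * d),
                                               torch.nn.GELU(),
                                               torch.nn.Linear(4 * d, d))

            def forward(self, x):
                # Only the attention activations are discarded and recomputed
                # during the backward pass; MLP activations are stored as usual.
                def attn_fn(y):
                    return self.attn(y, y, y, need_weights=False)[0]
                x = x + checkpoint(attn_fn, x, use_reentrant=False)
                return x + self.mlp(x)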
    Pre-trained Language Models as Re-Annotators. (arXiv:2205.05368v1 [cs.CL])
    Annotation noise is widespread in datasets, but manually revising a flawed corpus is time-consuming and error-prone. Hence, given the prior knowledge in Pre-trained Language Models and the expected uniformity across all annotations, we attempt to reduce annotation noise in the corpus automatically through two tasks: (1) Annotation Inconsistency Detection, which indicates the credibility of annotations, and (2) Annotation Error Correction, which rectifies abnormal annotations. We investigate how to acquire semantically sensitive annotation representations from Pre-trained Language Models, expecting examples with identical annotations to be embedded in mutually adjacent positions even without fine-tuning. We propose a novel credibility score to reveal the likelihood of annotation inconsistencies based on neighbouring consistency. Then, we fine-tune the Pre-trained Language Model based classifier with cross-validation for annotation correction. The annotation corrector is further elaborated with two approaches: (1) soft labelling by Kernel Density Estimation and (2) a novel distant-peer contrastive loss. We study re-annotation in relation extraction and create a new manually revised dataset, Re-DocRED, for evaluating document-level re-annotation. The proposed credibility scores show promising agreement with human revisions, achieving a Binary F1 of 93.4 and 72.5 in detecting inconsistencies on TACRED and DocRED respectively. Moreover, the neighbour-aware classifiers based on distant-peer contrastive learning and uncertain labels achieve Macro F1 up to 66.2 and 57.8 in correcting annotations on TACRED and DocRED respectively. These improvements are not merely theoretical: rather, automatically denoised training sets demonstrate up to 3.6% performance improvement for state-of-the-art relation extraction models.
    The First Optimal Algorithm for Smooth and Strongly-Convex-Strongly-Concave Minimax Optimization. (arXiv:2205.05653v1 [math.OC])
    In this paper, we revisit the smooth and strongly-convex-strongly-concave minimax optimization problem. Zhang et al. (2021) and Ibrahim et al. (2020) established the lower bound $\Omega\left(\sqrt{\kappa_x\kappa_y} \log \frac{1}{\epsilon}\right)$ on the number of gradient evaluations required to find an $\epsilon$-accurate solution, where $\kappa_x$ and $\kappa_y$ are condition numbers for the strong convexity and strong concavity assumptions. However, the existing state-of-the-art methods do not match this lower bound: algorithms of Lin et al. (2020) and Wang and Li (2020) have gradient evaluation complexity $\mathcal{O}\left( \sqrt{\kappa_x\kappa_y}\log^3\frac{1}{\epsilon}\right)$ and $\mathcal{O}\left( \sqrt{\kappa_x\kappa_y}\log^3 (\kappa_x\kappa_y)\log\frac{1}{\epsilon}\right)$, respectively. We fix this fundamental issue by providing the first algorithm with $\mathcal{O}\left(\sqrt{\kappa_x\kappa_y}\log\frac{1}{\epsilon}\right)$ gradient evaluation complexity. We design our algorithm in three steps: (i) we reformulate the original problem as a minimization problem via the pointwise conjugate function; (ii) we apply a specific variant of the proximal point algorithm to the reformulated problem; (iii) we compute the proximal operator inexactly using the optimal algorithm for operator norm reduction in monotone inclusions.
    Truncated Emphatic Temporal Difference Methods for Prediction and Control. (arXiv:2108.05338v2 [cs.LG] UPDATED)
    Emphatic Temporal Difference (TD) methods are a class of off-policy Reinforcement Learning (RL) methods involving the use of followon traces. Despite the theoretical success of emphatic TD methods in addressing the notorious deadly triad of off-policy RL, there are still two open problems. First, followon traces typically suffer from large variance, making them hard to use in practice. Second, though Yu (2015) confirms the asymptotic convergence of some emphatic TD methods for prediction problems, there is still no finite sample analysis for any emphatic TD method for prediction, much less control. In this paper, we address those two open problems simultaneously via using truncated followon traces in emphatic TD methods. Unlike the original followon traces, which depend on all previous history, truncated followon traces depend on only finite history, reducing variance and enabling the finite sample analysis of our proposed emphatic TD methods for both prediction and control.
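    A minimal sketch of an emphatic TD(0) update with a truncated followon trace; the constant interest $i_t = 1$, the truncation length, and the linear features are illustrative assumptions rather than the paper's exact construction.

        from collections import deque
        import numpy as np

        def truncated_followon(recent_rhos, gamma):
            # F_t^(n) = sum_{k=0..n} prod_{j=1..k} gamma * rho_{t-j},
            # using only the last n importance-sampling ratios.
            F, prod = 1.0, 1.0
            for rho in reversed(recent_rhos):    # most recent ratio first
                prod *= gamma * rho
                F += prod
            return F

        def etd0_step(w, x, r, x_next, rho, recent_rhos, alpha=0.05, gamma=0.95):
            F = truncated_followon(recent_rhos, gamma)
            delta = r + gamma * w @ x_next - w @ x   # TD error
            w += alpha * F * rho * delta * x         # emphasis-weighted update
            recent_rhos.append(rho)                  # deque(maxlen=n) drops old ratios
            return w

        w = np.zeros(4)
        recent_rhos = deque(maxlen=10)               # truncation length n = 10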
    Scream Detection in Heavy Metal Music. (arXiv:2205.05580v1 [cs.SD])
    Harsh vocal effects such as screams or growls are far more common in heavy metal vocals than traditional sung vocals. This paper explores the problem of detection and classification of extreme vocal techniques in heavy metal music, specifically the identification of different scream techniques. We investigate the suitability of various feature representations, including cepstral, spectral, and temporal features, as input representations for classification. The main contributions of this work are (i) a manually annotated dataset comprising over 280 minutes of heavy metal songs of various genres with a statistical analysis of occurrences of different extreme vocal techniques in heavy metal music, and (ii) a systematic study of different input feature representations for the classification of heavy metal vocals.
    A Survey on Fairness for Machine Learning on Graphs. (arXiv:2205.05396v1 [cs.LG])
    Nowadays, the analysis of complex phenomena modeled by graphs plays a crucial role in many real-world application domains where decisions can have a strong societal impact. However, numerous studies and papers have recently revealed that machine learning models could lead to potential disparate treatment between individuals and unfair outcomes. In that context, algorithmic contributions for graph mining are not spared by the problem of fairness and present some specific challenges related to the intrinsic nature of graphs: (1) graph data are non-IID, an assumption that may invalidate many existing studies in fair machine learning; (2) suitable metric definitions are needed to assess the different types of fairness with relational data; and (3) finding a good trade-off between model accuracy and fairness is algorithmically challenging. This survey is the first one dedicated to fairness for relational data. It aims to present a comprehensive review of state-of-the-art techniques in fairness on graph mining and to identify the open challenges and future trends. In particular, we start by presenting several sensitive application domains and the associated graph mining tasks, with a focus on edge prediction and node classification. We also recall the different metrics proposed to evaluate potential bias at different levels of the graph mining process; then we provide a comprehensive overview of recent contributions in the domain of fair machine learning for graphs, which we classify into pre-processing, in-processing and post-processing models. We also describe existing graph data as well as synthetic and real-world benchmarks. Finally, we present in detail five potential promising directions to advance research in studying algorithmic fairness on graphs.
    Symphony Generation with Permutation Invariant Language Model. (arXiv:2205.05448v1 [cs.SD])
    In this work, we present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model. To bridge the gap between text generation and the symphony generation task, we propose a novel Multi-track Multi-instrument Repeatable (MMR) representation with a particular 3-D positional embedding and a modified Byte Pair Encoding algorithm (Music BPE) for music tokens. A novel linear transformer decoder architecture is introduced as a backbone for modeling extra-long sequences of symphony tokens. Meanwhile, we train the decoder to learn automatic orchestration as a joint task by masking instrument information from the input. We also introduce a large-scale symbolic symphony dataset to advance symphony generation research. Our empirical results show that our proposed approach can generate coherent, novel, complex and harmonious symphonies comparable to human compositions, making it a pioneering solution for multi-track multi-instrument symbolic music generation.
    Quantification of Actual Road User Behavior on the Basis of Given Traffic Rules. (arXiv:2202.09269v3 [cs.RO] UPDATED)
    Driving on roads is restricted by various traffic rules, aiming to ensure safety for all traffic participants. However, human road users usually do not adhere to these rules strictly, resulting in varying degrees of rule conformity. Such deviations from given rules are key components of today's road traffic. In autonomous driving, robotic agents can disturb traffic flow, when rule deviations are not taken into account. In this paper, we present an approach to derive the distribution of degrees of rule conformity from human driving data. We demonstrate our method with the Waymo Open Motion dataset and Safety Distance and Speed Limit rules.
    Re-evaluating Word Mover's Distance. (arXiv:2105.14403v2 [cs.LG] UPDATED)
    The word mover's distance (WMD) is a fundamental technique for measuring the similarity of two documents. The crux of WMD is that it can take advantage of the underlying geometry of the word space by employing an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins in various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performances of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we employ an appropriate preprocessing, i.e., L1 normalization. In addition, we introduce an analogy between WMD and L1-normalized BOW and find that not only the performance of WMD but also its distance values resemble those of BOW in high-dimensional spaces.
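    A minimal sketch of the L1-normalized BOW baseline; the three toy documents are stand-ins, and the pairwise L1 distance is one natural choice for comparing the normalized histograms.

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.preprocessing import normalize

        docs = ["obama speaks to the media in illinois",
                "the president greets the press in chicago",
                "a gentle stroll along the beach"]
        bow = CountVectorizer().fit_transform(docs).toarray().astype(float)
        bow = normalize(bow, norm="l1")       # the crucial preprocessing step

        # Pairwise L1 distances between normalized BOW vectors:
        dists = np.abs(bow[:, None, :] - bow[None, :, :]).sum(-1)
        print(dists.round(2))                 # documents 0 and 1 come out closest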
    A Ubiquitous Unifying Degeneracy in Two-Body Microlensing Systems. (arXiv:2111.13696v2 [astro-ph.EP] UPDATED)
    While gravitational microlensing by planetary systems provides unique vistas on the properties of exoplanets, observations of a given 2-body microlensing event can often be interpreted with multiple distinct physical configurations. Such ambiguities are typically attributed to the close-wide and inner-outer types of degeneracies that arise from transformation invariances and symmetries of microlensing caustics. However, there remain unexplained inconsistencies between the aforementioned theories and observations. Here, leveraging a fast machine learning inference framework, we present the discovery of the offset degeneracy, which concerns a magnification-matching behaviour on the lens-axis and is formulated independently of caustics. This offset degeneracy unifies the close-wide and inner-outer degeneracies, generalises to resonant topologies, and upon reanalysis, not only appears ubiquitous in previously published planetary events with 2-fold degenerate solutions, but also resolves prior inconsistencies. Our analysis demonstrates that degenerate caustics do not strictly result in degenerate magnifications and that the commonly invoked close-wide degeneracy essentially never arises in actual events. Moreover, it is shown that parameters in offset degenerate configurations are related by a simple expression. This suggests the existence of a deeper symmetry in the equations governing 2-body lenses than previously recognised.
    Influence-Driven Data Poisoning in Graph-Based Semi-Supervised Classifiers. (arXiv:2012.07381v2 [cs.LG] UPDATED)
    Graph-based Semi-Supervised Learning (GSSL) is a practical solution to learn from a limited amount of labelled data together with a vast amount of unlabelled data. However, due to their reliance on the known labels to infer the unknown labels, these algorithms are sensitive to data quality. It is therefore essential to study the potential threats related to the labelled data, more specifically, label poisoning. In this paper, we propose a novel data poisoning method which efficiently approximates the result of label inference to identify the inputs which, if poisoned, would produce the highest number of incorrectly inferred labels. We extensively evaluate our approach on three classification problems under 24 different experimental settings each. Compared to the state of the art, our influence-driven attack increases the error rate by 50\% more on average, while being faster by multiple orders of magnitude. Moreover, our method can inform engineers of inputs that deserve investigation (relabelling them) before training the learning model. We show that relabelling one-third of the poisoned inputs (selected based on their influence) reduces the poisoning effect by 50\%.
    Secure Federated Learning for Neuroimaging. (arXiv:2205.05249v1 [cs.LG])
    The amount of biomedical data continues to grow rapidly. However, the ability to collect data from multiple sites for joint analysis remains challenging due to security, privacy, and regulatory concerns. We present a Secure Federated Learning architecture, MetisFL, which enables distributed training of neural networks over multiple data sources without sharing data. Each site trains the neural network over its private data for some time, then shares the neural network parameters (i.e., weights, gradients) with a Federation Controller, which in turn aggregates the local models, sends the resulting community model back to each site, and the process repeats. Our architecture provides strong security and privacy. First, sample data never leaves a site. Second, neural parameters are encrypted before transmission and the community model is computed under fully-homomorphic encryption. Finally, we use information-theoretic methods to limit information leakage from the neural model to prevent a curious site from performing membership attacks. We demonstrate this architecture in neuroimaging. Specifically, we investigate training neural models to classify Alzheimer's disease, and estimate Brain Age, from magnetic resonance imaging datasets distributed across multiple sites, including heterogeneous environments where sites have different amounts of data, statistical distributions, and computational capabilities.
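    A schematic federated round in the spirit described above, assuming plain weighted federated averaging over floating-point parameters; MetisFL's actual API, homomorphic encryption, and leakage controls are not modeled here.

        import copy
        import torch

        def local_update(model, data_loader, epochs=1, lr=0.01):
            # One site's private training pass; data never leaves the site.
            model = copy.deepcopy(model)
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            loss_fn = torch.nn.CrossEntropyLoss()
            for _ in range(epochs):
                for x, y in data_loader:
                    opt.zero_grad()
                    loss_fn(model(x), y).backward()
                    opt.step()
            return model.state_dict(), len(data_loader.dataset)

        def aggregate(states_and_sizes):
            # Controller side: size-weighted average of the site parameters.
            total = sum(n for _, n in states_and_sizes)
            return {k: sum(s[k] * (n / total) for s, n in states_and_sizes)
                    for k in states_and_sizes[0][0]}
            # Load the community model with model.load_state_dict(...) and repeat.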
    Theory and Implementation of Process and Temperature Scalable Shape-based CMOS Analog Circuits. (arXiv:2205.05664v1 [cs.AR])
    Analog computing is attractive compared to its digital counterparts due to its potential for achieving high compute density and energy efficiency. However, device-to-device variability and challenges in porting existing designs to advanced process nodes have posed a major hindrance to harnessing the full potential of analog computations for Machine Learning (ML) applications. This work proposes a novel analog computing framework for designing an analog ML processor similar to that of a digital design - where the designs can be scaled and ported to advanced process nodes without architectural changes. At the core of our work lies shape-based analog computing (S-AC). It utilizes device primitives to yield a robust proto-function from which other non-linear shapes can be derived. The S-AC paradigm also allows the user to trade off computational precision with silicon circuit area and power, thus allowing users to build a truly power-efficient and scalable analog architecture where the same synthesized analog circuit can operate across different biasing regimes of transistors and simultaneously scale across process nodes. As a proof of concept, we show the implementation of commonly used mathematical functions for carrying out standard ML tasks in both planar CMOS 180nm and FinFET 7nm process nodes. The synthesized shape-based ML architecture has been demonstrated for its classification accuracy on standard data sets at different process nodes.
    Generating Annotated Training Data for 6D Object Pose Estimation in Operational Environments with Minimal User Interaction. (arXiv:2103.09696v3 [cs.RO] UPDATED)
    Recently developed deep neural networks achieved state-of-the-art results in the subject of 6D object pose estimation for robot manipulation. However, those supervised deep learning methods require expensive annotated training data. Current methods for reducing those costs frequently use synthetic data from simulations, but rely on expert knowledge and suffer from the "domain gap" when shifting to the real world. Here, we present a proof of concept for a novel approach of autonomously generating annotated training data for 6D object pose estimation. This approach is designed for learning new objects in operational environments while requiring little interaction and no expertise on the part of the user. We evaluate our autonomous data generation approach in two grasping experiments, where we achieve a grasping success rate similar to that of related work on a non-autonomously generated data set.
    Utilizing coarse-grained data in low-data settings for event extraction. (arXiv:2205.05468v1 [cs.CL])
    Annotating text data for event information extraction systems is hard, expensive, and error-prone. We investigate the feasibility of integrating coarse-grained data (document or sentence labels), which is far more feasible to obtain, instead of annotating more documents. We utilize a multi-task model with two auxiliary tasks, document and sentence binary classification, in addition to the main task of token classification. We perform a series of experiments with varying data regimes for the aforementioned integration. Results show that while introducing extra coarse-grained data offers greater improvement and robustness, a gain is still possible with only the addition of negative documents that have no information on any event.
    Towards An Efficient Approach for the Nonconvex $\ell_p$ Ball Projection: Algorithm and Analysis. (arXiv:2101.01350v6 [math.OC] UPDATED)
    This paper primarily focuses on computing the Euclidean projection of a vector onto the $\ell_{p}$ ball in which $p\in(0,1)$. Such a problem emerges as the core building block in statistical machine learning and signal processing tasks because of its ability to promote the sparsity of the desired solution. However, efficient numerical algorithms for finding the projections are still not available, particularly in large-scale optimization. To meet this challenge, we first derive the first-order necessary optimality conditions of this problem. Based on this characterization, we develop a novel numerical approach for computing the stationary point by solving a sequence of projections onto the reweighted $\ell_{1}$-balls. This method is practically simple to implement and computationally efficient. Moreover, the proposed algorithm is shown to converge uniquely under mild conditions and has a worst-case $O(1/\sqrt{k})$ convergence rate. Numerical experiments demonstrate the efficiency of our proposed algorithm.
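    The method above reduces the nonconvex $\ell_p$-ball projection to a sequence of projections onto reweighted $\ell_1$-balls. As a building block, here is a standard sketch of the unweighted inner subproblem, Euclidean projection onto an $\ell_1$ ball (Duchi et al., 2008), which an outer reweighting loop would call repeatedly; the weighted variant used by the paper follows the same thresholding pattern.

        import numpy as np

        def project_l1_ball(v, radius=1.0):
            # Euclidean projection of v onto {x : ||x||_1 <= radius}.
            if np.abs(v).sum() <= radius:
                return v.copy()                      # already inside the ball
            u = np.sort(np.abs(v))[::-1]             # sorted magnitudes, descending
            css = np.cumsum(u)
            rho = np.nonzero(u - (css - radius) / np.arange(1, len(v) + 1) > 0)[0][-1]
            theta = (css[rho] - radius) / (rho + 1)  # soft-threshold level
            return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)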
    Towards Model Agnostic Federated Learning Using Knowledge Distillation. (arXiv:2110.15210v2 [cs.LG] UPDATED)
    Is it possible to design an universal API for federated learning using which an ad-hoc group of data-holders (agents) collaborate with each other and perform federated learning? Such an API would necessarily need to be model-agnostic i.e. make no assumption about the model architecture being used by the agents, and also cannot rely on having representative public data at hand. Knowledge distillation (KD) is the obvious tool of choice to design such protocols. However, surprisingly, we show that most natural KD-based federated learning protocols have poor performance. To investigate this, we propose a new theoretical framework, Federated Kernel ridge regression, which can capture both model heterogeneity as well as data heterogeneity. Our analysis shows that the degradation is largely due to a fundamental limitation of knowledge distillation under data heterogeneity. We further validate our framework by analyzing and designing new protocols based on KD. Their performance on real world experiments using neural networks, though still unsatisfactory, closely matches our theoretical predictions.
    A Word is Worth A Thousand Dollars: Adversarial Attack on Tweets Fools Stock Prediction. (arXiv:2205.01094v2 [cs.CR] UPDATED)
    More and more investors and machine learning models rely on social media (e.g., Twitter and Reddit) to gather real-time information and sentiment to predict stock price movements. Although text-based models are known to be vulnerable to adversarial attacks, whether stock prediction models have similar vulnerability is underexplored. In this paper, we experiment with a variety of adversarial attack configurations to fool three stock prediction victim models. We address the task of adversarial generation by solving combinatorial optimization problems with semantics and budget constraints. Our results show that the proposed attack method can achieve consistent success rates and cause significant monetary loss in trading simulation by simply concatenating a perturbed but semantically similar tweet.
    Salient Object Detection via Bounding-box Supervision. (arXiv:2205.05245v1 [cs.CV])
    The success of fully supervised saliency detection models depends on a large amount of pixel-wise labeling. In this paper, we work on bounding-box based weakly-supervised saliency detection to relieve the labeling effort. Given the bounding box annotation, we observe that pixels inside the bounding box may contain extensive labeling noise. However, as a large amount of background is excluded, the foreground bounding box region contains a less complex background, making it possible to perform handcrafted features-based saliency detection with only the cropped foreground region. As the conventional handcrafted features are not representative enough, leading to noisy saliency maps, we further introduce a structure-aware self-supervised loss to regularize the structure of the prediction. Further, we claim that pixels outside the bounding box should be background, so a partial cross-entropy loss function can be used to accurately localize the background region. Experimental results on six benchmark RGB saliency datasets illustrate the effectiveness of our model.
    Social Inclusion in Curated Contexts: Insights from Museum Practices. (arXiv:2205.05192v1 [cs.LG])
    Artificial intelligence literature suggests that minority and fragile communities in society can be negatively impacted by machine learning algorithms due to inherent biases in the design process, which lead to socially exclusive decisions and policies. Faced with similar challenges in dealing with an increasingly diversified audience, the museum sector has seen changes in theory and practice, particularly in the areas of representation and meaning-making. While rarity and grandeur used to take centre stage in early museum practice, folk life and museums' relationships with the diverse communities they serve have become a widely integrated part of contemporary practice. These changes address issues of diversity and accessibility in order to offer more socially inclusive services. Drawing on these changes and reflecting back on the AI world, we argue that the museum experience provides useful lessons for building AI with socially inclusive approaches, especially in situations in which both a collection and access to it will need to be curated or filtered, as frequently happens in search engines, recommender systems and digital libraries. We highlight three principles: (1) Instead of upholding the value of neutrality, practitioners are aware of the influences of their own backgrounds and those of others on their work. By not claiming to be neutral but practising cultural humility, the chances of addressing potential biases can be increased. (2) There should be room for situational interpretation beyond the stages of data collection and machine learning. Before applying models and predictions, the contexts in which relevant parties exist should be taken into account. (3) Community participation serves the needs of communities and has the added benefit of bringing practitioners and communities together.
    Learning Multitask Gaussian Bayesian Networks. (arXiv:2205.05343v1 [stat.ML])
    Major depressive disorder (MDD) requires study of brain functional connectivity alterations for patients, which can be uncovered by resting-state functional magnetic resonance imaging (rs-fMRI) data. We consider the problem of identifying alterations of brain functional connectivity for a single MDD patient. This is particularly difficult since the amount of data collected during an fMRI scan is too limited to provide sufficient information for individual analysis. Additionally, rs-fMRI data usually have the characteristics of incompleteness, sparsity, variability, high dimensionality and high noise. To address these problems, we propose a multitask Gaussian Bayesian network (MTGBN) framework capable of identifying individual disease-induced alterations for MDD patients. We assume that such disease-induced alterations show some degree of similarity across patients, so network structures can be learned jointly from related tasks rather than from each patient's observations alone. First, we treat each patient in a class of observations as a task and learn the Gaussian Bayesian networks (GBNs) of this data class from all tasks, which share a default covariance matrix encoding prior knowledge. This setting helps us learn more information from limited data. Next, we derive a closed-form formula of the complete likelihood function and use the Monte-Carlo Expectation-Maximization (MCEM) algorithm to efficiently search for approximately optimal Bayesian network structures. Finally, we assess the performance of our methods on simulated and real-world rs-fMRI data.
    CNN-LSTM Based Multimodal MRI and Clinical Data Fusion for Predicting Functional Outcome in Stroke Patients. (arXiv:2205.05545v1 [eess.IV])
    Clinical outcome prediction plays an important role in stroke patient management. From a machine learning point of view, one of the main challenges is dealing with heterogeneous data at patient admission, i.e. the image data, which are multidimensional, and the clinical data, which are scalars. In this paper, a multimodal convolutional neural network - long short-term memory (CNN-LSTM) based ensemble model is proposed. For each MR image module, a dedicated network provides a preliminary prediction of the clinical outcome using the modified Rankin scale (mRS). The final mRS score is obtained by merging the preliminary probabilities of each module dedicated to a specific type of MR image, weighted by the clinical metadata, here age or the National Institutes of Health Stroke Scale (NIHSS). The experimental results demonstrate that the proposed model surpasses the baselines and offers an original way to automatically encode the spatio-temporal context of MR images in a deep learning architecture. The highest AUC (0.77) was achieved for the proposed model with NIHSS.
    On Distributed Adaptive Optimization with Gradient Compression. (arXiv:2205.05632v1 [stat.ML])
    We study COMP-AMS, a distributed optimization framework based on gradient averaging and the adaptive AMSGrad algorithm. Gradient compression with error feedback is applied to reduce the communication cost in the gradient transmission process. Our convergence analysis of COMP-AMS shows that such a compressed gradient averaging strategy yields the same convergence rate as standard AMSGrad, and also exhibits the linear speedup effect w.r.t. the number of local workers. Compared with recently proposed protocols on distributed adaptive methods, COMP-AMS is simple and convenient. Numerical experiments are conducted to justify the theoretical findings, and demonstrate that the proposed method can achieve the same test accuracy as full-gradient AMSGrad with substantial communication savings. With its simplicity and efficiency, COMP-AMS can serve as a useful distributed training framework for adaptive gradient methods.
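    A minimal sketch of one worker's gradient compression with error feedback, the communication-saving step described above; the top-k compressor, the value of k, and the flat-vector gradients are illustrative assumptions, and the server side (averaging the compressed gradients and running an AMSGrad step) is omitted.

        import numpy as np

        class TopKCompressor:
            def __init__(self, dim, k):
                self.residual = np.zeros(dim)    # error-feedback memory
                self.k = k

            def compress(self, grad):
                corrected = grad + self.residual          # add back past error
                idx = np.argsort(np.abs(corrected))[-self.k:]
                sparse = np.zeros_like(corrected)
                sparse[idx] = corrected[idx]              # keep k largest entries
                self.residual = corrected - sparse        # remember what was dropped
                return sparse                             # transmit this to the server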
    RvS: What is Essential for Offline RL via Supervised Learning?. (arXiv:2112.10751v2 [cs.LG] UPDATED)
    Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL. When does this hold true, and which algorithmic components are necessary? Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex methods based on TD learning or sequence modeling with Transformers. Carefully choosing model capacity (e.g., via regularization or architecture) and choosing which information to condition on (e.g., goals or rewards) are critical for performance. These insights serve as a field guide for practitioners doing Reinforcement Learning via Supervised Learning (which we coin "RvS learning"). They also probe the limits of existing RvS methods, which are comparatively weak on random data, and suggest a number of open problems.
    Self Reward Design with Fine-grained Interpretability. (arXiv:2112.15034v2 [cs.LG] UPDATED)
    Transparency and fairness issues stem from the black-box nature of deep neural networks (DNNs). They are also relevant to deep reinforcement learning, which uses DNNs to learn its policies, value functions, etc. This paper proposes a way to circumvent these issues through the bottom-up design of neural networks (NNs) with detailed interpretability, where each neuron or layer has its own meaning and utility corresponding to a humanly understandable concept. With deliberate design, we show that lavaland problems can be solved using an NN model with few parameters. Furthermore, we introduce Self Reward Design (SRD), inspired by Inverse Reward Design, so that our interpretable design can (1) solve the problem by pure design (although imperfectly), (2) be optimized via SRD, and (3) avoid unknown states by recognizing the inactivations of neurons aggregated as the activation in \(w_{unknown}\).
    RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization. (arXiv:2205.05671v1 [cs.CV])
    This paper explores training efficient VGG-style super-resolution (SR) networks with the structural re-parameterization technique. The general pipeline of re-parameterization is to train networks with multi-branch topology first, and then merge them into standard 3x3 convolutions for efficient inference. In this work, we revisit those primary designs and investigate essential components for re-parameterizing SR networks. First of all, we find that batch normalization (BN) is important to bring training non-linearity and improve the final performance. However, BN is typically ignored in SR, as it usually degrades the performance and introduces unpleasant artifacts. We carefully analyze the cause of the BN issue and then propose a straightforward yet effective solution. In particular, we first train SR networks with mini-batch statistics as usual, and then switch to using population statistics in the later training period. Having successfully re-introduced BN into SR, we further design a new re-parameterizable block tailored for SR, namely RepSR. It consists of a clean residual path and two expand-and-squeeze convolution paths with the modified BN. Extensive experiments demonstrate that our simple RepSR achieves superior performance to previous SR re-parameterization methods across different model sizes. In addition, our RepSR can achieve a better trade-off between performance and actual running time (throughput) than previous SR methods. Codes will be available at https://github.com/TencentARC/RepSR.
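    A minimal PyTorch sketch of this BN schedule, assuming a hypothetical switch_epoch: train with mini-batch statistics as usual, then put BN layers into eval mode so they use population (running) statistics for the remaining epochs while their affine parameters keep training.

        import torch.nn as nn

        def use_population_stats(model):
            for m in model.modules():
                if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
                    m.eval()   # use running stats; affine params still get gradients

        # for epoch in range(num_epochs):
        #     model.train()
        #     if epoch >= switch_epoch:
        #         use_population_stats(model)  # re-apply after every model.train()
        #     train_one_epoch(model, ...)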
    Post-Hoc Explanations Fail to Achieve their Purpose in Adversarial Contexts. (arXiv:2201.10295v2 [cs.LG] UPDATED)
    Existing and planned legislation stipulates various obligations to provide information about machine learning algorithms and their functioning, often interpreted as obligations to "explain". Many researchers suggest using post-hoc explanation algorithms for this purpose. In this paper, we combine legal, philosophical and technical arguments to show that post-hoc explanation algorithms are unsuitable to achieve the law's objectives. Indeed, most situations where explanations are requested are adversarial, meaning that the explanation provider and receiver have opposing interests and incentives, so that the provider might manipulate the explanation for her own ends. We show that this fundamental conflict cannot be resolved because of the high degree of ambiguity of post-hoc explanations in realistic application scenarios. As a consequence, post-hoc explanation algorithms are unsuitable to achieve the transparency objectives inherent to the legal norms. Instead, there is a need to more explicitly discuss the objectives underlying "explainability" obligations as these can often be better achieved through other mechanisms. There is an urgent need for a more open and honest discussion regarding the potential and limitations of post-hoc explanations in adversarial contexts, in particular in light of the current negotiations of the European Union's draft Artificial Intelligence Act.
    Hierarchical Collaborative Hyper-parameter Tuning. (arXiv:2205.05272v1 [cs.LG])
    Hyper-parameter tuning is among the most critical stages in building machine learning solutions. This paper demonstrates how multi-agent systems can be utilized to develop a distributed technique for determining near-optimal values for any arbitrary set of hyper-parameters in a machine learning model. The proposed method employs a distributedly formed hierarchical agent-based architecture for the cooperative search procedure of tuning hyper-parameter values. The presented generic model is used to develop a guided randomized agent-based tuning technique, and its behavior is investigated in both machine learning and global function optimization applications. According to the empirical results, the proposed model outperformed both of its underlying randomized tuning strategies in terms of classification error and number of function evaluations, notably in higher-dimensional settings.
    Efficient Automated Deep Learning for Time Series Forecasting. (arXiv:2205.05511v1 [cs.LG])
    Recent years have witnessed tremendously improved efficiency of Automated Machine Learning (AutoML), especially Automated Deep Learning (AutoDL) systems, but recent work focuses on tabular, image, or NLP tasks. So far, little attention has been paid to general AutoDL frameworks for time series forecasting, despite the enormous success in applying different novel architectures to such tasks. In this paper, we propose an efficient approach for the joint optimization of neural architecture and hyperparameters of the entire data processing pipeline for time series forecasting. In contrast to common NAS search spaces, we designed a novel neural architecture search space covering various state-of-the-art architectures, allowing for an efficient macro-search over different DL approaches. To efficiently search in such a large configuration space, we use Bayesian optimization with multi-fidelity optimization. We empirically study several different budget types enabling efficient multi-fidelity optimization on different forecasting datasets. Furthermore, we compared our resulting system, dubbed Auto-PyTorch-TS, against several established baselines and show that it significantly outperforms all of them across several datasets.
    DoubleMatch: Improving Semi-Supervised Learning with Self-Supervision. (arXiv:2205.05575v1 [cs.LG])
    Following the success of supervised learning, semi-supervised learning (SSL) is now becoming increasingly popular. SSL is a family of methods, which in addition to a labeled training set, also use a sizable collection of unlabeled data for fitting a model. Most of the recent successful SSL methods are based on pseudo-labeling approaches: letting confident model predictions act as training labels. While these methods have shown impressive results on many benchmark datasets, a drawback of this approach is that not all unlabeled data are used during training. We propose a new SSL algorithm, DoubleMatch, which combines the pseudo-labeling technique with a self-supervised loss, enabling the model to utilize all unlabeled data in the training process. We show that this method achieves state-of-the-art accuracies on multiple benchmark datasets while also reducing training times compared to existing SSL methods. Code is available at https://github.com/walline/doublematch.
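    A schematic version of the combined loss, assuming FixMatch-style weak/strong augmented views and a cosine-similarity self-supervised term; the confidence threshold and loss weights are illustrative hyper-parameters, not the paper's exact settings.

        import torch
        import torch.nn.functional as F

        def doublematch_loss(logits_lab, y_lab,
                             logits_weak, logits_strong,
                             feat_weak, feat_strong,
                             tau=0.95, w_pl=1.0, w_ss=1.0):
            # Supervised loss on the labeled batch.
            sup = F.cross_entropy(logits_lab, y_lab)
            # Pseudo-label loss on confident unlabeled samples only.
            probs = logits_weak.softmax(dim=1).detach()
            conf, pseudo = probs.max(dim=1)
            mask = (conf >= tau).float()
            pl = (F.cross_entropy(logits_strong, pseudo, reduction="none") * mask).mean()
            # Self-supervised feature matching on *all* unlabeled samples.
            ss = -F.cosine_similarity(feat_strong, feat_weak.detach(), dim=1).mean()
            return sup + w_pl * pl + w_ss * ss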
    Deep fusion of gray level co-occurrence matrices for lung nodule classification. (arXiv:2205.05123v1 [eess.IV])
    Lung cancer is a severe menace to human health: millions of people die because of late diagnosis, so it is vital to detect the disease as early as possible. Computerized tomography (CT) scans of the chest are considered one of the most efficient solutions for detecting and classifying lung nodules, and the need for highly accurate analysis of chest CT images is one of the crucial challenges in detecting and classifying lung cancer. A new long short-term memory (LSTM) based deep fusion structure is introduced, in which texture features computed from lung nodules through new volumetric grey-level co-occurrence matrix (GLCM) computations are used to classify the nodules into benign, malignant and ambiguous. An improved Otsu segmentation method combined with the water strider optimization algorithm (WSA) is proposed to detect the lung nodules; Otsu-WSA thresholding overcomes restrictions present in previous thresholding methods. Extended experiments assess this fusion structure by considering a 2D-GLCM computation based on 2D-slice fusion, and an approximation of the 3D-GLCM with a volumetric 2.5D-GLCM computation based LSTM fusion structure. The proposed methods are trained and assessed on the LIDC-IDRI dataset, obtaining accuracy, sensitivity, and specificity of 94.4%, 91.6%, and 95.8%, respectively, for 2D-GLCM fusion, and 97.33%, 96%, and 98%, respectively, for 2.5D-GLCM fusion; the corresponding values for 3D-GLCM fusion are 98.7%, 98%, and 99%. The obtained results and analysis indicate that the WSA-Otsu method requires less execution time and yields a more accurate thresholding process, and that the 3D-GLCM based LSTM outperforms its counterparts.
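    A minimal sketch of per-slice GLCM texture features with scikit-image (named graycomatrix/graycoprops in recent versions); the distances, angles, and property set are illustrative, and the paper's volumetric 2.5D/3D computations would aggregate such matrices across slices.

        import numpy as np
        from skimage.feature import graycomatrix, graycoprops

        slice_2d = (np.random.rand(64, 64) * 255).astype(np.uint8)  # stand-in CT slice
        glcm = graycomatrix(slice_2d, distances=[1], angles=[0, np.pi / 2],
                            levels=256, symmetric=True, normed=True)
        # Scalar texture descriptors averaged over the distance/angle grid:
        features = {prop: graycoprops(glcm, prop).mean()
                    for prop in ("contrast", "homogeneity", "energy", "correlation")}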
    Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection. (arXiv:2205.05206v1 [eess.AS])
    Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of sacrificing some accuracy on active speaker detection. This work closes this gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training we reduce the ASD classification error by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.
    Quantum Self-Attention Neural Networks for Text Classification. (arXiv:2205.05625v1 [quant-ph])
    An emerging direction of quantum computing is to establish meaningful quantum applications in various fields of artificial intelligence, including natural language processing (NLP). Although some efforts based on syntactic analysis have opened the door to research in Quantum NLP (QNLP), limitations such as heavy syntactic preprocessing and syntax-dependent network architecture make them impracticable on larger and real-world data sets. In this paper, we propose a new simple network architecture, called the quantum self-attention neural network (QSANN), which can make up for these limitations. Specifically, we introduce the self-attention mechanism into quantum neural networks and then utilize a Gaussian projected quantum self-attention serving as a sensible quantum version of self-attention. As a result, QSANN is effective and scalable on larger data sets and has the desirable property of being implementable on near-term quantum devices. In particular, our QSANN outperforms the best existing QNLP model based on syntactic analysis as well as a simple classical self-attention neural network in numerical experiments of text classification tasks on public data sets. We further show that our method exhibits robustness to low-level quantum noises.
    Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis. (arXiv:2205.05662v1 [cs.LG])
    Advanced deep neural networks (DNNs), designed by either human or AutoML algorithms, are growing increasingly complex. Diverse operations are connected by complicated connectivity patterns, e.g., various types of skip connections. Those topological compositions are empirically effective and observed to smooth the loss landscape and facilitate the gradient flow in general. However, it remains elusive to derive any principled understanding of their effects on the DNN capacity or trainability, and to understand why or in which aspect one specific connectivity pattern is better than another. In this work, we theoretically characterize the impact of connectivity patterns on the convergence of DNNs under gradient descent training in fine granularity. By analyzing a wide network's Neural Network Gaussian Process (NNGP), we are able to depict how the spectrum of an NNGP kernel propagates through a particular connectivity pattern, and how that affects the bound of convergence rates. As one practical implication of our results, we show that by a simple filtration on "unpromising" connectivity patterns, we can trim down the number of models to evaluate, and significantly accelerate the large-scale neural architecture search without any overhead. Codes will be released at https://github.com/chenwydj/architecture_convergence.
    Human Language Modeling. (arXiv:2205.05128v1 [cs.CL])
    Natural language is generated by people, yet traditional language modeling views words or documents as if generated independently. Here, we propose human language modeling (HuLM), a hierarchical extension to the language modeling problem whereby a human-level exists to connect sequences of documents (e.g. social media messages) and capture the notion that human language is moderated by changing human states. We introduce, HaRT, a large-scale transformer model for the HuLM task, pre-trained on approximately 100,000 social media users, and demonstrate its effectiveness in terms of both language modeling (perplexity) for social media and fine-tuning for 4 downstream tasks spanning document- and user-levels: stance detection, sentiment classification, age estimation, and personality assessment. Results on all tasks meet or surpass the current state-of-the-art.
    DeepFilterNet2: Towards Real-Time Speech Enhancement on Embedded Devices for Full-Band Audio. (arXiv:2205.05474v1 [eess.AS])
    Deep learning-based speech enhancement has seen huge improvements and has recently also expanded to full-band audio (48 kHz). However, many approaches have a rather high computational complexity and require large temporal buffers for real-time usage, e.g. due to temporal convolutions or attention. Both make those approaches infeasible on embedded devices. This work further extends DeepFilterNet, which exploits the harmonic structure of speech to allow for efficient speech enhancement (SE). Several optimizations in the training procedure, data augmentation, and network structure result in state-of-the-art SE performance while reducing the real-time factor to 0.04 on a notebook Core-i5 CPU. This makes the algorithm applicable to run on embedded devices in real time. The DeepFilterNet framework can be obtained under an open source license.
    The Multiscale Structure of Neural Network Loss Functions: The Effect on Optimization and Origin. (arXiv:2204.11326v2 [cs.LG] UPDATED)
    Local quadratic approximation has been extensively used to study the optimization of neural network loss functions around the minimum. However, it usually holds only in a very small neighborhood of the minimum and cannot explain many phenomena observed during the optimization process. In this work, we study the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss shows several separate scales clearly. Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon [4] observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay by simple examples. Finally, we study the origin of the multiscale structure and propose that the non-uniformity of training data is one of its causes. By constructing a two-layer neural network problem, we show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth or multiple separate scales.
    RISP: Rendering-Invariant State Predictor with Differentiable Simulation and Rendering for Cross-Domain Parameter Estimation. (arXiv:2205.05678v1 [cs.CV])
    This work considers identifying parameters characterizing a physical system's dynamic motion directly from a video whose rendering configurations are inaccessible. Existing solutions require massive training data or lack generalizability to unknown rendering configurations. We propose a novel approach that marries domain randomization and differentiable rendering gradients to address this problem. Our core idea is to train a rendering-invariant state-prediction (RISP) network that transforms image differences into state differences independent of rendering configurations, e.g., lighting, shadows, or material reflectance. To train this predictor, we formulate a new loss on rendering variances using gradients from differentiable rendering. Moreover, we present an efficient, second-order method to compute the gradients of this loss, allowing it to be integrated seamlessly into modern deep learning frameworks. We evaluate our method in rigid-body and deformable-body simulation environments using four tasks: state estimation, system identification, imitation learning, and visuomotor control. We further demonstrate the efficacy of our approach on a real-world example: inferring the state and action sequences of a quadrotor from a video of its motion sequences. Compared with existing methods, our approach achieves significantly lower reconstruction errors and has better generalizability among unknown rendering configurations.
    Spatial-temporal associations representation and application for process monitoring using graph convolution neural network. (arXiv:2205.05250v1 [cs.LG])
    Industrial process data reflect dynamic changes in operating conditions, which mainly manifest as irregular changes in the dynamic associations between different variables over time. This association knowledge, valuable for process monitoring, is often only implicit in the dynamic monitoring data, which carry rich operating-condition information yet have not received enough attention in current research. To this end, a new process monitoring method based on a spatial-based graph convolution neural network (SGCN) is proposed to describe the characteristics of the dynamic associations, which can be used to represent the operation status over time. Spatial-temporal graphs are first defined, which can represent the characteristics of node attributes (dynamic edge features) changing with time. Then, the associations between monitoring variables at a certain time are treated as node attributes to define a snapshot of the static graph network at that time. Finally, the snapshots containing graph structure and node attributes are used as model inputs, which a spatial-based graph convolution neural network with aggregate and readout steps processes to perform graph classification. The feasibility and applicability of the proposed method are demonstrated by experimental results on a benchmark and a practical case application.
    Reducing a complex two-sided smartwatch examination for Parkinson's Disease to an efficient one-sided examination preserving machine learning accuracy. (arXiv:2205.05361v1 [cs.LG])
    Sensors from smart consumer devices have demonstrated high potential to serve as digital biomarkers in the identification of movement disorders in recent years. Using broadly available smartwatches, we have recorded participants performing technology-based assessments in a prospective study to research Parkinson's Disease (PD). In total, 504 participants, including PD patients, differential diagnoses (DD) and healthy controls (HC), were captured with a comprehensive system utilizing two smartwatches and two smartphones. To the best of our knowledge, this study provided the largest PD sample size of two-hand synchronous smartwatch measurements. To establish a future easy-to-use home-based assessment system for PD screening, we systematically evaluated the performance of the system based on a significantly reduced set of assessments with only one-sided measures, and assessed whether we can maintain classification accuracy.
    Making Pre-trained Language Models Good Long-tailed Learners. (arXiv:2205.05461v1 [cs.CL])
    Prompt-tuning has shown appealing performance in few-shot classification by virtue of its capability in effectively exploiting pre-trained knowledge. This motivates us to check the hypothesis that prompt-tuning is also a promising choice for long-tailed classification, since the tail classes are intuitively few-shot ones. To achieve this aim, we conduct empirical studies to examine the hypothesis. The results demonstrate that prompt-tuning exactly makes pre-trained language models at least good long-tailed learners. For intuitions on why prompt-tuning can achieve good performance in long-tailed classification, we carry out an in-depth analysis by progressively bridging the gap between prompt-tuning and commonly used fine-tuning. The summary is that the classifier structure and parameterization form the key to making good long-tailed learners, in comparison with the less important input structure. Finally, we verify the applicability of our finding to few-shot classification.
    Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. (arXiv:2205.05638v1 [cs.LG])
    Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new parameter-efficient fine-tuning method called (IA)$^3$ that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark, attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available.
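    To make the activation-scaling idea concrete, here is a hedged sketch of an (IA)$^3$-style module: a learned per-dimension vector, initialized to ones, rescales a frozen block's keys, values, and intermediate feed-forward activations, and only these vectors are trained. Shapes and placement are our assumptions, not the released T-Few code.

        import torch
        import torch.nn as nn

        class IA3Scale(nn.Module):
            """Multiplies an activation by a learned per-dimension vector (init = 1)."""
            def __init__(self, dim):
                super().__init__()
                self.scale = nn.Parameter(torch.ones(dim))

            def forward(self, x):
                return x * self.scale

        d_model = 512
        l_k, l_v, l_ff = IA3Scale(d_model), IA3Scale(d_model), IA3Scale(4 * d_model)

        # Inside a frozen attention block one would apply, e.g.:
        #   k = l_k(W_k(x));  v = l_v(W_v(x));  h = l_ff(act(W_1(x)))
        # Only l_k, l_v, l_ff (three vectors per block) receive gradients.
        x = torch.randn(2, 10, d_model)
        print(l_k(x).shape)  # torch.Size([2, 10, 512])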
    Choice of training label matters: how to best use deep learning for quantitative MRI parameter estimation. (arXiv:2205.05587v1 [physics.med-ph])
    Deep learning (DL) is gaining popularity as a parameter estimation method for quantitative MRI. A range of competing implementations have been proposed, relying on either supervised or self-supervised learning. Self-supervised approaches, sometimes referred to as unsupervised, have been loosely based on auto-encoders, whereas supervised methods have, to date, been trained on groundtruth labels. These two learning paradigms have been shown to have distinct strengths. Notably, self-supervised approaches have offered lower-bias parameter estimates than their supervised alternatives. This result is counterintuitive: incorporating prior knowledge with supervised labels should, in theory, lead to improved accuracy. In this work, we show that this apparent limitation of supervised approaches stems from the naive choice of groundtruth training labels. By training on labels which are deliberately not groundtruth, we show that the low-bias parameter estimation previously associated with self-supervised methods can be replicated - and improved on - within a supervised learning framework. This approach sets the stage for a single, unifying, deep learning parameter estimation framework, based on supervised learning, where trade-offs between bias and variance are made by careful adjustment of the training labels.
    Benchmarking Graph Neural Networks. (arXiv:2003.00982v4 [cs.LG] UPDATED)
    In the last few years, graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. This emerging field has witnessed an extensive growth of promising techniques that have been applied with success to computer science, mathematics, biology, physics and chemistry. But for any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of May 2022, the GitHub repository has reached 1,800 stars and 339 forks, which demonstrates the utility of the proposed open-source framework through the wide usage by the GNN community. In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC, but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest in exploring more powerful PE for Transformers and GNNs in a robust experimental setting.
    CVTT: Cross-Validation Through Time. (arXiv:2205.05393v1 [cs.LG])
    The practical aspects of evaluating recommender systems are an actively discussed topic in the research community. While many current evaluation techniques reduce performance to a single-value metric as a straightforward approach for model comparison, this relies on a strong assumption that the methods' performance is stable over time. In this paper, we argue that leaving out a method's continuous performance can lead to losing valuable insight into joint data-method effects. We propose the Cross-Validation Through Time (CVTT) technique to perform more detailed evaluations, which focuses on model cross-validation performance over time. Using the proposed technique, we conduct a detailed analysis of popular RecSys algorithms' performance against various metrics and datasets. We also compare several data preparation and evaluation strategies to analyze their impact on model performance. Our results show that model performance can vary significantly over time, and both data and evaluation setup can have a marked effect on it.
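    The sketch below shows one way such a continuous evaluation could be organized (our illustration, not the paper's exact protocol): slide a time cutoff across the dataset, train on everything before it, test on the next slice, and keep the whole per-fold curve instead of a single number. The column name and the metric/fit callables are placeholders.

        import pandas as pd

        def cross_validation_through_time(df, n_folds, metric, fit, ts_col="timestamp"):
            """Train on everything before each time cutoff, test on the next slice."""
            cutoffs = df[ts_col].quantile([i / (n_folds + 1) for i in range(1, n_folds + 1)])
            scores = []
            for i, cut in enumerate(cutoffs):
                train = df[df[ts_col] <= cut]
                nxt = cutoffs.iloc[i + 1] if i + 1 < len(cutoffs) else df[ts_col].max()
                test = df[(df[ts_col] > cut) & (df[ts_col] <= nxt)]
                model = fit(train)
                scores.append(metric(model, test))
            return scores  # one value per fold: a performance curve, not a single number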
    End-to-End Multi-Person Audio/Visual Automatic Speech Recognition. (arXiv:2205.05586v1 [eess.AS])
    Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube videos illustrate that the proposed approach can automatically select the proper face tracks with minor WER degradation compared to an oracle selection of the speaking face while still showing benefits of employing the visual signal instead of the audio alone.
    Spatial Graph Attention and Curiosity-driven Policy for Antiviral Drug Discovery. (arXiv:2106.02190v6 [cs.LG] UPDATED)
    We developed Distilled Graph Attention Policy Network (DGAPN), a reinforcement learning model to generate novel graph-structured chemical representations that optimize user-defined objectives by efficiently navigating a physically constrained domain. The framework is examined on the task of generating molecules that are designed to bind, noncovalently, to functional sites of SARS-CoV-2 proteins. We present a spatial Graph Attention (sGAT) mechanism that leverages self-attention over both node and edge attributes as well as encoding the spatial structure -- this capability is of considerable interest in synthetic biology and drug discovery. An attentional policy network is introduced to learn the decision rules for a dynamic, fragment-based chemical environment, and state-of-the-art policy gradient techniques are employed to train the network with stability. Exploration is driven by the stochasticity of the action space design and the innovation reward bonuses learned and proposed by random network distillation. In experiments, our framework achieved outstanding results compared to state-of-the-art algorithms, while reducing the complexity of paths to chemical synthesis.
    Learning Spatiotemporal Chaos Using Next-Generation Reservoir Computing. (arXiv:2203.13294v2 [cs.LG] UPDATED)
    Forecasting the behavior of high-dimensional dynamical systems using machine learning requires efficient methods to learn the underlying physical model. We demonstrate spatiotemporal chaos prediction using a machine learning architecture that, when combined with a next-generation reservoir computer, displays state-of-the-art performance with a training time $10^3-10^4$ times faster and training data set $\sim 10^2$ times smaller than other machine learning algorithms. We also take advantage of the translational symmetry of the model to further reduce the computational cost and training data, each by a factor of $\sim$10.
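    For intuition, a next-generation reservoir computer can be sketched in a few lines: the feature vector concatenates time-delayed copies of the input with their quadratic monomials, and a single ridge regression maps features to the next state. The delay depth, feature set, and random data below are placeholders, not the paper's configuration.

        import numpy as np

        def ngrc_features(u, k=2):
            """Stack k time-delayed copies of u and all quadratic monomials."""
            T, d = u.shape
            lin = np.hstack([u[k - 1 - j:T - j] for j in range(k)])  # (T-k+1, k*d)
            quad = np.einsum('ti,tj->tij', lin, lin)
            iu = np.triu_indices(k * d)
            return np.hstack([lin, quad[:, iu[0], iu[1]]])

        rng = np.random.default_rng(0)
        u = rng.standard_normal((500, 3))          # stand-in trajectory, 3 variables
        X = ngrc_features(u[:-1])                  # features up to time t
        Y = u[2:]                                  # target: state at time t+1
        W = np.linalg.solve(X.T @ X + 1e-6 * np.eye(X.shape[1]), X.T @ Y)  # ridge fit
        print(np.mean((X @ W - Y) ** 2))           # one-step training error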
    Performance of a deep learning system for detection of referable diabetic retinopathy in real clinical settings. (arXiv:2205.05554v1 [eess.IV])
    Background: To determine the ability of a commercially available deep learning system, RetCAD v.1.3.1 (Thirona, Nijmegen, The Netherlands), for the automatic detection of referable diabetic retinopathy (DR) on a dataset of colour fundus images acquired during routine clinical practice in a tertiary hospital screening program, analyzing the workload reduction that could be achieved by incorporating this artificial intelligence-based technology. Methods: Evaluation of the software was performed on a dataset of 7195 nonmydriatic fundus images from 6325 eyes of 3189 diabetic patients attending our screening program between February and December 2019. The software generated a DR severity score for each colour fundus image, which was combined into an eye-level score. This score was then compared with a reference standard set by a human expert using receiver operating characteristic (ROC) curve analysis. Results: The artificial intelligence (AI) software achieved an area under the ROC curve (AUC) value of 0.988 [0.981:0.993] for the detection of referable DR. At the proposed operating point, the sensitivity of the RetCAD software for DR is 90.53% and the specificity is 97.13%. A workload reduction of 96% could be achieved at the cost of only 6 false negatives. Conclusions: The AI software correctly identified the vast majority of referable DR cases, with a workload reduction of 96% of the cases that would need to be checked, while missing almost no true cases, so it may therefore be used as an instrument for triage.
    Deep Graph Clustering via Mutual Information Maximization and Mixture Model. (arXiv:2205.05168v1 [cs.LG])
    Attributed graph clustering or community detection which learns to cluster the nodes of a graph is a challenging task in graph analysis. In this paper, we introduce a contrastive learning framework for learning clustering-friendly node embedding. Although graph contrastive learning has shown outstanding performance in self-supervised graph learning, using it for graph clustering is not well explored. We propose Gaussian mixture information maximization (GMIM) which utilizes a mutual information maximization approach for node embedding. Meanwhile, it assumes that the representation space follows a Mixture of Gaussians (MoG) distribution. The clustering part of our objective tries to fit a Gaussian distribution to each community. The node embedding is jointly optimized with the parameters of MoG in a unified framework. Experiments on real-world datasets demonstrate the effectiveness of our method in community detection.
    Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise. (arXiv:2102.04297v4 [cs.LG] UPDATED)
    The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, as sharp minima are known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks, and it was shown in \c{S}im\c{s}ekli (2019a,b) that SGD can escape sharp local minima under the presence of such heavy-tailed gradient noise, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noises. First, we show that the truncation threshold and the width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases, the dynamics of heavy-tailed truncated SGD closely resemble those of a continuous-time Markov chain that never visits any sharp minima. Real-data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds "flatter" local minima and achieves better generalization.
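    A toy illustration of the dynamics being analyzed (entirely our construction, not the paper's experiments): gradient descent with heavy-tailed Cauchy noise on a 1D loss with a sharp well at x = 0 and a flat well near x = 4, run with and without gradient truncation. With truncation, trajectories tend to leave the narrow well quickly and settle in the wide one.

        import numpy as np

        # loss(x) = -exp(-20 x^2) + 0.05 (x - 4)^2: sharp well at 0, flat well near 4
        grad = lambda x: 40 * x * np.exp(-20 * x ** 2) + 0.1 * (x - 4)

        def run(clip=None, lr=0.05, steps=20000, seed=0):
            rng = np.random.default_rng(seed)
            x = 0.0                                   # start inside the sharp minimum
            for _ in range(steps):
                g = grad(x) + rng.standard_cauchy()   # heavy-tailed gradient noise
                if clip is not None:
                    g = np.clip(g, -clip, clip)
                x -= lr * g
            return x

        print("no truncation:", run())          # huge noise spikes can move x anywhere
        print("truncated    :", run(clip=5.0))  # typically ends near the flat well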
    A simple framework for contrastive learning phases of matter. (arXiv:2205.05607v1 [cond-mat.dis-nn])
    A main task in condensed-matter physics is to recognize, classify, and characterize phases of matter and the corresponding phase transitions, for which machine learning provides a new class of research tools due to the remarkable development in computing power and algorithms. Despite much exploration in this new field, usually different methods and techniques are needed for different scenarios. Here, we present SimCLP: a simple framework for contrastive learning phases of matter, which is inspired by the recent development in contrastive learning of visual representations. We demonstrate the success of this framework on several representative systems, including classical and quantum, single-particle and many-body, conventional and topological. SimCLP is flexible and free of usual burdens such as manual feature engineering and prior knowledge. The only prerequisite is to prepare enough state configurations. Furthermore, it can generate representation vectors and labels and hence help tackle other problems. SimCLP therefore paves an alternative way to the development of a generic tool for identifying unexplored phase transitions.
    A Model-Free Sampling Method for Estimating Basins of Attraction Using Hybrid Active Learning (HAL). (arXiv:2003.10976v3 [cs.LG] UPDATED)
    Understanding the basins of attraction (BoA) is often a paramount consideration for nonlinear systems. Most existing approaches to determining a high-resolution BoA require prior knowledge of the system's dynamical model (e.g., differential equation or point mapping for continuous systems, cell mapping for discrete systems, etc.), which allows derivation of approximate analytical solutions or parallel computing on a multi-core computer to find the BoA efficiently. However, these methods are typically impractical when the BoA must be determined experimentally or when the system's model is unknown. This paper introduces a model-free sampling method for BoA. The proposed method is based upon hybrid active learning (HAL) and is designed to find and label the "informative" samples, which efficiently determine the boundary of BoA. It consists of three primary parts: 1) additional sampling on trajectories (AST) to maximize the number of samples obtained from each simulation or experiment; 2) an active learning (AL) algorithm to exploit the local boundary of BoA; and 3) a density-based sampling (DBS) method to explore the global boundary of BoA. An example of estimating the BoA for a bistable nonlinear system is presented to show the high efficiency of our HAL sampling method.
    Stochastic differential equations for limiting description of UCB rule for Gaussian multi-armed bandits. (arXiv:2112.06423v2 [cs.LG] UPDATED)
    We consider the upper confidence bound strategy for Gaussian multi-armed bandits with known control horizon size $N$ and build its limiting description with a system of stochastic differential equations and ordinary differential equations. Rewards for the arms are assumed to have unknown expected values and known variances. A set of Monte-Carlo simulations was performed for the case of close reward distributions, where mean rewards differ by a magnitude of order $N^{-1/2}$, as this yields the highest normalized regret, to verify the validity of the obtained description. We also estimated the minimal control horizon size for which the normalized regret is not noticeably larger than its maximum possible value.
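    For reference, a standard UCB index for Gaussian arms with known variance takes the form sketched below; the exploration constant and the $N^{-1/2}$ mean gap are our choices for illustration, and the paper's limiting SDE/ODE description is not reproduced.

        import numpy as np

        def ucb_gaussian(means, sigma, N, a=2.0, seed=0):
            rng = np.random.default_rng(seed)
            K = len(means)
            counts = np.ones(K)
            sums = rng.normal(means, sigma)              # one initial pull per arm
            regret = np.max(means) * K - means.sum()     # pseudo-regret of those pulls
            for t in range(K, N):
                idx = sums / counts + sigma * np.sqrt(a * np.log(t) / counts)
                k = int(np.argmax(idx))
                sums[k] += rng.normal(means[k], sigma)
                counts[k] += 1
                regret += np.max(means) - means[k]
            return regret

        N = 10_000
        # "Close" distributions: mean gap of order N^{-1/2}, the hardest regime.
        print(ucb_gaussian(np.array([0.0, 2 / np.sqrt(N)]), sigma=1.0, N=N))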
    Sibylvariant Transformations for Robust Text Classification. (arXiv:2205.05137v1 [cs.CL])
    The vast majority of text transformation techniques in NLP are inherently limited in their ability to expand input space coverage due to an implicit constraint to preserve the original class label. In this work, we propose the notion of sibylvariance (SIB) to describe the broader set of transforms that relax the label-preserving constraint, knowably vary the expected class, and lead to significantly more diverse input distributions. We offer a unified framework to organize all data transformations, including two types of SIB: (1) Transmutations convert one discrete kind into another, (2) Mixture Mutations blend two or more classes together. To explore the role of sibylvariance within NLP, we implemented 41 text transformations, including several novel techniques like Concept2Sentence and SentMix. Sibylvariance also enables a unique form of adaptive training that generates new input mixtures for the most confused class pairs, challenging the learner to differentiate with greater nuance. Our experiments on six benchmark datasets strongly support the efficacy of sibylvariance for generalization performance, defect detection, and adversarial robustness.
    DNA data storage, sequencing data-carrying DNA. (arXiv:2205.05488v1 [cs.ET])
    DNA is a leading candidate as the next archival storage medium due to its density, durability and sustainability. To read (and write) data, DNA storage exploits technology that has been developed over decades to sequence naturally occurring DNA in the life sciences. To achieve higher accuracy for previously unseen, biological DNA, sequencing relies on extending and training deep machine learning models known as basecallers. This growth in model complexity requires substantial resources, both computational and data sets. It also eliminates the possibility of a compact read head for DNA as a storage medium. We argue that we need to depart from blindly using sequencing models from the life sciences for DNA data storage. The difference is striking: for life science applications we have no control over the DNA; however, in the case of DNA data storage, we control how it is written, as well as the particular write head. More specifically, data-carrying DNA can be modulated and embedded with alignment markers and error correcting codes to guarantee higher fidelity and to carry out some of the work that the machine learning models perform. In this paper, we study accuracy trade-offs between deep model size and error correcting codes. We show that, starting with a model size of 107MB, the reduced accuracy from model compression can be compensated by using simple error correcting codes in the DNA sequences. In our experiments, we show that a substantial reduction in the size of the model does not incur an undue penalty for the error correcting codes used, therefore paving the way for a portable data-carrying DNA read head. Crucially, we show that through the joint use of model compression and error correcting codes, we achieve a higher read accuracy than without compression and error correction codes.
    A Unified f-divergence Framework Generalizing VAE and GAN. (arXiv:2205.05214v1 [stat.ML])
    Developing deep generative models that flexibly incorporate diverse measures of probability distance is an important area of research. Here we develop a unified mathematical framework of f-divergence generative models, f-GM, that incorporates both VAE and f-GAN, and enables tractable learning with general f-divergences. f-GM allows the experimenter to flexibly design the f-divergence function without changing the structure of the networks or the learning procedure. f-GM jointly models three components: a generator, an inference network and a density estimator. Therefore it simultaneously enables sampling, posterior inference of the latent variable as well as evaluation of the likelihood of an arbitrary datum. f-GM belongs to the class of encoder-decoder GANs: our density estimator can be interpreted as playing the role of a discriminator between samples in the joint space of latent code and observed space. We prove that f-GM naturally simplifies to the standard VAE and to f-GAN as special cases, and illustrate the connections between different encoder-decoder GAN architectures. f-GM is compatible with general network architecture and optimizer. We leverage it to experimentally explore the effects -- e.g. mode collapse and image sharpness -- of different choices of f-divergence.
    REIN-2: Giving Birth to Prepared Reinforcement Learning Agents Using Reinforcement Learning Agents. (arXiv:2110.05128v2 [cs.LG] UPDATED)
    Deep Reinforcement Learning (Deep RL) has been in the spotlight for the past few years, due to its remarkable abilities to solve problems which were considered to be practically unsolvable using traditional Machine Learning methods. However, even state-of-the-art Deep RL algorithms have various weaknesses that prevent them from being used extensively within industry applications, with one such major weakness being their sample-inefficiency. In an effort to patch these issues, we integrated a meta-learning technique in order to shift the objective of learning to solve a task into the objective of learning how to learn to solve a task (or a set of tasks), which we empirically show improves the overall stability and performance of Deep RL algorithms. Our model, named REIN-2, is a meta-learning scheme formulated within the RL framework, the goal of which is to develop a meta-RL agent (meta-learner) that learns how to produce other RL agents (inner-learners) that are capable of solving given environments. For this task, we convert the typical interaction of an RL agent with the environment into a new, single environment for the meta-learner to interact with. Compared to traditional state-of-the-art Deep RL algorithms, experimental results show remarkable performance of our model in popular OpenAI Gym environments in terms of scoring and sample efficiency, including the Mountain Car hard-exploration environment.
    Advanced sleep spindle identification with neural networks. (arXiv:2202.05158v2 [eess.SP] UPDATED)
    Sleep spindles are neurophysiological phenomena that appear to be linked to memory formation and other functions of the central nervous system, and that can be observed in electroencephalographic recordings (EEG) during sleep. Manually identified spindle annotations in EEG recordings suffer from substantial intra- and inter-rater variability, even if raters have been highly trained, which reduces the reliability of spindle measures as a research and diagnostic tool. The Massive Online Data Annotation (MODA) project has recently addressed this problem by forming a consensus from multiple such rating experts, thus providing a corpus of spindle annotations of enhanced quality. Based on this dataset, we present a U-Net-type deep neural network model to automatically detect sleep spindles. Our model's performance exceeds that of the state-of-the-art detector and of most experts in the MODA dataset. We observed improved detection accuracy in subjects of all ages, including older individuals whose spindles are particularly challenging to detect reliably. Our results underline the potential of automated methods to do repetitive cumbersome tasks with super-human performance.
    Extracting Latent Steering Vectors from Pretrained Language Models. (arXiv:2205.05124v1 [cs.CL])
    Prior work on controllable text generation has focused on learning how to control language models through trainable decoding, smart-prompt design, or fine-tuning based on a desired objective. We hypothesize that the information needed to steer the model to generate a target sentence is already encoded within the model. Accordingly, we explore a different approach altogether: extracting latent vectors directly from pretrained language model decoders without fine-tuning. Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly (> 99 BLEU) for English sentences from a variety of domains. We show that vector arithmetic can be used for unsupervised sentiment transfer on the Yelp sentiment benchmark, with performance comparable to models tailored to this task. We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark (STS-B), outperforming pooled hidden states of models. Finally, we present an analysis of the intrinsic properties of the steering vectors. Taken together, our results suggest that frozen LMs can be effectively controlled through their latent steering space.
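    Mechanically, applying a steering vector amounts to adding a fixed vector to a decoder layer's hidden states during generation. The sketch below shows a generic PyTorch forward hook for this; the module path, the saved-vector file, and the choice of layer are assumptions, and the paper's procedure for extracting the vectors is not shown.

        import torch

        def add_steering_hook(layer, z):
            """Register a hook that adds vector z to the layer's hidden-state output."""
            def hook(module, inputs, output):
                if isinstance(output, tuple):          # many decoder layers return tuples
                    return (output[0] + z,) + output[1:]
                return output + z
            return layer.register_forward_hook(hook)

        # Usage sketch with an assumed causal LM:
        #   z = torch.load("steering_vector.pt")       # vector found by optimization
        #   handle = add_steering_hook(model.transformer.h[6], z)
        #   out = model.generate(input_ids)            # steered generation
        #   handle.remove()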
    Evaluation Gaps in Machine Learning Practice. (arXiv:2205.05256v1 [cs.LG])
    Forming a reliable judgement of a machine learning (ML) model's appropriateness for an application ecosystem is critical for its responsible use, and requires considering a broad range of factors including harms, benefits, and responsibilities. In practice, however, evaluations of ML models frequently focus on only a narrow range of decontextualized predictive behaviours. We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations. Through an empirical study of papers from recent high-profile conferences in the Computer Vision and Natural Language Processing communities, we demonstrate a general focus on a handful of evaluation methods. By considering the metrics and test data distributions used in these methods, we draw attention to which properties of models are centered in the field, revealing the properties that are frequently neglected or sidelined during evaluation. By studying these properties, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the limited role of model inputs in evaluation, and the equivalence of different failure modes. Shedding light on these assumptions enables us to question their appropriateness for ML system contexts, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.
    Internet of Behavior (IoB) and Explainable AI Systems for Influencing IoT Behavior. (arXiv:2109.07239v2 [cs.DC] UPDATED)
    Pandemics and natural disasters over the years have changed the behavior of people, which has had a tremendous impact on all aspects of life. With the technologies available in each era, governments, organizations, and companies have used these technologies to track, control, and influence the behavior of individuals to their benefit. Nowadays, the use of the Internet of Things (IoT), cloud computing, and artificial intelligence (AI) has made it easier to track and change the behavior of users through changing IoT behavior. This article introduces and discusses the concept of the Internet of Behavior (IoB) and its integration with Explainable AI (XAI) techniques to provide a trusted and evident experience in the process of changing IoT behavior to ultimately improve users' behavior. Therefore, a system based on IoB and XAI has been proposed in a use case scenario of electrical power consumption that aims to influence user consumption behavior to reduce power consumption and cost. The scenario results showed a decrease of 522.2 kW of active power when compared to the original consumption over a 200-hour period. It also showed a total power cost saving of 95.04 euros for the same period. Moreover, since global active power and power intensity are positively correlated, decreasing the former will also reduce the latter.
    Is calibration a fairness requirement? An argument from the point of view of moral philosophy and decision theory. (arXiv:2205.05512v1 [cs.LG])
    In this paper, we provide a moral analysis of two criteria of statistical fairness debated in the machine learning literature: 1) calibration between groups and 2) equality of false positive and false negative rates between groups. In our paper, we focus on moral arguments in support of either measure. The conflict between group calibration vs. false positive and false negative rate equality is one of the core issues in the debate about group fairness definitions among practitioners. For any thorough moral analysis, the meaning of the term fairness has to be made explicit and defined properly. For our paper, we equate fairness with (non-)discrimination, which is a legitimate understanding in the discussion about group fairness. More specifically, we equate it with prima facie wrongful discrimination in the sense in which this term is used in Prof. Lippert-Rasmussen's treatment of this definition. In this paper, we argue that a violation of group calibration may be unfair in some cases, but not unfair in others. This is in line with claims already advanced in the literature, that algorithmic fairness should be defined in a way that is sensitive to context. The most important practical implication is that arguments based on examples in which fairness requires between-group calibration, or equality in the false-positive/false-negative rates, do not generalize. For it may be that group calibration is a fairness requirement in one case, but not in another.
    RLOP: RL Methods in Option Pricing from a Mathematical Perspective. (arXiv:2205.05600v1 [q-fin.PR])
    In this work, we build two environments, namely the modified QLBS and RLOP models, from a mathematical perspective, which enables RL methods in option pricing through portfolio replication. We implement the environment specifications (the source code can be found at https://github.com/owen8877/RLOP), the learning algorithm, and agent parametrization by a neural network. The learned optimal hedging strategy is compared against the BS prediction. The effect of various factors is considered and studied based on how they affect the optimal price and position.
    Stochastic Variational Smoothed Model Checking. (arXiv:2205.05398v1 [cs.LG])
    Model-checking for parametric stochastic models can be expressed as checking the satisfaction probability of a certain property as a function of the parameters of the model. Smoothed model checking (smMC) leverages Gaussian Processes (GP) to infer the satisfaction function over the entire parameter space from a limited set of observations obtained via simulation. This approach provides accurate reconstructions with statistically sound quantification of the uncertainty. However, it inherits the scalability issues of GP. In this paper, we exploit recent advances in probabilistic machine learning to push past this limitation, making Bayesian inference of smMC scalable to larger datasets and enabling its application to larger models in terms of the dimension of the parameter set. We propose Stochastic Variational Smoothed Model Checking (SV-smMC), a solution that exploits stochastic variational inference (SVI) to approximate the posterior distribution of the smMC problem. The strength and flexibility of SVI make SV-smMC applicable to two alternative probabilistic models: Gaussian Processes (GP) and Bayesian Neural Networks (BNN). Moreover, SVI makes inference easily parallelizable and enables GPU acceleration. Finally, we compare the performance of smMC against that of SV-smMC by looking at scalability, computational efficiency, and the accuracy of the reconstructed satisfaction function.
    Federated Learning from Only Unlabeled Data with Class-Conditional-Sharing Clients. (arXiv:2204.03304v2 [cs.LG] UPDATED)
    Supervised federated learning (FL) enables multiple clients to share the trained model without sharing their labeled data. However, potential clients might even be reluctant to label their own data, which could limit the applicability of FL in practice. In this paper, we show the possibility of unsupervised FL whose model is still a classifier for predicting class labels, if the class-prior probabilities are shifted while the class-conditional distributions are shared among the unlabeled data owned by the clients. We propose federation of unsupervised learning (FedUL), where the unlabeled data are transformed into surrogate labeled data for each of the clients, a modified model is trained by supervised FL, and the wanted model is recovered from the modified model. FedUL is a very general solution to unsupervised FL: it is compatible with many supervised FL methods, and the recovery of the wanted model can be theoretically guaranteed as if the data have been labeled. Experiments on benchmark and real-world datasets demonstrate the effectiveness of FedUL. Code is available at https://github.com/lunanbit/FedUL.
    Analysis of three dimensional potential problems in non-homogeneous media with physics-informed deep collocation method using material transfer learning and sensitivity analysis. (arXiv:2010.12060v2 [cs.LG] UPDATED)
    In this work, we present a deep collocation method for three-dimensional potential problems in nonhomogeneous media. This approach utilizes a physics-informed neural network with material transfer learning, reducing the solution of the nonhomogeneous partial differential equations to an optimization problem. We tested different configurations of the physics-informed neural network, including smooth activation functions, sampling methods for collocation-point generation, and combined optimizers. A material transfer learning technique is utilized for nonhomogeneous media with different material gradations and parameters, which enhances the generality and robustness of the proposed method. In order to identify the most influential parameters of the network configuration, we carried out a global sensitivity analysis. Finally, we provide a convergence proof of our deep collocation method (DCM). The approach is validated through several benchmark problems, also testing different material variations.
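    As a minimal illustration of the collocation idea (our own 1D toy problem, not the paper's 3D setting): solve -(k(x) u')' = f on (0,1) with zero boundary values for a graded material k(x) = 1 + x, by minimizing the squared PDE residual at random collocation points.

        import torch

        net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                                  torch.nn.Linear(32, 32), torch.nn.Tanh(),
                                  torch.nn.Linear(32, 1))
        u = lambda x: x * (1 - x) * net(x)       # hard-codes the boundary conditions
        k = lambda x: 1 + x                      # material gradation
        f = lambda x: torch.ones_like(x)         # source term

        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        for step in range(2000):
            x = torch.rand(256, 1, requires_grad=True)      # collocation points
            ux = torch.autograd.grad(u(x).sum(), x, create_graph=True)[0]
            flux = k(x) * ux
            div = torch.autograd.grad(flux.sum(), x, create_graph=True)[0]
            loss = ((-div - f(x)) ** 2).mean()              # PDE residual
            opt.zero_grad(); loss.backward(); opt.step()
        print(loss.item())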
    Blockchain-based Secure Client Selection in Federated Learning. (arXiv:2205.05611v1 [cs.CR])
    Despite the great potential of Federated Learning (FL) in large-scale distributed learning, the current system is still subject to several privacy issues due to the fact that local models trained by clients are exposed to the central server. Consequently, secure aggregation protocols for FL have been developed to conceal the local models from the server. However, we show that, by manipulating the client selection process, the server can circumvent the secure aggregation to learn the local models of a victim client, indicating that secure aggregation alone is inadequate for privacy protection. To tackle this issue, we leverage blockchain technology to propose a verifiable client selection protocol. Owing to the immutability and transparency of blockchain, our proposed protocol enforces a random selection of clients, making the server unable to control the selection process at its discretion. We present security proofs showing that our protocol is secure against this attack. Additionally, we conduct several experiments on an Ethereum-like blockchain to demonstrate the feasibility and practicality of our solution.
    Characterizing the Action-Generalization Gap in Deep Q-Learning. (arXiv:2205.05588v1 [cs.AI])
    We study the action generalization ability of deep Q-learning in discrete action spaces. Generalization is crucial for efficient reinforcement learning (RL) because it allows agents to use knowledge learned from past experiences on new tasks. But while function approximation provides deep RL agents with a natural way to generalize over state inputs, the same generalization mechanism does not apply to discrete action outputs. And yet, surprisingly, our experiments indicate that Deep Q-Networks (DQN), which use exactly this type of function approximator, are still able to achieve modest action generalization. Our main contribution is twofold: first, we propose a method of evaluating action generalization using expert knowledge of action similarity, and empirically confirm that action generalization leads to faster learning; second, we characterize the action-generalization gap (the difference in learning performance between DQN and the expert) in different domains. We find that DQN can indeed generalize over actions in several simple domains, but that its ability to do so decreases as the action space grows larger.
    Weak Supervision with Incremental Source Accuracy Estimation. (arXiv:2205.05302v1 [cs.LG])
    Motivated by the desire to generate labels for real-time data we develop a method to estimate the dependency structure and accuracy of weak supervision sources incrementally. Our method first estimates the dependency structure associated with the supervision sources and then uses this to iteratively update the estimated source accuracies as new data is received. Using both off-the-shelf classification models trained using publicly-available datasets and heuristic functions as supervision sources we show that our method generates probabilistic labels with an accuracy matching that of existing off-line methods.
    Aggregating Pairwise Semantic Differences for Few-Shot Claim Veracity Classification. (arXiv:2205.05646v1 [cs.CL])
    As part of an automated fact-checking pipeline, the claim veracity classification task consists in determining if a claim is supported by an associated piece of evidence. The complexity of gathering labelled claim-evidence pairs leads to a scarcity of datasets, particularly when dealing with new domains. In this paper, we introduce SEED, a novel vector-based method to few-shot claim veracity classification that aggregates pairwise semantic differences for claim-evidence pairs. We build on the hypothesis that we can simulate class representative vectors that capture average semantic differences for claim-evidence pairs in a class, which can then be used for classification of new instances. We compare the performance of our method with competitive baselines including fine-tuned BERT/RoBERTa models, as well as the state-of-the-art few-shot veracity classification method that leverages language model perplexity. Experiments conducted on the FEVER and SCIFACT datasets show consistent improvements over competitive baselines in few-shot settings. Our code is available.
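    Under our reading of the abstract, the method can be sketched as follows: each class representative is the average claim-minus-evidence embedding difference over that class's few shots, and a new pair is assigned to the class with the nearest representative. The embeddings would come from a sentence encoder; everything below is illustrative.

        import numpy as np

        def fit_seed(claim_emb, evid_emb, labels):
            """Average pairwise semantic difference per class."""
            diff = claim_emb - evid_emb
            classes = np.unique(labels)
            return classes, np.stack([diff[labels == c].mean(axis=0) for c in classes])

        def predict_seed(classes, reps, claim_emb, evid_emb):
            diff = claim_emb - evid_emb                             # (n, d)
            dists = np.linalg.norm(diff[:, None, :] - reps[None], axis=-1)
            return classes[dists.argmin(axis=1)]                    # nearest representative

        # claim_emb / evid_emb would come from a sentence encoder (e.g. BERT pooling).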
    Generation of non-stationary stochastic fields using Generative Adversarial Networks with limited training data. (arXiv:2205.05469v1 [cs.LG])
    In the context of generating geological facies conditioned on observed data, samples corresponding to all possible conditions are not generally available in the training set, and hence the generation of these realizations depends primarily on the generalization capability of the trained generative model. The problem becomes more complex when applied to non-stationary fields. In this work, we investigate the problem of training Generative Adversarial Network (GAN) models on a dataset of geological channelized patterns that has a few non-stationary spatial modes, and we examine the training and self-conditioning settings that improve the generalization capability at new spatial modes never seen in the given training set. The developed training method allowed for effective learning of the correlation between the spatial conditions (i.e. non-stationary maps) and the realizations, implicitly, without using additional loss terms or solving a costly optimization problem at the realization generation phase. Our models, trained on real and artificial datasets, were able to generate geologically-plausible realizations beyond the training samples with a strong correlation with the target maps.
    Contrastive Supervised Distillation for Continual Representation Learning. (arXiv:2205.05476v1 [cs.CV])
    In this paper, we propose a novel training procedure for the continual representation learning problem in which a neural network model is sequentially learned to alleviate catastrophic forgetting in visual search tasks. Our method, called Contrastive Supervised Distillation (CSD), reduces feature forgetting while learning discriminative features. This is achieved by leveraging labels information in a distillation setting in which the student model is contrastively learned from the teacher model. Extensive experiments show that CSD performs favorably in mitigating catastrophic forgetting by outperforming current state-of-the-art methods. Our results also provide further evidence that feature forgetting evaluated in visual retrieval tasks is not as catastrophic as in classification tasks. Code at: https://github.com/NiccoBiondi/ContrastiveSupervisedDistillation.
    Predicting hot electrons free energies from ground-state data. (arXiv:2205.05591v1 [cond-mat.mtrl-sci])
    Machine-learning potentials are usually trained on the ground-state, Born-Oppenheimer energy surface, which depends exclusively on the atomic positions and not on the simulation temperature. This disregards the effect of thermally-excited electrons, that is important in metals, and essential to the description of warm dense matter. An accurate physical description of these effects requires that the nuclei move on a temperature-dependent electronic free energy. We propose a method to obtain machine-learning predictions of this free energy at an arbitrary electron temperature using exclusively training data from ground-state calculations, avoiding the need to train temperature-dependent potentials. We benchmark our method on metallic liquid hydrogen at the conditions of the core of gas giants and brown dwarfs.
    Delayed Reinforcement Learning by Imitation. (arXiv:2205.05569v1 [cs.LG])
    When the agent's observations or interactions are delayed, classic reinforcement learning tools usually fail. In this paper, we propose a simple yet new and efficient solution to this problem. We assume that, in the undelayed environment, an efficient policy is known or can be easily learned, but the task may suffer from delays in practice and we thus want to take them into account. We present a novel algorithm, Delayed Imitation with Dataset Aggregation (DIDA), which builds upon imitation learning methods to learn how to act in a delayed environment from undelayed demonstrations. We provide a theoretical analysis of the approach that will guide the practical design of DIDA. These results are also of general interest in the delayed reinforcement learning literature by providing bounds on the performance between delayed and undelayed tasks, under smoothness conditions. We show empirically that DIDA obtains high performances with a remarkable sample efficiency on a variety of tasks, including robotic locomotion, classic control, and trading.
    Efficient Distributed Framework for Collaborative Multi-Agent Reinforcement Learning. (arXiv:2205.05248v1 [cs.AI])
    Multi-agent reinforcement learning for incomplete-information environments has attracted extensive attention from researchers. However, due to slow sample collection and poor sample exploration, there are still some problems in multi-agent reinforcement learning, such as unstable model iteration and low training efficiency. Moreover, most of the existing distributed frameworks were proposed for single-agent reinforcement learning and are not suitable for the multi-agent setting. In this paper, we design a distributed MARL framework based on the actor-worker-learner architecture. In this framework, multiple asynchronous environment interaction modules can be deployed simultaneously, which greatly improves the sample collection speed and sample diversity. Meanwhile, to make full use of computing resources, we decouple the model iteration from environment interaction, and thus accelerate the policy iteration. Finally, we verified the effectiveness of the proposed framework in the MaCA military simulation environment and the SMAC 3D real-time strategy gaming environment, both of which have incomplete-information characteristics.  ( 2 min )
    What is Proxy Discrimination?. (arXiv:2205.05265v1 [cs.LG])
    The near universal condemnation of proxy discrimination hides a disagreement over what it is. This work surveys various notions of proxy and proxy discrimination found in prior work and represents them in a common framework. These notions variously turn on statistical dependencies, causal effects, and intentions. It discusses the limitations and uses of each notion and of the concept as a whole.
    Access Trends of In-network Cache for Scientific Data. (arXiv:2205.05563v1 [cs.NI])
    Scientific collaborations are increasingly relying on large volumes of data for their work and many of them employ tiered systems to replicate the data to their worldwide user communities. Each user in the community often selects a different subset of data for their analysis tasks; however, members of a research group often are working on related research topics that require similar data objects. Thus, there is a significant amount of data sharing possible. In this work, we study the access traces of a federated storage cache known as the Southern California Petabyte Scale Cache. By studying the access patterns and potential for network traffic reduction by this caching system, we aim to explore the predictability of the cache uses and the potential for a more general in-network data caching. Our study shows that this distributed storage cache is able to reduce the network traffic volume by a factor of 2.35 during a part of the study period. We further show that machine learning models could predict cache utilization with an accuracy of 0.88. This demonstrates that such cache usage is predictable, which could be useful for managing complex networking resources such as in-network caching.  ( 2 min )
    NDGGNET-A Node Independent Gate based Graph Neural Networks. (arXiv:2205.05348v1 [cs.LG])
    Graph Neural Networks (GNNs) are an architecture for structured data that has been adopted in a mass of tasks and achieved excellent results, such as link prediction, node classification, graph classification and so on. Generally, for a certain node in a given graph, a traditional GNN layer can be regarded as an aggregation from one-hop neighbors, and thus a set of stacked layers can fetch and update node status within multiple hops. For nodes with sparse connectivity, it is difficult to obtain enough information through a single GNN layer, as not only are there few nodes directly connected to them, but high-order neighbor information also cannot be propagated. However, as the number of layers increases, the GNN model is prone to over-smoothing for nodes with dense connectivity, resulting in decreased accuracy. To tackle this issue, in this work we define a novel framework that allows a normal GNN model to accommodate more layers. Specifically, a node-degree-based gate is employed to dynamically adjust the weight of each layer, aiming to enhance the information aggregation ability and reduce the probability of over-smoothing. Experimental results show that our proposed model can effectively increase model depth and performs well on several datasets.  ( 2 min )
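    As a rough illustration of the gating idea (the gate function below is a plausible stand-in, not the paper's exact formulation), a node-degree-dependent gate can interpolate between a node's previous state and the aggregated neighbor message, damping over-smoothing for densely connected nodes:
```python
import numpy as np

def degree_gated_layer(H, A, W):
    """One GNN layer with a hypothetical node-degree-based gate.

    H: (n, d) node features, A: (n, n) adjacency, W: (d, d) weights.
    High-degree nodes get a small gate and keep more of their own state;
    low-degree nodes take more of the aggregated neighbor message.
    """
    deg = A.sum(axis=1, keepdims=True)        # node degrees, shape (n, 1)
    gate = 1.0 / (1.0 + np.log1p(deg))        # in (0, 1], shrinks with degree
    msg = (A @ H) / np.maximum(deg, 1.0)      # mean aggregation over neighbors
    return gate * np.tanh(msg @ W) + (1.0 - gate) * H
```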
    Automated differential equation solver based on the parametric approximation optimization. (arXiv:2205.05383v1 [math.NA])
    Numerical methods for differential equations yield a discrete field that converges towards the solution, provided the method is applied to an appropriate problem. Nevertheless, each numerical method is restricted to the class of equations for which convergence with a given parameter set or range has been proved. Only a few "cheap and dirty" numerical methods converge on a wide class of equations without parameter tuning, at the price of a lower approximation order. The article presents a method that uses an optimization algorithm to obtain a solution via a parameterized approximation. The result may not be as precise as an expert one. However, it allows solving a wide class of equations in an automated manner, without changing the algorithm's parameters.  ( 2 min )
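    A toy instance of the approach, with a polynomial ansatz and a generic optimizer as illustrative choices: the equation u'(t) = -u(t) with u(0) = 1 is solved by minimizing the squared residual of a parameterized approximation, with no method-specific parameter tuning.
```python
import numpy as np
from scipy.optimize import minimize

t = np.linspace(0.0, 2.0, 64)

def ansatz(c, t):
    # u(t) = 1 + t * poly(t; c) satisfies the initial condition by construction
    return 1.0 + t * np.polyval(c, t)

def residual(c):
    u = ansatz(c, t)
    du = np.gradient(u, t)           # finite-difference derivative of the ansatz
    return np.mean((du + u) ** 2)    # squared residual of u' + u = 0

sol = minimize(residual, np.zeros(5), method="BFGS")
print(np.max(np.abs(ansatz(sol.x, t) - np.exp(-t))))  # error vs. exact solution
```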
  • Open

    Imitation of Manipulation Skills Using Multiple Geometries. (arXiv:2203.01171v2 [cs.RO] UPDATED)
    Daily manipulation tasks are characterized by regular features associated with the task structure, which can be described by multiple geometric primitives related to actions and object shapes. Such geometric descriptors cannot be adequately expressed in Cartesian coordinate systems alone. In this paper, we propose a learning approach to extract the optimal representation from a dictionary of coordinate systems to represent an observed movement. This is achieved by using an extension of Gaussian distributions on Riemannian manifolds, which is used to analyse a set of user demonstrations statistically, by considering multiple geometries as candidate representations of the task. We formulate the reproduction problem as a general optimal control problem based on an iterative linear quadratic regulator (iLQR), where the Gaussian distributions in the extracted coordinate systems are used to define the cost function. We apply our approach to grasping and box-opening tasks in simulation and on a 7-axis Franka Emika robot. The results show that the robot can exploit several geometries to execute the manipulation task and generalize it to new situations, by maintaining the invariant features of the skill in the coordinate system(s) of interest.
    Boundary Estimation from Point Clouds: Algorithms, Guarantees and Applications. (arXiv:2111.03217v2 [math.NA] UPDATED)
    We investigate identifying the boundary of a domain from sample points in the domain. We introduce new estimators for the normal vector to the boundary, the distance of a point to the boundary, and a test for whether a point lies within a boundary strip. The estimators can be efficiently computed and are more accurate than those present in the literature. We provide rigorous error estimates for the estimators. Furthermore, we use the detected boundary points to solve boundary-value problems for PDEs on point clouds. We prove error estimates for the Laplace and eikonal equations on point clouds. Finally, we provide a range of numerical experiments illustrating the performance of our boundary estimators, applications to PDEs on point clouds, and tests on image data sets.
    A Model-Free Sampling Method for Estimating Basins of Attraction Using Hybrid Active Learning (HAL). (arXiv:2003.10976v3 [cs.LG] UPDATED)
    Understanding the basins of attraction (BoA) is often a paramount consideration for nonlinear systems. Most existing approaches to determining a high-resolution BoA require prior knowledge of the system's dynamical model (e.g., differential equation or point mapping for continuous systems, cell mapping for discrete systems, etc.), which allows derivation of approximate analytical solutions or parallel computing on a multi-core computer to find the BoA efficiently. However, these methods are typically impractical when the BoA must be determined experimentally or when the system's model is unknown. This paper introduces a model-free sampling method for BoA. The proposed method is based upon hybrid active learning (HAL) and is designed to find and label the "informative" samples, which efficiently determine the boundary of BoA. It consists of three primary parts: 1) additional sampling on trajectories (AST) to maximize the number of samples obtained from each simulation or experiment; 2) an active learning (AL) algorithm to exploit the local boundary of BoA; and 3) a density-based sampling (DBS) method to explore the global boundary of BoA. An example of estimating the BoA for a bistable nonlinear system is presented to show the high efficiency of our HAL sampling method.
    RvS: What is Essential for Offline RL via Supervised Learning?. (arXiv:2112.10751v2 [cs.LG] UPDATED)
    Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL. When does this hold true, and which algorithmic components are necessary? Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex methods based on TD learning or sequence modeling with Transformers. Carefully choosing model capacity (e.g., via regularization or architecture) and choosing which information to condition on (e.g., goals or rewards) are critical for performance. These insights serve as a field guide for practitioners doing Reinforcement Learning via Supervised Learning (which we coin "RvS learning"). They also probe the limits of existing RvS methods, which are comparatively weak on random data, and suggest a number of open problems.  ( 2 min )
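    A minimal sketch of the recipe, with the layer sizes and the unit-variance Gaussian likelihood (i.e., mean-squared error) as assumptions: a small feedforward policy is conditioned on an outcome such as a goal or a return, and fit to the dataset's actions by maximum likelihood.
```python
import torch
import torch.nn as nn

class RvSPolicy(nn.Module):
    """Small MLP policy conditioned on an outcome (goal or reward)."""

    def __init__(self, state_dim, outcome_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + outcome_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, outcome):
        return self.net(torch.cat([state, outcome], dim=-1))

def train_step(policy, opt, states, outcomes, actions):
    # Unit-variance Gaussian likelihood reduces to mean-squared error.
    loss = ((policy(states, outcomes) - actions) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```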
    Towards Model Agnostic Federated Learning Using Knowledge Distillation. (arXiv:2110.15210v2 [cs.LG] UPDATED)
    Is it possible to design a universal API for federated learning with which an ad-hoc group of data-holders (agents) can collaborate with each other and perform federated learning? Such an API would necessarily need to be model-agnostic, i.e., make no assumptions about the model architecture being used by the agents, and also cannot rely on having representative public data at hand. Knowledge distillation (KD) is the obvious tool of choice to design such protocols. However, surprisingly, we show that most natural KD-based federated learning protocols have poor performance. To investigate this, we propose a new theoretical framework, Federated Kernel ridge regression, which can capture both model heterogeneity as well as data heterogeneity. Our analysis shows that the degradation is largely due to a fundamental limitation of knowledge distillation under data heterogeneity. We further validate our framework by analyzing and designing new protocols based on KD. Their performance on real-world experiments using neural networks, though still unsatisfactory, closely matches our theoretical predictions.  ( 2 min )
    Benchmarking Graph Neural Networks. (arXiv:2003.00982v4 [cs.LG] UPDATED)
    In the last few years, graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. This emerging field has witnessed an extensive growth of promising techniques that have been applied with success to computer science, mathematics, biology, physics and chemistry. But for any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of May 2022, the GitHub repository has reached 1,800 stars and 339 forks, which demonstrates the utility of the proposed open-source framework through its wide usage by the GNN community. In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest in exploring more powerful PE for Transformers and GNNs in a robust experimental setting.  ( 3 min )
    Stochastic Variational Smoothed Model Checking. (arXiv:2205.05398v1 [cs.LG])
    Model checking for parametric stochastic models can be expressed as checking the satisfaction probability of a certain property as a function of the parameters of the model. Smoothed model checking (smMC) leverages Gaussian Processes (GP) to infer the satisfaction function over the entire parameter space from a limited set of observations obtained via simulation. This approach provides accurate reconstructions with statistically sound quantification of the uncertainty. However, it inherits the scalability issues of GPs. In this paper, we exploit recent advances in probabilistic machine learning to push past this limitation, making Bayesian inference for smMC scalable to larger datasets and enabling its application to larger models in terms of the dimension of the parameter set. We propose Stochastic Variational Smoothed Model Checking (SV-smMC), a solution that exploits stochastic variational inference (SVI) to approximate the posterior distribution of the smMC problem. The strength and flexibility of SVI make SV-smMC applicable to two alternative probabilistic models: Gaussian Processes (GP) and Bayesian Neural Networks (BNN). Moreover, SVI makes inference easily parallelizable and enables GPU acceleration. Finally, we compare the performance of smMC against that of SV-smMC in terms of scalability, computational efficiency, and the accuracy of the reconstructed satisfaction function.  ( 2 min )
    Externally Valid Treatment Choice. (arXiv:2205.05561v1 [econ.EM])
    We consider the problem of learning treatment (or policy) rules that are externally valid in the sense that they have welfare guarantees in target populations that are similar to, but possibly different from, the experimental population. We allow for shifts in both the distribution of potential outcomes and covariates between the experimental and target populations. This paper makes two main contributions. First, we provide a formal sense in which policies that maximize social welfare in the experimental population remain optimal for the "worst-case" social welfare when the distribution of potential outcomes (but not covariates) shifts. Hence, policy learning methods that have good regret guarantees in the experimental population, such as empirical welfare maximization, are externally valid with respect to a class of shifts in potential outcomes. Second, we develop methods for policy learning that are robust to shifts in the joint distribution of potential outcomes and covariates. Our methods may be used with experimental or observational data.  ( 2 min )
    AutoKE: An automatic knowledge embedding framework for scientific machine learning. (arXiv:2205.05390v1 [cs.LG])
    Imposing physical constraints on neural networks as a method of knowledge embedding has achieved great progress in solving physical problems described by governing equations. However, for many engineering problems, governing equations often have complex forms, including complex partial derivatives or stochastic physical fields, which results in significant inconveniences from the perspective of implementation. In this paper, a scientific machine learning framework, called AutoKE, is proposed, and a reservoir flow problem is taken as an instance to demonstrate that this framework can effectively automate the process of embedding physical knowledge. In AutoKE, an emulator composed of deep neural networks (DNNs) is built for predicting the physical variables of interest. An arbitrarily complex equation can be parsed and automatically converted into a computational graph through the equation parser module, and the fitness of the emulator to the governing equation is evaluated via automatic differentiation. Furthermore, the fixed weights in the loss function are replaced with adaptive weights by incorporating the Lagrangian dual method. Neural architecture search (NAS) is also introduced into AutoKE to select an optimal network architecture for the emulator according to the specific problem. Finally, we apply transfer learning to enhance the scalability of the emulator. In experiments, the framework is verified on a series of physical problems in which it can automatically embed physical knowledge into an emulator without heavy hand-coding. The results demonstrate that the emulator can not only make accurate predictions but can also be applied to similar problems with high efficiency via transfer learning.  ( 2 min )
    Probability Distribution of Hypervolume Improvement in Bi-objective Bayesian Optimization. (arXiv:2205.05505v1 [cs.LG])
    This work provides the exact expression of the probability distribution of the hypervolume improvement (HVI) for the bi-objective generalization of Bayesian optimization. Here, instead of a single-objective improvement, we consider the improvement of the hypervolume indicator with respect to the current best approximation of the Pareto front. Gaussian process regression models are trained independently on both objective functions, resulting in a bi-variate separated Gaussian distribution serving as a predictive model for the vector-valued objective function. Some common HVI-based acquisition functions (probability of improvement and upper confidence bound) are also leveraged with the help of the exact distribution of HVI. In addition, we show the superior numerical accuracy and efficiency of the exact distribution compared to the commonly used approximation by Monte-Carlo sampling. Finally, we benchmark the distribution-leveraged acquisition functions on the widely applied ZDT problem set, demonstrating a significant advantage of using the exact distribution of HVI in multi-objective Bayesian optimization.  ( 2 min )
    Exploring Local Explanations of Nonlinear Models Using Animated Linear Projections. (arXiv:2205.05359v1 [stat.ML])
    The increased predictive power of nonlinear models comes at the cost of interpretability of their terms. This trade-off has led to the emergence of eXplainable AI (XAI). XAI attempts to shed light on how models use predictors to arrive at a prediction with local explanations, a point estimate of the linear feature importance in the vicinity of one instance. These can be considered linear projections and can be further explored to better understand the interactions between features used to make predictions across the predictive model surface. Here we describe interactive linear interpolation used for exploration at any instance and illustrate with examples with categorical (penguin species, chocolate types) and quantitative (soccer/football salaries, house prices) output. The methods are implemented in the R package cheem, available on CRAN.  ( 2 min )
    Variational Autoencoder Leveraged MMSE Channel Estimation. (arXiv:2205.05345v1 [eess.SP])
    We propose to utilize a variational autoencoder (VAE) for data-driven channel estimation. The underlying true and unknown channel distribution is modeled by the VAE as a conditional Gaussian distribution in a novel way, parameterized by the respective first- and second-order conditional moments. As a result, it can be observed that the linear minimum mean square error (LMMSE) estimator, in its variant conditioned on the latent sample of the VAE, approximates an optimal MSE estimator. Furthermore, we argue how a VAE-based channel estimator can approximate the MMSE channel estimator. We propose three variants of VAE estimators that differ in the data used during training and estimation. First, we show that given perfectly known channel state information at the input of the VAE during estimation, which is impractical, we obtain an estimator that can serve as a benchmark result for an estimation scenario. We then propose practically feasible approaches, where perfectly known channel state information is only necessary in the training phase or is not needed at all. Simulation results on 3GPP and QuaDRiGa channel data attest to a small performance loss of the practical approaches and the superiority of our VAE approaches in comparison to other related channel estimation methods.  ( 2 min )
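    For intuition, the conditioned LMMSE step can be written as a standard Gaussian update using the VAE-predicted conditional moments; the additive-noise observation model y = h + n below is an illustrative assumption:
```python
import numpy as np

def lmmse_estimate(y, mu, C, sigma2):
    """LMMSE channel estimate from VAE-predicted conditional moments.

    y: noisy observation of the channel, y = h + n with n ~ N(0, sigma2 * I);
    mu, C: conditional mean and covariance of h predicted from the latent sample.
    """
    n = len(mu)
    gain = C @ np.linalg.inv(C + sigma2 * np.eye(n))
    return mu + gain @ (y - mu)
```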
    Learning Multitask Gaussian Bayesian Networks. (arXiv:2205.05343v1 [stat.ML])
    Major depressive disorder (MDD) requires the study of brain functional connectivity alterations in patients, which can be uncovered by resting-state functional magnetic resonance imaging (rs-fMRI) data. We consider the problem of identifying alterations of brain functional connectivity for a single MDD patient. This is particularly difficult since the amount of data collected during an fMRI scan is too limited to provide sufficient information for individual analysis. Additionally, rs-fMRI data usually have the characteristics of incompleteness, sparsity, variability, high dimensionality and high noise. To address these problems, we propose a multitask Gaussian Bayesian network (MTGBN) framework capable of identifying individual disease-induced alterations in MDD patients. We assume that such disease-induced alterations show some degree of similarity across patients, which provides a means to learn the network structures jointly from related tasks. First, we treat each patient in a class of observations as a task, and then learn the Gaussian Bayesian networks (GBNs) of this data class by learning from all tasks, which share a default covariance matrix that encodes prior knowledge. This setting helps us learn more information from limited data. Next, we derive a closed-form formula for the complete likelihood function and use the Monte-Carlo Expectation-Maximization (MCEM) algorithm to search efficiently for approximately optimal Bayesian network structures. Finally, we assess the performance of our methods on simulated and real-world rs-fMRI data.  ( 2 min )
    Analysis of convolutional neural network image classifiers in a rotationally symmetric model. (arXiv:2205.05500v1 [stat.ML])
    Convolutional neural network image classifiers are defined and the rate of convergence of the misclassification risk of the estimates towards the optimal misclassification risk is analyzed. Here we consider images as random variables with values in some functional space, where we only observe discrete samples as function values on some finite grid. Under suitable structural and smoothness assumptions on the functional a posteriori probability, which includes some kind of symmetry against rotation of subparts of the input image, it is shown that least squares plug-in classifiers based on convolutional neural networks are able to circumvent the curse of dimensionality in binary image classification if we neglect a resolution-dependent error term. The finite sample size behavior of the classifier is analyzed by applying it to simulated and real data.  ( 2 min )
    On Distributed Adaptive Optimization with Gradient Compression. (arXiv:2205.05632v1 [stat.ML])
    We study COMP-AMS, a distributed optimization framework based on gradient averaging and the adaptive AMSGrad algorithm. Gradient compression with error feedback is applied to reduce the communication cost of the gradient transmission process. Our convergence analysis of COMP-AMS shows that such a compressed gradient averaging strategy yields the same convergence rate as standard AMSGrad, and also exhibits a linear speedup effect w.r.t. the number of local workers. Compared with recently proposed protocols on distributed adaptive methods, COMP-AMS is simple and convenient. Numerical experiments are conducted to justify the theoretical findings and demonstrate that the proposed method can achieve the same test accuracy as full-gradient AMSGrad with substantial communication savings. With its simplicity and efficiency, COMP-AMS can serve as a useful distributed training framework for adaptive gradient methods.  ( 2 min )
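    A sketch of the local worker step in this kind of setup, with top-k sparsification as the compressor (one common choice, assumed here for illustration): the compression residual stays on the worker and is fed back into the next round's gradient, while the server averages the compressed gradients and applies a standard AMSGrad update.
```python
import numpy as np

def top_k(v, k):
    """Keep only the k largest-magnitude entries of v (sparsifying compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def worker_step(grad, error, k):
    """One round of gradient compression with error feedback on a single worker."""
    corrected = grad + error               # add residual carried from last round
    compressed = top_k(corrected, k)       # only k entries are communicated
    new_error = corrected - compressed     # residual stays local
    return compressed, new_error
```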
    Stable and Interpretable Unrolled Dictionary Learning. (arXiv:2106.00058v4 [cs.LG] UPDATED)
    The dictionary learning problem, representing data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and a rich literature has studied its theoretical convergence. The success of dictionary learning relies on access to a "good" initial estimate of the dictionary and the ability of the sparse coding step to provide an unbiased estimate of the code. The growing popularity of unrolled sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. We offer the first theoretical analysis of these empirical results through PUDLE, a Provable Unrolled Dictionary LEarning method. We provide conditions on the network initialization and data distribution sufficient to recover and preserve the support of the latent sparse representation. Additionally, we address two challenges; first, the vanilla unrolled sparse coding computes a biased code estimate, and second, gradients during backpropagated learning can become unstable. We show approaches to reduce the bias of the code estimate in the forward pass, and that of the dictionary estimate in the backward pass. We propose strategies to resolve the learning instability. This is achieved by tuning network parameters and modifying the loss function. Overall, we highlight the impact of loss, unrolling, and backpropagation on convergence. We complement our findings through synthetic and image denoising experiments. Finally, we demonstrate PUDLE's interpretability, a driving factor in designing deep networks based on iterative optimizations, by building a mathematical relation between network weights, its output, and the training set.  ( 2 min )
    A Framework for Machine Learning of Model Error in Dynamical Systems. (arXiv:2107.06658v2 [math.DS] UPDATED)
    The development of data-informed predictive models for dynamical systems is of widespread interest in many disciplines. We present a unifying framework for blending mechanistic and machine-learning approaches to identify dynamical systems from noisily and partially observed data. We compare pure data-driven learning with hybrid models which incorporate imperfect domain knowledge. Our formulation is agnostic to the chosen machine learning model, is presented in both continuous- and discrete-time settings, and is compatible both with model errors that exhibit substantial memory and errors that are memoryless. First, we study memoryless linear (w.r.t. parametric-dependence) model error from a learning theory perspective, defining excess risk and generalization error. For ergodic continuous-time systems, we prove that both excess risk and generalization error are bounded above by terms that diminish with the square-root of T, the time-interval over which training data is specified. Secondly, we study scenarios that benefit from modeling with memory, proving universal approximation theorems for two classes of continuous-time recurrent neural networks (RNNs): both can learn memory-dependent model error. In addition, we connect one class of RNNs to reservoir computing, thereby relating learning of memory-dependent error to recent work on supervised learning between Banach spaces using random features. Numerical results are presented (Lorenz '63, Lorenz '96 Multiscale systems) to compare purely data-driven and hybrid approaches, finding hybrid methods less data-hungry and more parametrically efficient. Finally, we demonstrate numerically how data assimilation can be leveraged to learn hidden dynamics from noisy, partially-observed data, and illustrate challenges in representing memory by this approach, and in the training of such models.  ( 2 min )
    AutoTransfer: Subject Transfer Learning with Censored Representations on Biosignals Data. (arXiv:2112.09796v2 [cs.LG] UPDATED)
    We provide a regularization framework for subject transfer learning in which we seek to train an encoder and classifier to minimize classification loss, subject to a penalty measuring independence between the latent representation and the subject label. We introduce three notions of independence and corresponding penalty terms using mutual information or divergence as a proxy for independence. For each penalty term, we provide several concrete estimation algorithms, using analytic methods as well as neural critic functions. We provide a hands-off strategy for applying this diverse family of regularization algorithms to a new dataset, which we call "AutoTransfer". We evaluate the performance of these individual regularization strategies and our AutoTransfer method on EEG, EMG, and ECoG datasets, showing that these approaches can improve subject transfer learning for challenging real-world datasets.  ( 2 min )
    Contextual Search in the Presence of Adversarial Corruptions. (arXiv:2002.11650v5 [cs.LG] UPDATED)
    We study contextual search, a generalization of binary search in higher dimensions, which captures settings such as feature-based dynamic pricing. Standard formulations of this problem assume that agents act in accordance with a specific homogeneous response model. In practice, however, some responses may be adversarially corrupted. Existing algorithms heavily depend on the assumed response model being (approximately) accurate for all agents and have poor performance in the presence of even a few such arbitrary misspecifications. We initiate the study of contextual search when some of the agents can behave in ways inconsistent with the underlying response model. In particular, we provide two algorithms, one based on multidimensional binary search methods and one based on gradient descent. We show that these algorithms attain near-optimal regret in the absence of adversarial corruptions and their performance degrades gracefully with the number of such agents, providing the first results for contextual search in any adversarial noise model. Our techniques draw inspiration from learning theory, game theory, high-dimensional geometry, and convex analysis.  ( 2 min )
    Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise. (arXiv:2102.04297v4 [cs.LG] UPDATED)
    The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, as sharp minima are known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks, and it was shown in Şimşekli (2019a,b) that SGD can escape sharp local minima in the presence of such heavy-tailed gradient noise, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noise. First, we show that the truncation threshold and the width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases, the dynamics of heavy-tailed truncated SGD closely resemble those of a continuous-time Markov chain that never visits any sharp minima. Real-data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds "flatter" local minima and achieves better generalization.  ( 2 min )
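    The analyzed variant can be sketched in a few lines; clipping the gradient norm at a fixed threshold b is one standard reading of "gradients truncated above a fixed threshold" and is assumed here:
```python
import numpy as np

def truncated_sgd_step(x, grad, lr, b):
    """One SGD step with the (possibly heavy-tailed) gradient truncated at norm b."""
    norm = np.linalg.norm(grad)
    if norm > b:
        grad = grad * (b / norm)   # truncation: rescale down to the threshold
    return x - lr * grad
```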
    DoubleMatch: Improving Semi-Supervised Learning with Self-Supervision. (arXiv:2205.05575v1 [cs.LG])
    Following the success of supervised learning, semi-supervised learning (SSL) is now becoming increasingly popular. SSL is a family of methods, which in addition to a labeled training set, also use a sizable collection of unlabeled data for fitting a model. Most of the recent successful SSL methods are based on pseudo-labeling approaches: letting confident model predictions act as training labels. While these methods have shown impressive results on many benchmark datasets, a drawback of this approach is that not all unlabeled data are used during training. We propose a new SSL algorithm, DoubleMatch, which combines the pseudo-labeling technique with a self-supervised loss, enabling the model to utilize all unlabeled data in the training process. We show that this method achieves state-of-the-art accuracies on multiple benchmark datasets while also reducing training times compared to existing SSL methods. Code is available at https://github.com/walline/doublematch.  ( 2 min )
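    A rough sketch of a DoubleMatch-style objective (the confidence threshold, the weighting, and the cosine-based self-supervised term are assumptions, not necessarily the paper's exact recipe): a confidence-masked pseudo-label loss is combined with a self-supervised loss that touches every unlabeled example, not just the confident ones.
```python
import torch
import torch.nn.functional as F

def doublematch_style_loss(logits_weak, logits_strong, feats_weak, feats_proj,
                           tau=0.95, w_self=1.0):
    """Pseudo-label loss on confident samples + self-supervision on all samples."""
    probs = logits_weak.softmax(dim=-1).detach()
    conf, pseudo = probs.max(dim=-1)
    mask = (conf >= tau).float()                      # only confident predictions
    ce = F.cross_entropy(logits_strong, pseudo, reduction="none")
    pseudo_loss = (mask * ce).mean()
    # self-supervised term uses every unlabeled example, masked or not
    self_loss = -F.cosine_similarity(feats_proj, feats_weak.detach()).mean()
    return pseudo_loss + w_self * self_loss
```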
    A Unified f-divergence Framework Generalizing VAE and GAN. (arXiv:2205.05214v1 [stat.ML])
    Developing deep generative models that flexibly incorporate diverse measures of probability distance is an important area of research. Here we develop a unified mathematical framework of f-divergence generative models, f-GM, that incorporates both VAE and f-GAN, and enables tractable learning with general f-divergences. f-GM allows the experimenter to flexibly design the f-divergence function without changing the structure of the networks or the learning procedure. f-GM jointly models three components: a generator, an inference network and a density estimator. Therefore it simultaneously enables sampling, posterior inference of the latent variable, and evaluation of the likelihood of an arbitrary datum. f-GM belongs to the class of encoder-decoder GANs: our density estimator can be interpreted as playing the role of a discriminator between samples in the joint space of latent code and observed space. We prove that f-GM naturally simplifies to the standard VAE and to f-GAN as special cases, and illustrate the connections between different encoder-decoder GAN architectures. f-GM is compatible with general network architectures and optimizers. We leverage it to experimentally explore the effects -- e.g., mode collapse and image sharpness -- of different choices of f-divergence.  ( 2 min )
    Regression-based projection for learning Mori--Zwanzig operators. (arXiv:2205.05135v1 [math.DS])
    We propose to adopt statistical regression as the projection operator to enable data-driven learning of the operators in the Mori--Zwanzig formalism. We present a principled algorithm to extract the Markov and memory operators for any regression models. We show that the choice of linear regression results in a recently proposed data-driven learning algorithm based on Mori's projection operator, which can be considered as a higher-order approximate Koopman learning method. We show that more expressive, potentially nonlinear regression models naturally fill in the gap between the highly idealized and computationally efficient Mori's projection operator and the most optimal yet computationally infeasible Zwanzig projection operator. We performed numerical experiments and extracted the operators for an array of regression-based projections, including linear, polynomial, spline, and neural-network-based regression, showing a progressive improvement as the complexity of the regression model increased. Our proposition provides a general framework to extract memory-dependent corrections and can be readily applied to an array of data-driven learning methods for stationary dynamical systems in the literature.  ( 2 min )
    Bias and Priors in Machine Learning Calibrations for High Energy Physics. (arXiv:2205.05084v1 [hep-ph])
    Machine learning offers an exciting opportunity to improve the calibration of nearly all reconstructed objects in high-energy physics detectors. However, machine learning approaches often depend on the spectra of examples used during training, an issue known as prior dependence. This is an undesirable property of a calibration, which needs to be applicable in a variety of environments. The purpose of this paper is to explicitly highlight the prior dependence of some machine learning-based calibration strategies. We demonstrate how some recent proposals for both simulation-based and data-based calibrations inherit properties of the sample used for training, which can result in biases for downstream analyses. In the case of simulation-based calibration, we argue that our recently proposed Gaussian Ansatz approach can avoid some of the pitfalls of prior dependence, whereas prior-independent data-based calibration remains an open problem.  ( 2 min )
    Generalized Fast Multichannel Nonnegative Matrix Factorization Based on Gaussian Scale Mixtures for Blind Source Separation. (arXiv:2205.05330v1 [cs.SD])
    This paper describes heavy-tailed extensions of a state-of-the-art versatile blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) from a unified point of view. The common way of deriving such an extension is to replace the multivariate complex Gaussian distribution in the likelihood function with its heavy-tailed generalization, e.g., the multivariate complex Student's t and leptokurtic generalized Gaussian distributions, and tailor-make the corresponding parameter optimization algorithm. Using a wider class of heavy-tailed distributions called Gaussian scale mixtures (GSM), i.e., mixtures of Gaussian distributions whose variances are perturbed by positive random scalars called impulse variables, we propose GSM-FastMNMF and develop an expectation-maximization algorithm that works even when the probability density function of the impulse variables has no analytical expression. We show that existing heavy-tailed FastMNMF extensions are instances of GSM-FastMNMF and derive a new instance based on the generalized hyperbolic distribution, which includes the normal-inverse Gaussian, Student's t, and Gaussian distributions as special cases. Our experiments show that the normal-inverse Gaussian FastMNMF outperforms the state-of-the-art FastMNMF extensions and the ILRMA model in speech enhancement and separation in terms of the signal-to-distortion ratio.  ( 2 min )

  • Open

    MindsEye Lite: run multiple text-to-image models in one place
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    Terraform Provider Iterative (TPI) Helps ML/AI Teams Save Resources And Money While Switching Cloud Providers
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    What is the best type of learning for an AI to play a card game?
    I'm working on my project and I'm not sure what kind of learning is best for this kind of game. What do you think? submitted by /u/Merelex_Buk-12 [link] [comments]
    Aimlflow - Aim-based experiment comparison over your MLflow logs
    🙌 Hey all, check out our work on AiMLflow! We are building a tool that mounts the MLflow logs and enables an Aim-based super-performant UI for metric, image and other ML metadata comparison. If you are using MLflow, we would love to chat with you and share our progress on the project for feedback. https://aimstack.io/aimlflow submitted by /u/ManeSa [link] [comments]  ( 1 min )
    This UBIAI article and tutorial explains the predictions of a named entity recognition (NER) model using local interpretable model-agnostic explanations (LIME). Check out more articles and tutorials at: ubiai.tools
    submitted by /u/UBIAI [link] [comments]  ( 1 min )
    Use Of AI To Reduce Street Crimes And Improving Civil Security
    In this digital era, several law enforcement agencies across the globe are leveraging artificial intelligence (AI) to resolve more criminal cases in a very short time, equipped with AI algorithms developed to identify, locate and arrest real or potential criminals faster than ever. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Is AI eco-friendly?
    Artificial Intelligence (AI) systems can be used to reduce carbon emissions and eventually environmental degradation. It has the potential to understand the surroundings autonomously, learn from experience, make and implement decisions, and communicate with the surroundings. Read more submitted by /u/ridamughal110 [link] [comments]
    Interview with a computer scientist who works on AI & social/crowd computing. We speak about her career & life. If you are interested please subscribe as I'll be posting more AI interviews soon! :)
    submitted by /u/joemurray1994 [link] [comments]  ( 1 min )
    What is the difference between face detection and face recognition from an iOS perspective?
    submitted by /u/adilonreddit1 [link] [comments]  ( 1 min )
    Awesome Artificial Intelligence content (with code!) on Computer Vision News of May 2022
    Dear all, Awesome AI R&D content (with code!) on Computer Vision News of May 2022. Many great articles about AI, Deep Learning, Computer Vision and more... Review of award-winning ISBI 2022 paper. HTML5 version (recommended) PDF version Dilbert on page 2. Free subscription on page 60. Enjoy! submitted by /u/Gletta [link] [comments]  ( 1 min )
    AR/VR: The Next Frontier in Banking and Financial Services
    https://blog.r2c.io/ar-vr-the-next-frontier-in-banking-and-financial-services/ submitted by /u/R2Consulting [link] [comments]
    Some results look interesting!
    submitted by /u/limapedro [link] [comments]
    100+ Cheat Sheets for Machine Learning, Deep Learning and Data Science
    submitted by /u/vadhavaniyafaijan [link] [comments]
    Speech Recognition with TensorFlow.js - Voice Commands
    submitted by /u/RubiksCodeNMZ [link] [comments]
    The results of the AI experiment/survey I conducted on this sub a short time ago are here (link to the full study in the comment)
    submitted by /u/KazRainer [link] [comments]  ( 1 min )
    [P] PaddleOCR: An awesome and easy-to-use OCR system, which provides more than 80 kinds of multi-language recognition models.
    Hi all, I am glad to share an open-source repository, PaddleOCR, which provides more than 80 kinds of multi-language recognition models, including English, Chinese, French, German, Arabic, Korean, Japanese and so on. Code: https://github.com/PaddlePaddle/PaddleOCR Feature set: ultra-lightweight OCR system: detection (3.6M) + direction classifier (1.4M) + recognition (12M) = 17.0M; support for more than 80 kinds of multi-language recognition models, including English, Chinese, French, German, Arabic, Korean, Japanese and so on; semi-automatic data annotation tool PPOCRLabel: supports rectangular box, irregular text, table and key information annotation modes; data synthesis tool Style-Text: easily synthesize a large number of images similar to the target scene image; supports pip installation, easy to use; supports Linux, Windows, MacOS and other systems; Apache-2.0 license. submitted by /u/Aha_IamDaniel [link] [comments]  ( 1 min )
    MELODIES POSITIVE: Bottle Art.
    submitted by /u/cookingandcraft [link] [comments]
    Self-hosted 'replika-like' chatbot using VA Framework and Oryzer Studio
    I purchased VA Framework Standard Edition at a great price during the Steam Holiday sale at Christmas and played around with it a bit when I first got it, but it has been sitting idle since then and I decided to start messing with it again. What attracted me to this was being combined with Oryzer Studio and its flow-based programming and already being attached to a 3D avatar interface with speech engine. What I am trying to do is create a personal self-hosted 'replika-like' chatbot that can be trained and learn from my input. Based on my research and communication with the developer prior to purchasing this, what I am hoping to accomplish should easily be obtainable using this. The problem is that the documentation and sample projects available are very sparse and what is available seems to be geared more towards corporate customer systems rather than personal systems. I have limited programming experience, mostly Visual Basic from when I was in college, which is why Oryzer studio seemed perfect for me. What I am hoping is that there is someone else who has experience with it and may be able to assist with getting me started or at least point me in the direction of some documentation and/or sample projects, other than the ones provided on the developer's website, that I can build off of. VA Framework, which I purchased, is the 3D Avatar and speech engine and Oryzer Studio, which is free, is used to create the 'workspace' files used by VA Framework. Both are designed to work together, although Oryzer Studio does not require VA Framework to function and is freely available to anyone who wanted to look at it. I thank you in advance for any assistance you can provide submitted by /u/Normmstein [link] [comments]  ( 1 min )
    How to Make Slow Motion Videos With AI! TimeLens Explained
    submitted by /u/OnlyProggingForFun [link] [comments]
  • Open

    [D] What is the best machine learning algorithm to classify seemingly random data?
    I am a Computer Science student. For my bachelor's internship I have to research the possibility of predicting whether a new client will buy a house or not based on their website activity/behaviour, so the real-estate company can prioritise the most likely buyers. The real-estate company assured me they had loads of data, but nope... I have data on about 1400 clients, of which 150 bought a house. When we learn about new algorithms we usually have a perfect dataset to work with, but this is the real world where everything is chaos. I've tried KNN, SVM, Gaussian mixture and XGBoost, but they all underperform because there is very little data and all of it is seemingly random. So what is the best machine learning algorithm to classify seemingly random data? Or is it impossible to get any reasonable prediction out of it? For my research it is perfectly acceptable to conclude that it is impossible to predict anything given these circumstances. Thanks in advance! submitted by /u/Fine-Coffee6171 [link] [comments]  ( 2 min )
    [D] Any recommendations for image annotation software?
    I've been using Supervisely Enterprise at work and it's been going really well, but my work has been paying for it. Meanwhile, my grad research work requires a large-scale annotation effort and my advisor is queasy about forking over grant money just to annotate. We can't use the community version because of HIPAA. I've been looking at other free annotation software and I'm trying to make a decision on which to use. Any suggestions on what has worked for you? submitted by /u/aygupt1822 [link] [comments]  ( 1 min )
    [D] I'm writing an MSc thesis on synthetic data for image classification. I am wondering which kinds of data visualization techniques (of real data)/EDA I could add.
    The task is recognition of jersey numbers on the backs of sports players. The dataset contains the picture, the label and a bounding box containing the jersey number. The only ideas I had were plotting the distribution of the bounding boxes in the image, and the histogram of labels. I am not sure whether it makes sense to plot the distribution of image resolutions. submitted by /u/TheManveru [link] [comments]  ( 2 min )
    [P] Nftopia.ai - Visual semantic search for NFTs
    Hi! My colleagues & I indexed about 1.1 million NFT images from across 21 thousand collections using OpenAI's CLIP and put it behind a website. Please let us know what you think! Specifically, we embed all these images using CLIP & expose two functionalities. The search bar embeds the input query & retrieves similar images. The "+more like this" is image search that retrieves other NFTs similar to it in the image space. We use Pinecone to power the approximate nearest neighbor search. The web stack uses Django in the front + a Flask app that interfaces with Pinecone. (Warning: we've noticed several NSFW NFTs that can pop up occasionally, even for unrelated search queries) submitted by /u/hayAbhay [link] [comments]  ( 1 min )
    [N] Awesome AI R&D content (with code!) on Computer Vision News of May 2022
    Dear all, Awesome R&D content (with code!) on Computer Vision News of May 2022. Many great articles about AI, Deep Learning, Computer Vision and more... Review of award-winning ISBI 2022 paper. HTML5 version (recommended) PDF version Dilbert on page 2. Free subscription on page 60. Enjoy! submitted by /u/Gletta [link] [comments]  ( 1 min )
    [D] Can you recommend this book? Analytical Skills for AI and Data Science
    Recently, I have been stumbling across this book frequently: https://www.oreilly.com/library/view/analytical-skills-for/9781492060932/. I am thinking about buying it. Are there people here who own this book and can give some recommendations? submitted by /u/rightkill [link] [comments]  ( 1 min )
    [P] Recognize similar nodes in chronologically evolving graph
    I've collected a data set of around 5 million unique ids which transact values with each other in one-directional ways at some point in time. I understand that this can easily be visualized as a graph where each node is an id and each interaction is a weighted edge assigned a tuple (value, time). I now want to build a model that finds similarly behaving nodes. The goal is to look at the current state of the graph and use the model to predict what's likely to happen with nodes that have joined the graph relatively late (= first transaction isn't old). I'm sure there has been work done that can help me find hints on how to do this. Are there papers/models you would recommend looking at? submitted by /u/SomewhereOld6859 [link] [comments]  ( 1 min )
    [R] [CVPR 2022 Oral] Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation
    We present an end-to-end deep view aggregation method for 3D semantic segmentation from images and point clouds. We reach SOTA on S3DIS and KITTI360 without requiring point cloud colorization, meshing, or depth sensors: just point clouds, images, and their poses. preprint | code | paperwithcode submitted by /u/grad_student_descent [link] [comments]  ( 1 min )
    [P] I wrote an article on Uplift Modeling to target the "persuadables"
    Hey everyone! As the title says, I recently wrote a technical article where I built an uplift model to increase marketing ROIs by targeting the right group of people (the persuadables). Here's the link: https://towardsdatascience.com/targeting-the-right-group-with-uplift-modelling-5682de2dff8b Curious to know if anyone has successfully used uplift modeling in their industry or field? And let me know what you think. Any feedback is greatly appreciated! submitted by /u/ramanansubramanian [link] [comments]  ( 1 min )
    [D] Is there a tool to visualise my neural network in real time?
    I would like to know if there are any tools that I can use to visualise my NN in real time, noting that I use Keras. I want something like what is shown in this video screenshot (screenshot omitted). submitted by /u/aymenboufe [link] [comments]  ( 1 min )
    [D] Is there any way to artificially create a probability calibration for data coming from another model?
    I have probability predictions which come from a survival model. This model gives me very low probabilities, and I am not sure they reflect the real probability of the phenomenon. For example, I calculate P(T≤t+d|T>t) and the probabilities are very low (with d=180). To summarize, I need these probabilities to be, on average, another number (let's say 0.2). Is it possible to create an artificial calibration with only this number (the desired average) as input? I have thought of creating a vector of size n equal to the size of the original dataset with distribution X_i ∼ Bernoulli(p = 0.2), assigning its ones to the top np probabilities and its zeros to the remaining n(1−p). This would result in a table with a column of probabilities obtained from the survival model and another column with a 0 or 1 depending on that probability. After getting this table, I would simply use CalibratedClassifierCV from scikit-learn. Could this be the correct way? submitted by /u/Jhones_edlc [link] [comments]  ( 1 min )
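    A sketch of the construction described in the post, using isotonic regression as the monotone calibration map (whether the resulting numbers are statistically meaningful probabilities is exactly the open question being asked): labels with the desired mean are assigned to the top-ranked predictions, and a calibration map is fit from the model's probabilities to them.
```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
p_model = rng.beta(1, 30, size=10_000)   # stand-in for the survival model's low probabilities
target_mean = 0.2

# Put ones on the top n*p scores and zeros on the remaining n*(1-p)
labels = np.zeros_like(p_model)
order = np.argsort(-p_model)
labels[order[: int(target_mean * len(p_model))]] = 1.0

calib = IsotonicRegression(out_of_bounds="clip").fit(p_model, labels)
print(calib.predict(p_model).mean())     # ~= 0.2 by construction
```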
    [D] Fastest way to calculate distance (drift) between vectors - at scale (billions)
    I'm looking for the fastest way to calculate distance (drift) between vectors - at scale (billions). What do I mean? For example, there are two sets of 1 billion vectors each. I want to know how far/different one set is from the other. Each vector has 768 dimensions. submitted by /u/igaloly [link] [comments]  ( 2 min )
    [R][D] Cluster Sentences
    I have a use-case where I need to cluster/combine related sentences together (example screenshots of sentences and their groupings omitted). Constraint: we do not know the number of clusters beforehand (K-means clustering is not much use here). The following does not give a satisfactory solution: https://ntropy.com/post/clustering-millions-of-sentences-to-optimize-the-ml-workflow Can anyone please help with the approach/pointers? submitted by /u/Expert-Departure-236 [link] [comments]
    [R] Full-batch GD generalizes better than SGD
    This paper https://arxiv.org/abs/2204.12446 shows the dependence of the generalization error on the optimization error. For smooth losses (examples: log-sum-exp, or smoothed leaky-ReLU activation functions), full-batch GD generalizes better than SGD (or at least better than known generalization error bounds for SGD). Additionally, in the over-parametrized regime (exact fit) the Polyak-Lojasiewicz condition holds (https://arxiv.org/abs/2003.00307), and T = C log(n) iterations are required to achieve generalization and excess risk of the order 1/n^2. In practice, this should be translated as "train longer, to generalize better" and "increase the batch size to generalize better". See also the tables of results in the paper (screenshots omitted). submitted by /u/chaotic_shadow4444 [link] [comments]  ( 4 min )
    [P] Music mixing and AI
    I was wondering what would be the best approach to developing an AI agent that automatically mixes music with sounds and mix effects, i.e., an artificial DJ? submitted by /u/big_dataFitness [link] [comments]  ( 1 min )
  • Open

    Animo Island, a PC game that empowers players to explore reinforcement learning as a game mechanic
    Transitional Forms presents to you, Animo Island, a rogue-lite base builder that utilizes the power of reinforcement learning to enable YOU to perform and explore complex behaviours by training cute little agents called Animo! Through the process of rewarding and training your Animo, you can teach it to perform actions on the island, from which it can learn to make better decisions over time. You can customize and train your Animo to gather crystals, pick apples, plant flowers... the sky's the limit as you explore new terrain and emergent behaviours! Animo Island is teeming with potential for you to create an ever-changing and autonomous paradise for your Animos to cultivate and grow. Through cute, loveable characters and playful environments, Animo Island invites you to create empathetic connections with creative machine intelligence in a way that has never been done before. Join our Discord community to learn more about Animo Island and get all the latest updates and information! We are currently looking for passionate folks to be play testers for our upcoming beta release, make sure to drop us a line in Discord if you're interested! Link to Discord: https://discord.gg/rkSXhKyDVR Subreddit: r/AnimoIsland submitted by /u/AnimoIsland [link] [comments]  ( 1 min )
    Basic question about importance sampling for off-policy Montecarlo Prediction.
    submitted by /u/jamespherman [link] [comments]  ( 1 min )
    How to set up a Gym environment for a two player game like Nine Men's Morris
    I'm working on an RL agent which plays the game Nine Men's Morris against a human player (and hopefully wins most of the time). At least that is the goal. Right now I'm struggling with the game logic itself and setting up an appropriate environment with gym. Question 1: If an action closes a mill, the same player has to take another action (remove one of the opponent's pieces) - how do I set this up in the step() function? Currently, I have two ideas: inside the step function, require another action from the agent, or change the action space so that every action contains an additional parameter of where to take a piece in case of a mill (see the sketch after this post). The latter seems cleaner but also increases my action space further ... Question 2: The observation is a 7 x 7 player board where a 1 marks player 1, a -1 marks player 2 and zero is an empty field. There are 600 possible actions (place a piece on one of 24 legal points, take a piece from one of 24 legal points, move a piece between 24 * 24 - 24 point pairs). Is there any way to reduce the action space? Is 600 too large? Should my agent know which actions are legal or should it learn this? Question 3: My current understanding of gym is that the entire game logic is behind the step(action) function. If the agent chooses an illegal action, the game is immediately terminated and the agent receives the losing reward. Is this correct? Question 4: My first idea for training the agent is to invert the player board after each turn - so the agent plays against itself. I've read that this is probably unstable and won't produce great results, but as a first step it would probably be enough. Another idea is to play against an older version of the agent so that it gets better over time. Do these ideas seem reasonable? Furthermore, any resource (blog article, paper, tutorial ...) on this topic (using RL in two-player games) would be greatly appreciated. submitted by /u/house_92 [link] [comments]  ( 2 min )
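    One possible answer to Question 1 in the post above, as an illustrative sketch rather than a full Nine Men's Morris implementation: encode every action as a pair (primary move, removal target), so a closed mill is resolved within a single step() call and no second query to the agent is needed.
```python
import numpy as np
import gym
from gym import spaces

class NineMensMorrisSketch(gym.Env):
    """Skeleton env: composite actions carry an optional removal target."""

    def __init__(self):
        # (primary action id out of 600, removal target 0-23, or 24 = "no removal")
        self.action_space = spaces.MultiDiscrete([600, 25])
        self.observation_space = spaces.Box(-1, 1, shape=(7, 7), dtype=np.int8)
        self.board = np.zeros((7, 7), dtype=np.int8)

    def step(self, action):
        primary, removal = action
        if not self._is_legal(primary, removal):
            return self.board, -1.0, True, {}   # illegal action loses immediately
        # ... apply the primary move, then the removal if a mill was closed ...
        return self.board, 0.0, False, {}

    def _is_legal(self, primary, removal):
        return True  # placeholder: the real legality check goes here

    def reset(self):
        self.board[:] = 0
        return self.board
```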
    "Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers", Chan et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
Changing the observation space from real-valued quantities to visual obs
I would like to change the observation of the MPE (multi-agent particle) environment. If you look at the code (https://github.com/openai/multiagent-particle-envs/blob/47e9ee38e605f8a563370b3c7e52a349eca3f6b1/multiagent/environment.py#L69), this is how it is initialized: self.observation_space.append(spaces.Box(low=-np.inf, high=+np.inf, shape=(obs_dim,), dtype=np.float32)) Then, if you look at one of the actual envs, for example simple spread, this (https://github.com/openai/multiagent-particle-envs/blob/47e9ee38e605f8a563370b3c7e52a349eca3f6b1/multiagent/scenarios/simple_spread.py#L100) is what the env gives as an observation to the agent at each step: return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + comm) Is there a way I can add an image of the environment to this np array? I would like my agent to also receive a visual obs. submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
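One sketch of an approach, assuming the MPE render() call can return pixel arrays (the viewer-based render in environment.py accepts an 'rgb_array' mode): wrap the environment and attach a rendered frame to each agent's observation, rather than editing every scenario's observation() function. The wrapper name and the dict observation format are my own choices here, not part of the library.

    import numpy as np

    class VisualObsWrapper:
        """Attach a rendered frame to each agent's observation (sketch)."""
        def __init__(self, env):
            self.env = env

        def reset(self):
            return self._augment(self.env.reset())

        def step(self, action_n):
            obs_n, reward_n, done_n, info_n = self.env.step(action_n)
            return self._augment(obs_n), reward_n, done_n, info_n

        def _augment(self, obs_n):
            # render() returns one frame per viewer; take the first (H, W, 3) array
            frame = self.env.render(mode='rgb_array')[0]
            # Keeping both modalities separate preserves the spatial structure
            # for a CNN; np.concatenate([o, frame.ravel()]) also works if a
            # flat Box observation is required.
            return [{'state': o, 'image': frame} for o in obs_n]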
    On the Verge of Solving Rocket League using Deep Reinforcement Learning and Sim-to-sim Transfer
    Paper: https://arxiv.org/abs/2205.05061 Videos: https://www.youtube.com/watch?v=8k9FNxIU0KQ Github: Coming soon Playlist: https://www.youtube.com/watch?v=WXMHJszkz6M&list=PL2KGNY2Ei3ix7Vr_vA-ZgCyVfOCfhbX0C submitted by /u/LilHairdy [link] [comments]  ( 1 min )
    Papers explaining the limitations of Q-learning and the deep Q-network
Hi! I am currently working on my master thesis that includes custom-made environments for Q-learning and a deep Q-network from Mnih et al. (2015). One limitation of my custom-made environment is that the state and action spaces grow rapidly. For the Q-learning solution, the Q-table becomes enormous, with millions to billions of possible states, while the deep Q-network suffers from catastrophic forgetting. Furthermore, the replay memory (1 million experiences) is not large enough to deal efficiently with the rapid growth of the state space. I know that "Implementing the Deep Q-Network" by Melrose Roderick, James MacGlashan, and Stefanie Tellex is a good paper explaining the limitations of DQN. My question is: does anyone know of more good papers like this? I am also interested in papers explaining the limitations of Q-learning. Thank you for any help! submitted by /u/Sondreeo [link] [comments]  ( 1 min )
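As a back-of-the-envelope illustration of why tabular Q-learning collapses at this scale (the numbers below are illustrative, not from the thesis):

    n_states = 10**9            # a billion reachable states
    n_actions = 100
    bytes_per_entry = 4         # one float32 Q-value per (state, action) pair
    q_table = n_states * n_actions * bytes_per_entry
    print(f"Q-table size: {q_table / 1e12:.1f} TB")   # 0.4 TB per billion states

Beyond memory, most entries would never be visited even once during training, so the table could not be filled with meaningful estimates anyway.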
  • Open

    Language Models Perform Reasoning via Chain of Thought
Posted by Jason Wei and Denny Zhou, Research Scientists, Google Research, Brain team In recent years, scaling up the size of language models has been shown to be a reliable way to improve performance on a range of natural language processing (NLP) tasks. Today’s language models at the scale of 100B or more parameters achieve strong performance on tasks like sentiment analysis and machine translation, even with few or no training examples. Even the largest language models, however, can still struggle with certain multi-step reasoning tasks, such as math word problems and commonsense reasoning. How might we enable language models to perform such reasoning tasks? In “Chain of Thought Prompting Elicits Reasoning in Large Language Models,” we explore a prompting method for improving the re…  ( 6 min )
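To give a flavor of the method, a chain-of-thought exemplar from the paper includes the intermediate reasoning rather than just the final answer:

    Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
       Each can has 3 tennis balls. How many tennis balls does he have now?
    A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6
       tennis balls. 5 + 6 = 11. The answer is 11.

At inference time, the model is prompted with a few such exemplars followed by a new question, and it produces its own chain of thought before stating the answer.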
    Unlocking Zero-Resource Machine Translation to Support New Languages in Google Translate
Posted by Isaac Caswell and Ankur Bapna, Research Scientists, Google Translate Machine translation (MT) technology has made significant advances in recent years, as deep learning has been integrated with natural language processing (NLP). Performance on research benchmarks like WMT has soared, and translation services have improved in quality and expanded to include new languages. Nevertheless, while existing translation services cover languages spoken by the majority of people worldwide, they only include around 100 languages in total, just over 1% of those actively spoken globally. Moreover, the languages that are currently represented are overwhelmingly European, largely overlooking regions of high linguistic diversity, like Africa and the Americas. There are two key bottlenecks t…  ( 9 min )
  • Open

    Achieve in-vehicle comfort using personalized machine learning and Amazon SageMaker
    This blog post is co-written by Rudra Hota and Esaias Pech from Continental AG. Many drivers have had the experience of trying to adjust temperature settings in their vehicle while attempting to keep their eyes on the road. Whether the previous driver preferred a warmer cabin temperature, or you’re now wearing warmer clothing, or the […]  ( 8 min )
  • Open

    New Twitter account: ElementFact
I started a new Twitter account this morning: @ElementFact. I’m thinking the account will post things like scientific facts about each element but also some history around how the element was discovered and named and other lore associated with the element. We’ll see how this goes. I’ve started many Twitter accounts over the years. Some […]  ( 1 min )
  • Open

    Transfer Learning — Part — 6.1!! Implementing Mobilenet in Keras
In Part 6.0 of the Transfer Learning series we discussed the MobileNet pre-trained model in depth, so in this part we will…  ( 89 min )
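For readers who want the gist before the full article: the usual transfer-learning recipe with MobileNet in Keras is to freeze the ImageNet-pretrained base and train a small classification head on top. A minimal sketch, with num_classes standing in for your own dataset:

    import tensorflow as tf
    from tensorflow.keras.applications import MobileNet
    from tensorflow.keras import layers, models

    num_classes = 10
    base = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pretrained convolutional features

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation='relu'),
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.summary()

A common next step is to unfreeze the top few layers of the base and fine-tune with a lower learning rate once the head has converged.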
  • Open

    DSC Weekly Newsletter 10 May 2022: Data Meshes, Digital Twins, and Knowledge Graphs
There’s an underlying theme to this week’s articles, which is a curious occurrence given that so much of our content is user-driven. That theme is the Value of Data. There is a tendency when looking at data in its various incarnations to view all data as somehow being valuable. Realistically, without rolling up sleeves and…  ( 3 min )
  • Open

    Learning Fast, Learning Slow: A General Continual Learning Method based on Complementary Learning System. (arXiv:2201.12604v2 [cs.LG] UPDATED)
    Humans excel at continually learning from an ever-changing environment whereas it remains a challenge for deep neural networks which exhibit catastrophic forgetting. The complementary learning system (CLS) theory suggests that the interplay between rapid instance-based learning and slow structured learning in the brain is crucial for accumulating and retaining knowledge. Here, we propose CLS-ER, a novel dual memory experience replay (ER) method which maintains short-term and long-term semantic memories that interact with the episodic memory. Our method employs an effective replay mechanism whereby new knowledge is acquired while aligning the decision boundaries with the semantic memories. CLS-ER does not utilize the task boundaries or make any assumption about the distribution of the data which makes it versatile and suited for "general continual learning". Our approach achieves state-of-the-art performance on standard benchmarks as well as more realistic general continual learning settings.  ( 2 min )
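A minimal sketch of the dual-memory mechanic described in this abstract (not the authors' code): keep fast- and slow-updating exponential moving averages (EMAs) of the working model as the short- and long-term semantic memories, alongside an episodic replay buffer. The decay rates below are illustrative.

    import copy
    import torch

    def ema_update(memory, model, decay):
        # Move the semantic memory a small step toward the working model.
        with torch.no_grad():
            for m_p, p in zip(memory.parameters(), model.parameters()):
                m_p.mul_(decay).add_(p, alpha=1 - decay)

    model = torch.nn.Linear(10, 2)      # stand-in for the real network
    plastic = copy.deepcopy(model)      # short-term semantic memory (fast EMA)
    stable = copy.deepcopy(model)       # long-term semantic memory (slow EMA)

    # After each gradient step on (new batch + episodic replay batch):
    ema_update(plastic, model, decay=0.999)
    ema_update(stable, model, decay=0.9999)

Because the memories are plain weight averages, no task boundaries are needed, which matches the "general continual learning" setting the abstract emphasizes.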
    Application of Transfer Learning and Ensemble Learning in Image-level Classification for Breast Histopathology. (arXiv:2204.08311v2 [cs.CV] UPDATED)
    Background: Breast cancer has the highest prevalence in women globally. The classification and diagnosis of breast cancer and its histopathological images have always been a hot spot of clinical concern. In Computer-Aided Diagnosis (CAD), traditional classification models mostly use a single network to extract features, which has significant limitations. On the other hand, many networks are trained and optimized on patient-level datasets, ignoring the application of lower-level data labels. Method: This paper proposes a deep ensemble model based on image-level labels for the binary classification of benign and malignant lesions of breast histopathological images. First, the BreaKHis dataset is randomly divided into a training, validation and test set. Then, data augmentation techniques are used to balance the number of benign and malignant samples. Thirdly, considering the performance of transfer learning and the complementarity between each network, VGG16, Xception, ResNet50, DenseNet201 are selected as the base classifiers. Result: In the ensemble network model with accuracy as the weight, the image-level binary classification achieves an accuracy of $98.90\%$. In order to verify the capabilities of our method, the latest Transformer and Multilayer Perception (MLP) models have been experimentally compared on the same dataset. Our model wins with a $5\%-20\%$ advantage, emphasizing the ensemble model's far-reaching significance in classification tasks. Conclusion: This research focuses on improving the model's classification performance with an ensemble algorithm. Transfer learning plays an essential role in small datasets, improving training speed and accuracy. Our model has outperformed many existing approaches in accuracy, providing a method for the field of auxiliary medical diagnosis.  ( 2 min )
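A small sketch of the accuracy-weighted ensembling step described above; the validation accuracies and softmax outputs are illustrative placeholders, not the paper's numbers.

    import numpy as np

    val_acc = {'vgg16': 0.95, 'xception': 0.97, 'resnet50': 0.96, 'densenet201': 0.98}
    # Per-model softmax outputs for a batch of 8 images (benign vs malignant)
    probs = {name: np.random.rand(8, 2) for name in val_acc}
    for name in probs:
        probs[name] /= probs[name].sum(axis=1, keepdims=True)

    weights = np.array(list(val_acc.values()))
    weights = weights / weights.sum()            # normalize accuracies to weights
    ensemble = sum(w * probs[n] for w, n in zip(weights, val_acc))
    prediction = ensemble.argmax(axis=1)         # 0 = benign, 1 = malignant

Weighting by validation accuracy lets stronger base classifiers dominate while still letting the weaker ones contribute complementary evidence.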
    A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data. (arXiv:2201.12020v2 [stat.ML] UPDATED)
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or follows non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. This paper shows that this problem reduces to the estimation of a mixture of Angular Gaussian distributions under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, possibly different from one sample to another). In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.  ( 2 min )
    A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks. (arXiv:2205.05040v1 [cs.LG])
In distributed training of deep neural networks or Federated Learning (FL), people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the FL setting is still in its infancy: it remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with nonconvex loss function, non-Lipschitz continuous gradient, and skipping communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape which LSTM was shown to satisfy in previous works and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have $O\left(\frac{1}{N\epsilon^4}\right)$ iteration complexity for finding an $\epsilon$-stationary point, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup. We prove this result by introducing novel analysis techniques of estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence speed in practice and thus validates our theory.  ( 2 min )
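The single-machine building block being extended here is standard gradient-norm clipping. A PyTorch sketch of one clipped step (sizes and clipping threshold are illustrative); in the federated setting described above, each machine would run several such steps locally between communication rounds:

    import torch

    model = torch.nn.LSTM(input_size=32, hidden_size=64)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(16, 8, 32)          # (seq_len, batch, input_size)
    out, _ = model(x)
    out.sum().backward()                # stand-in loss
    # Rescale the gradient if its global norm exceeds max_norm,
    # taming the exploding-gradient issue mentioned in the abstract.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    opt.zero_grad()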
    NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality. (arXiv:2205.04421v2 [eess.AS] UPDATED)
Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: can a TTS system achieve human-level quality, how do we define/judge that quality, and how can it be achieved? In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in VAE. Experiment evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) to human recordings at the sentence level, with Wilcoxon signed rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.  ( 2 min )
    Engineering flexible machine learning systems by traversing functionally invariant paths in weight space. (arXiv:2205.00334v2 [cs.LG] UPDATED)
    Deep neural networks achieve human-like performance on a variety of perceptual and decision making tasks. However, deep networks perform poorly when confronted with changing tasks or goals, and broadly fail to match the flexibility and robustness of human intelligence. Here, we develop a mathematical and algorithmic framework that enables continual training of deep neural networks on a broad range of objectives by defining path connected sets of neural networks that achieve equivalent functional performance on a given machine learning task while modulating network weights to achieve high-performance on a secondary objective. We view the weight space of a neural network as a curved Riemannian manifold and move a neural network along a functionally invariant path in weight space while searching for networks that satisfy a secondary objective. We introduce a path-sampling algorithm that trains networks with millions of weight parameters to learn a series of image classification tasks without performance loss. The algorithm generalizes to accommodate a range of secondary objectives including weight-pruning and weight diversification and exhibits state of the art performance on network compression and adversarial robustness benchmarks. Broadly, we demonstrate how the intrinsic geometry of machine learning problems can be harnessed to construct flexible and robust neural networks.  ( 2 min )
    Deep Learning Based Cloud Cover Parameterization for ICON. (arXiv:2112.11317v2 [physics.ao-ph] UPDATED)
    A promising approach to improve cloud parameterizations within climate models and thus climate projections is to use deep learning in combination with training data from storm-resolving model (SRM) simulations. The ICOsahedral Non-hydrostatic (ICON) modeling framework permits simulations ranging from numerical weather prediction to climate projections, making it an ideal target to develop neural network (NN) based parameterizations for sub-grid scale processes. Within the ICON framework, we train NN based cloud cover parameterizations with coarse-grained data based on realistic regional and global ICON SRM simulations. We set up three different types of NNs that differ in the degree of vertical locality they assume for diagnosing cloud cover from coarse-grained atmospheric state variables. The NNs accurately estimate sub-grid scale cloud cover from coarse-grained data that has similar geographical characteristics as their training data. Additionally, globally trained NNs can reproduce sub-grid scale cloud cover of the regional SRM simulation. Using the game-theory based interpretability library SHapley Additive exPlanations, we identify an overemphasis on specific humidity and cloud ice as the reason why our column-based NN cannot perfectly generalize from the global to the regional coarse-grained SRM data. The interpretability tool also helps visualize similarities and differences in feature importance between regionally and globally trained column-based NNs, and reveals a local relationship between their cloud cover predictions and the thermodynamic environment. Our results show the potential of deep learning to derive accurate yet interpretable cloud cover parameterizations from global SRMs, and suggest that neighborhood-based models may be a good compromise between accuracy and generalizability.  ( 2 min )
    SmartSAGE: Training Large-scale Graph Neural Networks using In-Storage Processing Architectures. (arXiv:2205.04711v1 [cs.AR])
Graph neural networks (GNNs) can extract features by learning both the representation of each object (i.e., graph nodes) and the relationships across different objects (i.e., the edges that connect nodes), achieving state-of-the-art performance in various graph-based tasks. Despite its strengths, utilizing these algorithms in a production environment faces several challenges as the number of graph nodes and edges amounts to billions to hundreds of billions, requiring substantial storage space for training. Unfortunately, state-of-the-art ML frameworks employ an in-memory processing model which significantly hampers the productivity of ML practitioners as it mandates the overall working set to fit within DRAM capacity. In this work, we first conduct a detailed characterization on a state-of-the-art, large-scale GNN training algorithm, GraphSAGE. Based on the characterization, we then explore the feasibility of utilizing capacity-optimized NVM SSDs for storing memory-hungry GNN data, which enables large-scale GNN training beyond the limits of main memory size. Given the large performance gap between DRAM and SSD, however, blindly utilizing SSDs as a direct substitute for DRAM leads to significant performance loss. We therefore develop SmartSAGE, our software/hardware co-design based on an in-storage processing (ISP) architecture. Our work demonstrates that an ISP based large-scale GNN training system can achieve both high capacity storage and high performance, opening up opportunities for ML practitioners to train large GNN datasets without being hampered by the physical limitations of main memory size.  ( 2 min )
    An Algorithmic Framework for Bias Bounties. (arXiv:2201.10408v4 [cs.LG] UPDATED)
    We propose and analyze an algorithmic framework for "bias bounties": events in which external participants are invited to propose improvements to a trained model, akin to bug bounty events in software and security. Our framework allows participants to submit arbitrary subgroup improvements, which are then algorithmically incorporated into an updated model. Our algorithm has the property that there is no tension between overall and subgroup accuracies, nor between different subgroup accuracies, and it enjoys provable convergence to either the Bayes optimal model or a state in which no further improvements can be found by the participants. We provide formal analyses of our framework, experimental evaluation, and findings from a preliminary bias bounty event.  ( 2 min )
    RLFlow: Optimising Neural Network Subgraph Transformation with World Models. (arXiv:2205.01435v2 [cs.LG] UPDATED)
Training deep learning models takes an extremely long execution time and consumes large amounts of computing resources. At the same time, recent research proposed systems and compilers that are expected to decrease deep learning models' runtime. An effective optimisation methodology in data processing is desirable, and the reduction of compute requirements of deep learning models is the focus of extensive research. In this paper, we address neural network sub-graph transformation by exploring reinforcement learning (RL) agents to achieve performance improvement. Our proposed approach RLFlow can learn to perform neural network subgraph transformations, without the need for expertly designed heuristics to achieve a high level of performance. Recent work has aimed at applying RL to computer systems with some success, especially using model-free RL techniques. Model-based reinforcement learning methods have seen an increased focus in research as they can be used to learn the transition dynamics of the environment; this can be leveraged to train an agent using a hallucinogenic environment such as a World Model (WM), thereby increasing sample efficiency compared to model-free approaches. WM uses variational auto-encoders to build a model of the system, which allows exploring the model in an inexpensive way. In RLFlow, we propose a design for a model-based agent with WM which learns to optimise the architecture of neural networks by performing a sequence of sub-graph transformations to reduce model runtime. We show that our approach can match the state-of-the-art performance on common convolutional networks and outperforms those based on transformer-style architectures by up to 5%.  ( 2 min )
    Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards. (arXiv:2205.04702v1 [cs.AR])
Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design to have the large CPU memory store the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory bandwidth intensive operations which are at odds with the slow CPU memory, causing performance overheads. Prior work proposed to cache frequently accessed embeddings inside GPU memory as a means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such cache design. In this work, we present a fundamentally different approach in designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that not only sees the past but also the "future" cache accesses. ScratchPipe exploits such property to guarantee that the active working set of embedding layers can "always" be captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.  ( 2 min )
    Cognitive Visual-learning Environment for PostgreSQL. (arXiv:2205.04834v1 [cs.LG])
PostgreSQL is an object-relational database management system (ORDBMS) that was introduced into the database community and has been avidly used for a variety of information extraction use cases. It is also known to be an advanced SQL-compliant open source object RDBMS. However, many users have not yet moved to PostgreSQL because of the complexity of its persistent textual environment for an amateur user. Hence, there is a dire need to provide an easy environment for users to comprehend the procedures and standards with which databases, tables and the relationships among them are created, and how queries are manipulated and flow based on conditions in PostgreSQL. As such, this project identifies the dominant features offered by PostgreSQL, analyzes the constraints that keep the database user community from migrating to PostgreSQL and, based on the scope and constraints identified, develops a system that serves as a query generation platform as well as a learning tool providing an interactive environment to cognitively learn PostgreSQL query building. This is achieved using a visual editor incorporating a textual editor for a well-versed user. By providing visually-draggable query components to work with, this research aims to offer a cognitive, visual and tactile environment where users can interactively learn PostgreSQL query generation.  ( 2 min )
    Learning Relative Return Policies With Upside-Down Reinforcement Learning. (arXiv:2202.12742v2 [cs.LG] UPDATED)
    Lately, there has been a resurgence of interest in using supervised learning to solve reinforcement learning problems. Recent work in this area has largely focused on learning command-conditioned policies. We investigate the potential of one such method -- upside-down reinforcement learning -- to work with commands that specify a desired relationship between some scalar value and the observed return. We show that upside-down reinforcement learning can learn to carry out such commands online in a tabular bandit setting and in CartPole with non-linear function approximation. By doing so, we demonstrate the power of this family of methods and open the way for their practical use under more complicated command structures.  ( 2 min )
    Federated Random Reshuffling with Compression and Variance Reduction. (arXiv:2205.03914v2 [cs.LG] UPDATED)
Random Reshuffling (RR), which is a variant of Stochastic Gradient Descent (SGD) employing sampling without replacement, is an immensely popular method for training supervised machine learning models via empirical risk minimization. Due to its superior practical performance, it is embedded and often set as default in standard machine learning software. Under the name FedRR, this method was recently shown to be applicable to federated learning (Mishchenko et al., 2021), with superior performance when compared to common baselines such as Local SGD. Inspired by this development, we design three new algorithms to improve FedRR further: compressed FedRR and two variance reduced extensions: one for taming the variance coming from shuffling and the other for taming the variance due to compression. The variance reduction mechanism for compression allows us to eliminate dependence on the compression parameter, and applying additional controlled linear perturbations for Random Reshuffling, introduced by Malinovsky et al. (2021), helps to eliminate variance at the optimum. We provide the first analysis of compressed local methods under standard assumptions without bounded gradient assumptions and for heterogeneous data, overcoming the limitations of the compression operator. We corroborate our theoretical results with experiments on synthetic and real data sets.
    Automatic Sleep Staging of EEG Signals: Recent Development, Challenges, and Future Directions. (arXiv:2111.08446v3 [eess.SP] UPDATED)
    Modern deep learning holds a great potential to transform clinical practice on human sleep. Teaching a machine to carry out routine tasks would be a tremendous reduction in workload for clinicians. Sleep staging, a fundamental step in sleep practice, is a suitable task for this and will be the focus in this article. Recently, automatic sleep staging systems have been trained to mimic manual scoring, leading to similar performance to human sleep experts, at least on scoring of healthy subjects. Despite tremendous progress, we have not seen automatic sleep scoring adopted widely in clinical environments. This review aims to give a shared view of the authors on the most recent state-of-the-art development in automatic sleep staging, the challenges that still need to be addressed, and the future directions for automatic sleep scoring to achieve clinical value.
    Semantic features of object concepts generated with GPT-3. (arXiv:2202.03753v2 [cs.CL] UPDATED)
    Semantic features have been playing a central role in investigating the nature of our conceptual representations. Yet the enormous time and effort required to empirically sample and norm features from human raters has restricted their use to a limited set of manually curated concepts. Given recent promising developments with transformer-based language models, here we asked whether it was possible to use such models to automatically generate meaningful lists of properties for arbitrary object concepts and whether these models would produce features similar to those found in humans. To this end, we probed a GPT-3 model to generate semantic features for 1,854 objects and compared automatically-generated features to existing human feature norms. GPT-3 generated many more features than humans, yet showed a similar distribution in the types of generated features. Generated feature norms rivaled human norms in predicting similarity, relatedness, and category membership, while variance partitioning demonstrated that these predictions were driven by similar variance in humans and GPT-3. Together, these results highlight the potential of large language models to capture important facets of human knowledge and yield a new approach for automatically generating interpretable feature sets, thus drastically expanding the potential use of semantic features in psychological and linguistic studies.
    Category-orthogonal object features guide information processing in recurrent neural networks trained for object categorization. (arXiv:2111.07898v2 [cs.CV] UPDATED)
    Recurrent neural networks (RNNs) have been shown to perform better than feedforward architectures in visual object categorization tasks, especially in challenging conditions such as cluttered images. However, little is known about the exact computational role of recurrent information flow in these conditions. Here we test RNNs trained for object categorization on the hypothesis that recurrence iteratively aids object categorization via the communication of category-orthogonal auxiliary variables (the location, orientation, and scale of the object). Using diagnostic linear readouts, we find that: (a) information about auxiliary variables increases across time in all network layers, (b) this information is indeed present in the recurrent information flow, and (c) its manipulation significantly affects task performance. These observations confirm the hypothesis that category-orthogonal auxiliary variable information is conveyed through recurrent connectivity and is used to optimize category inference in cluttered environments.
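The "diagnostic linear readout" used in this abstract is essentially a linear probe trained on frozen hidden states. A sketch with stand-in arrays (the shapes and the orientation labels are illustrative, not from the paper):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    hidden = np.random.randn(1000, 256)           # RNN hidden states at one timestep
    orientation = np.random.randint(0, 4, 1000)   # auxiliary-variable labels

    # If a simple linear model can decode orientation from the hidden states,
    # that information is linearly present in the recurrent representation.
    probe = LogisticRegression(max_iter=1000).fit(hidden[:800], orientation[:800])
    print("decoding accuracy:", probe.score(hidden[800:], orientation[800:]))

Tracking this decoding accuracy across layers and timesteps is what lets the authors claim that auxiliary-variable information increases over time in the recurrent flow.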
    Tight Last-Iterate Convergence of the Extragradient and the Optimistic Gradient Descent-Ascent Algorithm for Constrained Monotone Variational Inequalities. (arXiv:2204.09228v2 [math.OC] UPDATED)
    The monotone variational inequality is a central problem in mathematical programming that unifies and generalizes many important settings such as smooth convex optimization, two-player zero-sum games, convex-concave saddle point problems, etc. The extragradient algorithm by Korpelevich [1976] and the optimistic gradient descent-ascent algorithm by Popov [1980] are arguably the two most classical and popular methods for solving monotone variational inequalities. Despite its long history, the following major problem remains open. What is the last-iterate convergence rate of the extragradient algorithm or the optimistic gradient descent-ascent algorithm for monotone and Lipschitz variational inequalities with constraints? We resolve this open problem by showing that both the extragradient algorithm and the optimistic gradient descent-ascent algorithm have a tight $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence rate for arbitrary convex feasible sets, which matches the lower bound by Golowich et al. [2020a, b]. Our rate is measured in terms of the standard gap function. At the core of our results lies a new performance measure -- the tangent residual, which can be viewed as an adaptation of the norm of the operator that takes the local constraints into account. We use the tangent residual (or a slight variation of the tangent residual) as the performance measure in our analysis of the extragradient algorithm (or the optimistic gradient descent-ascent algorithm). To establish the monotonicity of these performance measures, we develop a new approach that combines the power of the sum-of-squares programming with the low dimensionality of the update rule of the extragradient or the optimistic gradient descent-ascent algorithm. We believe our approach has many additional applications in the analysis of iterative methods.
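For reference, the extragradient update analyzed here takes the standard two-step form, with operator $F$, step size $\eta$, and projection $\Pi_{\mathcal{Z}}$ onto the feasible set $\mathcal{Z}$:

    $z_{t+1/2} = \Pi_{\mathcal{Z}}\left(z_t - \eta F(z_t)\right), \qquad z_{t+1} = \Pi_{\mathcal{Z}}\left(z_t - \eta F(z_{t+1/2})\right)$

The optimistic gradient descent-ascent variant avoids the second operator evaluation per iteration by reusing the operator value from the previous step.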
    Labeled sample compression schemes for complexes of oriented matroids. (arXiv:2110.15168v2 [math.CO] UPDATED)
    We show that the topes of a complex of oriented matroids (abbreviated COM) of VC-dimension $d$ admit a proper labeled sample compression scheme of size $d$. This considerably extends results of Moran and Warmuth on ample classes, of Ben-David and Litman on affine arrangements of hyperplanes, and of the authors on complexes of uniform oriented matroids, and is a step towards the sample compression conjecture -- one of the oldest open problems in computational learning theory. On the one hand, our approach exploits the rich combinatorial cell structure of COMs via oriented matroid theory. On the other hand, viewing tope graphs of COMs as partial cubes creates a fruitful link to metric graph theory.
    Memory-Efficient Convex Optimization for Self-Dictionary Separable Nonnegative Matrix Factorization: A Frank-Wolfe Approach. (arXiv:2109.11135v2 [eess.SP] UPDATED)
    Nonnegative matrix factorization (NMF) often relies on the separability condition for tractable algorithm design. Separability-based NMF is mainly handled by two types of approaches, namely, greedy pursuit and convex programming. A notable convex NMF formulation is the so-called self-dictionary multiple measurement vectors (SD-MMV), which can work without knowing the matrix rank a priori, and is arguably more resilient to error propagation relative to greedy pursuit. However, convex SD-MMV renders a large memory cost that scales quadratically with the problem size. This memory challenge has been around for a decade, and a major obstacle for applying convex SD-MMV to big data analytics. This work proposes a memory-efficient algorithm for convex SD-MMV. Our algorithm capitalizes on the special update rules of a classic algorithm from the 1950s, namely, the Frank-Wolfe (FW) algorithm. It is shown that, under reasonable conditions, the FW algorithm solves the noisy SD-MMV problem with a memory cost that grows linearly with the amount of data. To handle noisier scenarios, a smoothed group sparsity regularizer is proposed to improve robustness while maintaining the low memory footprint with guarantees. The proposed approach presents the first linear memory complexity algorithmic framework for convex SD-MMV based NMF. The method is tested over a couple of unsupervised learning tasks, i.e., text mining and community detection, to showcase its effectiveness and memory efficiency.
    Text Simplification by Tagging. (arXiv:2103.05070v1 [cs.CL] CROSS LISTED)
    Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.
    VOS: Learning What You Don't Know by Virtual Outlier Synthesis. (arXiv:2202.01197v4 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of neural networks. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice. In this paper, we present VOS, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model's decision boundary during training. Specifically, VOS samples virtual outliers from the low-likelihood region of the class-conditional distribution estimated in the feature space. Alongside, we introduce a novel unknown-aware training objective, which contrastively shapes the uncertainty space between the ID data and synthesized outlier data. VOS achieves competitive performance on both object detection and image classification models, reducing the FPR95 by up to 9.36% compared to the previous best method on object detectors. Code is available at https://github.com/deeplearning-wisc/vos.
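A sketch of the outlier-synthesis step (not the authors' implementation): fit a class-conditional Gaussian to penultimate-layer features, sample candidates, and keep only those in the low-likelihood region. The feature dimensions and sample counts below are illustrative.

    import torch
    from torch.distributions import MultivariateNormal

    feats = torch.randn(500, 64)          # stand-in: ID features of one class
    mean = feats.mean(dim=0)
    cov = torch.cov(feats.T) + 1e-4 * torch.eye(64)   # regularized covariance

    dist = MultivariateNormal(mean, covariance_matrix=cov)
    candidates = dist.sample((10000,))
    logp = dist.log_prob(candidates)
    # Keep the k lowest-likelihood samples as virtual outliers; these then
    # feed the unknown-aware training objective described in the abstract.
    virtual_outliers = candidates[logp.topk(100, largest=False).indices]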
    Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting. (arXiv:2010.04456v6 [stat.ML] UPDATED)
Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists of decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model, no more, no less. This not only provides the existence and uniqueness for this decomposition, but also ensures interpretability and benefits generalization. Experiments on three important use cases, each representative of a different family of phenomena, i.e. reaction-diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters. Code is available at https://github.com/yuan-yin/APHYNITY .
    Human-robot collaboration and machine learning: a systematic review of recent research. (arXiv:2110.07448v3 [cs.RO] UPDATED)
Technological progress increasingly envisions the use of robots interacting with people in everyday life. Human-robot collaboration (HRC) is the approach that explores the interaction between a human and a robot, during the completion of a common objective, at the cognitive and physical level. In HRC works, a cognitive model is typically built, which collects inputs from the environment and from the user, elaborates and translates these into information that can be used by the robot itself. Machine learning is a recent approach to build the cognitive model and behavioural block, with high potential in HRC. Consequently, this paper proposes a thorough literature review of the use of machine learning techniques in the context of human-robot collaboration. 45 key papers were selected and analysed, and a clustering of works based on the type of collaborative tasks, evaluation metrics and cognitive variables modelled is proposed. Then, a deep analysis on different families of machine learning algorithms and their properties, along with the sensing modalities used, is carried out. Among the observations, the importance of machine learning algorithms that incorporate time dependencies is outlined. The salient features of these works are then cross-analysed to show trends in HRC and give guidelines for future works, comparing them with other aspects of HRC not appeared in the review.
    Semi-Targeted Model Poisoning Attack on Federated Learning via Backward Error Analysis. (arXiv:2203.11633v2 [cs.LG] UPDATED)
Model poisoning attacks on federated learning (FL) intrude in the entire system via compromising an edge model, resulting in malfunctioning of machine learning models. Such compromised models are tampered with to perform adversary-desired behaviors. In particular, we considered a semi-targeted situation where the source class is predetermined but the target class is not. The goal is to cause the global classifier to misclassify data of the source class. Though approaches such as label flipping have been adopted to inject poisoned parameters into FL, it has been shown that their performances are usually class-sensitive, varying with different target classes applied. Typically, an attack can become less effective when shifting to a different target class. To overcome this challenge, we propose the Attacking Distance-aware Attack (ADA) to enhance a poisoning attack by finding the optimized target class in the feature space. Moreover, we studied a more challenging situation where an adversary had limited prior knowledge about a client's data. To tackle this problem, ADA deduces pair-wise distances between different classes in the latent feature space from shared model parameters based on the backward error analysis. We performed extensive empirical evaluations on ADA by varying the factor of attacking frequency in three different image classification tasks. As a result, ADA succeeded in increasing the attack performance by 1.8 times in the most challenging case with an attacking frequency of 0.01.
    Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation. (arXiv:2204.01171v2 [cs.CL] UPDATED)
    Current language generation models suffer from issues such as repetition, incoherence, and hallucinations. An often-repeated hypothesis is that this brittleness of generation models is caused by the training and the generation procedure mismatch, also referred to as exposure bias. In this paper, we verify this hypothesis by analyzing exposure bias from an imitation learning perspective. We show that exposure bias leads to an accumulation of errors, analyze why perplexity fails to capture this accumulation, and empirically show that this accumulation results in poor generation quality. Source code to reproduce these experiments is available at https://github.com/kushalarora/quantifying_exposure_bias
    Verifying Integrity of Deep Ensemble Models by Lossless Black-box Watermarking with Sensitive Samples. (arXiv:2205.04145v2 [cs.CR] UPDATED)
    With the widespread use of deep neural networks (DNNs) in many areas, more and more studies focus on protecting DNN models from intellectual property (IP) infringement. Many existing methods apply digital watermarking to protect the DNN models. The majority of them either embed a watermark directly into the internal network structure/parameters or insert a zero-bit watermark by fine-tuning a model to be protected with a set of so-called trigger samples. Though these methods work very well, they were designed for individual DNN models, which cannot be directly applied to deep ensemble models (DEMs) that combine multiple DNN models to make the final decision. It motivates us to propose a novel black-box watermarking method in this paper for DEMs, which can be used for verifying the integrity of DEMs. In the proposed method, a certain number of sensitive samples are carefully selected through mimicking real-world DEM attacks and analyzing the prediction results of the sub-models of the non-attacked DEM and the attacked DEM on the carefully crafted dataset. By analyzing the prediction results of the target DEM on these carefully crafted sensitive samples, we are able to verify the integrity of the target DEM. Different from many previous methods, the proposed method does not modify the original DEM to be protected, which indicates that the proposed method is lossless. Experimental results have shown that the DEM integrity can be reliably verified even if only one sub-model was attacked, which has good potential in practice.
    Exploring Viable Algorithmic Options for Learning from Demonstration (LfD): A Parameterized Complexity Approach. (arXiv:2205.04989v1 [cs.LG])
The key to reconciling the polynomial-time intractability of many machine learning tasks in the worst case with the surprising solvability of these tasks by heuristic algorithms in practice seems to be exploiting restrictions on real-world data sets. One approach to investigating such restrictions is to analyze why heuristics perform well under restrictions. A complementary approach would be to systematically determine under which sets of restrictions efficient and reliable machine learning algorithms do and do not exist. In this paper, we show how such a systematic exploration of algorithmic options can be done using parameterized complexity analysis. As an illustrative example, we give the first parameterized complexity analysis of batch and incremental policy inference under Learning from Demonstration (LfD). Relative to a basic model of LfD, we show that none of our problems can be solved efficiently either in general or relative to a number of (often simultaneous) restrictions on environments, demonstrations, and policies. We also give the first known restrictions under which efficient solvability is possible and discuss the implications of our solvability and unsolvability results for both our basic model of LfD and more complex models of LfD used in practice.  ( 2 min )
    Representing Hierarchical Structure by Using Cone Embedding. (arXiv:2102.08014v2 [cs.AI] UPDATED)
Graph embedding is becoming an important method with applications in various areas, including social networks and knowledge graph completion. In particular, Poincar\'e embedding has been proposed to capture the hierarchical structure of graphs, and its effectiveness has been reported. However, most of the existing methods have isometric mappings in the embedding space, and the choice of the origin point can be arbitrary. This fact is not desirable when the distance from the origin is used as an indicator of hierarchy, as in the case of Poincar\'e embedding. In this paper, we propose cone embedding, an embedding method in a metric cone, which solves these problems and gains further benefits: 1) we provide an indicator of hierarchical information that is both geometrically and intuitively natural to interpret, and 2) we can extract the hierarchical structure from a graph embedding output of other methods by learning additional one-dimensional parameters.
    Entity Linking and Discovery via Arborescence-based Supervised Clustering. (arXiv:2109.01242v2 [cs.CL] UPDATED)
    Previous work has shown promising results in performing entity linking by measuring not only the affinities between mentions and entities but also those amongst mentions. In this paper, we present novel training and inference procedures that fully utilize mention-to-mention affinities by building minimum arborescences (i.e., directed spanning trees) over mentions and entities across documents in order to make linking decisions. We also show that this method gracefully extends to entity discovery, enabling the clustering of mentions that do not have an associated entity in the knowledge base. We evaluate our approach on the Zero-Shot Entity Linking dataset and MedMentions, the largest publicly available biomedical dataset, and show significant improvements in performance for both entity linking and discovery compared to identically parameterized models. We further show significant efficiency improvements with only a small loss in accuracy over previous work, which use more computationally expensive models.
    Learning to Answer Visual Questions from Web Videos. (arXiv:2205.05019v1 [cs.CV])
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results, in particular for rare answers. Furthermore, our method achieves competitive results on MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our VideoQA dataset generation approach generalizes to another source of web video and text data. We use our method to generate a new VideoQA dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation we introduce a new VideoQA dataset with reduced language bias and high-quality manual annotations. Code, datasets and trained models are available at https://antoyang.github.io/just-ask.html
    Fundamental limitations on optimization in variational quantum algorithms. (arXiv:2205.05056v1 [quant-ph])
    Exploring quantum applications of near-term quantum devices is a rapidly growing field of quantum information science with both theoretical and practical interests. A leading paradigm to establish such near-term quantum applications is variational quantum algorithms (VQAs). These algorithms use a classical optimizer to train a parameterized quantum circuit to accomplish certain tasks, where the circuits are usually randomly initialized. In this work, we prove that for a broad class of such random circuits, the variation range of the cost function via adjusting any local quantum gate within the circuit vanishes exponentially in the number of qubits with a high probability. This result can unify the restrictions on gradient-based and gradient-free optimizations in a natural manner and reveal extra harsh constraints on the training landscapes of VQAs. Hence a fundamental limitation on the trainability of VQAs is unraveled, indicating the essence of the optimization hardness in the Hilbert space with exponential dimension. We further showcase the validity of our results with numerical simulations of representative VQAs. We believe that these results would deepen our understanding of the scalability of VQAs and shed light on the search for near-term quantum applications with advantages.
    Adaptive Ranking Based Constraint Handling for Explicitly Constrained Black-Box Optimization. (arXiv:1811.00764v3 [cs.NE] UPDATED)
    We propose a novel constraint-handling technique for the covariance matrix adaptation evolution strategy (CMA-ES). The proposed technique is aimed at solving explicitly constrained black-box continuous optimization problems, in which the explicit constraint is a constraint whereby the computational time for the constraint violation and its (numerical) gradient are negligible compared to that for the objective function. This method is designed to realize two invariance properties: invariance to the affine transformation of the search space, and invariance to the increasing transformation of the objective and constraint functions. The CMA-ES is designed to possess these properties for handling difficulties that appear in black-box optimization problems, such as non-separability, ill-conditioning, ruggedness, and the different orders of magnitude in the objective. The proposed constraint-handling technique (CHT), known as ARCH, modifies the underlying CMA-ES only in terms of the ranking of the candidate solutions. It employs a repair operator and an adaptive ranking aggregation strategy to compute the ranking. We developed test problems to evaluate the effects of the invariance properties, and performed experiments to empirically verify the invariance of the algorithm. We compared the proposed method with other CHTs on the CEC 2006 constrained optimization benchmark suite to demonstrate its efficacy. Empirical studies reveal that ARCH is able to exploit the explicitness of the constraint functions effectively, sometimes even more efficiently than an existing box-constraint handling technique on box-constrained problems, while exhibiting the invariance properties. Moreover, ARCH overwhelmingly outperforms CHTs by not exploiting the explicit constraints in terms of the number of objective function calls.
    Hybrid Far- and Near-Field Channel Estimation for THz Ultra-Massive MIMO via Fixed Point Networks. (arXiv:2205.04944v1 [eess.SP])
    Terahertz ultra-massive multiple-input multiple-output (THz UM-MIMO) is envisioned as one of the key enablers of 6G wireless systems. Due to the joint effect of its large array aperture and small wavelength, the near-field region of THz UM-MIMO systems is greatly enlarged. The high-dimensional channel of such systems thus consists of a stochastic mixture of far and near fields, which renders channel estimation extremely challenging. Previous works based on uni-field assumptions cannot capture the hybrid far- and near-field features, and will suffer significant performance loss. This motivates us to consider hybrid-field channel estimation. We draw inspirations from fixed point theory to develop an efficient deep learning based channel estimator with adaptive complexity and linear convergence guarantee. Built upon classic orthogonal approximate message passing, we transform each iteration into a contractive mapping, comprising a closed-form linear estimator and a neural network based non-linear estimator. A major algorithmic innovation involves applying fixed point iteration to compute the channel estimate while modeling neural networks with arbitrary depth and adapting to the hybrid-field channel conditions. Simulation results will verify our theoretical analysis and show significant performance gains over state-of-the-art approaches in the estimation accuracy and convergence rate.
    Impact of L1 Batch Normalization on Analog Noise Resistant Property of Deep Learning Models. (arXiv:2205.04886v1 [cs.LG])
Analog hardware has become a popular choice for machine learning on resource-constrained devices recently due to its fast execution and energy efficiency. However, the inherent presence of noise in analog hardware and the negative impact of the noise on deployed deep neural network (DNN) models limit their usage. The degradation in performance due to the noise calls for the novel design of DNN models that have excellent noise-resistant properties, leveraging the properties of the fundamental building blocks of DNN models. In this work, the use of the L1 or TopK BatchNorm type, a fundamental DNN model building block, in designing DNN models with excellent noise-resistant properties is proposed. Specifically, a systematic study has been carried out by training DNN models with the L1/TopK BatchNorm type, and the performance is compared with DNN models with the L2 BatchNorm type. The resulting model noise-resistant property is tested by injecting additive noise into the model weights and evaluating the new model inference accuracy due to the noise. The results show that the L1 and TopK BatchNorm types have excellent noise-resistant properties, and there is no sacrifice in performance due to the change in the BatchNorm type from L2 to L1/TopK BatchNorm type.
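A sketch of the evaluation protocol described above: inject additive Gaussian noise into the trained weights and re-measure accuracy. The noise scaling below (relative to mean weight magnitude) is one plausible choice, not necessarily the paper's exact scheme.

    import copy
    import torch

    @torch.no_grad()
    def perturb(model, sigma=0.05):
        # Return a copy of the model with noise added to every weight tensor,
        # scaled by the mean absolute weight of that tensor.
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            p.add_(sigma * p.abs().mean() * torch.randn_like(p))
        return noisy

    # noisy_model = perturb(trained_model); then evaluate test accuracy as
    # usual and compare against the clean model across a sweep of sigma.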
    On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning. (arXiv:2105.01648v4 [cs.LG] UPDATED)
    The lottery ticket hypothesis questions the role of overparameterization in supervised deep learning. But how is the performance of winning lottery tickets affected by the distributional shift inherent to reinforcement learning problems? In this work, we address this question by comparing sparse agents that have to address the non-stationarity of the exploration-exploitation problem with supervised agents trained to imitate an expert. We show that feed-forward networks trained with behavioural cloning, compared to reinforcement learning, can be pruned to higher levels of sparsity without performance degradation. This suggests that, in order to cope with the RL-specific distributional shift, agents require more degrees of freedom. Using a set of carefully designed baseline conditions, we find that the majority of the lottery ticket effect in both learning paradigms can be attributed to the identified mask rather than the weight initialization. The input layer mask selectively prunes entire input dimensions that turn out to be irrelevant for the task at hand. At a moderate level of sparsity, the mask identified by iterative magnitude pruning yields minimal task-relevant representations, i.e., an interpretable inductive bias. Finally, we propose a simple initialization rescaling which promotes the robust identification of sparse task representations in low-dimensional control tasks.
    Text-Free Prosody-Aware Generative Spoken Language Modeling. (arXiv:2109.03264v2 [cs.CL] UPDATED)
    Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) \cite{Lakhotia2021} is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need for text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm. Codes and models are available at https://github.com/pytorch/fairseq/tree/main/examples/textless_nlp/pgslm.
    Adaptive Graph Convolutional Network Framework for Multidimensional Time Series Prediction. (arXiv:2205.04885v1 [cs.LG])
    In the real world, long sequence time-series forecasting (LSTF) is needed in many applications, such as power consumption prediction and air quality prediction. Multidimensional long time series impose stricter requirements on the model, which must not only capture the long-term dependence between input and output effectively, but also capture the relationships among data of different dimensions. Recent research shows that the Transformer-based Informer model achieves excellent performance in long time series prediction. However, this model still has deficiencies in multidimensional prediction: it cannot capture the relationships between different dimensions well. We improve Informer to address these shortcomings in multidimensional forecasting. First, we introduce an adaptive graph neural network to capture hidden dependencies among dimensions in time series prediction. Second, we integrate adaptive graph convolutional networks into various spatio-temporal series prediction models to remedy their inability to capture relationships between different dimensions. Third, in experiments on multiple datasets, introducing the module into the model improves the accuracy of our framework by about 10\%.
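    As a minimal sketch of what an adaptive graph module for dimension dependencies can look like, the fragment below learns a self-adaptive adjacency from node embeddings in the style of Graph WaveNet; the class and parameter names are assumptions of this sketch, not the paper's implementation.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AdaptiveGraphConv(nn.Module):
            """Sketch: learn an adjacency over series dimensions from node
            embeddings, then mix information across dimensions with it."""
            def __init__(self, num_nodes, in_dim, out_dim, emb_dim=16):
                super().__init__()
                self.src_emb = nn.Parameter(torch.randn(num_nodes, emb_dim))
                self.dst_emb = nn.Parameter(torch.randn(num_nodes, emb_dim))
                self.proj = nn.Linear(in_dim, out_dim)

            def forward(self, x):  # x: (batch, num_nodes, in_dim)
                # Self-adaptive adjacency: non-negative scores between every
                # pair of dimensions, row-normalized.
                logits = F.relu(self.src_emb @ self.dst_emb.t())
                adj = F.softmax(logits, dim=1)               # (N, N)
                mixed = torch.einsum("ij,bjd->bid", adj, x)  # aggregate across dimensions
                return self.proj(mixed)

        # Usage sketch: 8 series dimensions, 32 features per step.
        layer = AdaptiveGraphConv(num_nodes=8, in_dim=32, out_dim=64)
        out = layer(torch.randn(4, 8, 32))  # -> (4, 8, 64)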
    An overview of artificial intelligence techniques for diagnosis of Schizophrenia based on magnetic resonance imaging modalities: Methods, challenges, and future works. (arXiv:2103.03081v3 [cs.LG] UPDATED)
    Schizophrenia (SZ) is a mental disorder that typically emerges in late adolescence or early adulthood. It reduces the life expectancy of patients by 15 years. Abnormal behavior, altered perception of emotions, impaired social relationships, and distorted perception of reality are among its most significant symptoms. Past studies have revealed that SZ affects the temporal and anterior lobes as well as the hippocampal regions of the brain. Also, increased volume of cerebrospinal fluid (CSF) and decreased volume of white and gray matter can be observed due to this disease. Magnetic resonance imaging (MRI) is a popular neuroimaging technique used to explore structural/functional brain abnormalities in SZ disorder, owing to its high spatial resolution. Various artificial intelligence (AI) techniques have been employed with advanced image/signal processing methods to accurately diagnose SZ. This paper presents a comprehensive overview of studies conducted on the automated diagnosis of SZ using MRI modalities. First, an AI-based computer-aided diagnosis system (CADS) for SZ diagnosis and its relevant sections are presented. Then, the most important conventional machine learning (ML) and deep learning (DL) techniques for diagnosing SZ are introduced. A comprehensive comparison is also made between ML and DL studies in the discussion section. Next, the most important challenges in diagnosing SZ are addressed. Future work on diagnosing SZ using AI techniques and MRI modalities is recommended in a separate section. Results, conclusions, and research findings are also presented at the end.
    On learning agent-based models from data. (arXiv:2205.05052v1 [physics.soc-ph])
    Agent-Based Models (ABMs) are used in several fields to study the evolution of complex systems from micro-level assumptions. However, ABMs typically cannot estimate agent-specific (or "micro") variables: this is a major limitation which prevents ABMs from harnessing micro-level data availability and which greatly limits their predictive power. In this paper, we propose a protocol to learn the latent micro-variables of an ABM from data. The first step of our protocol is to reduce an ABM to a probabilistic model, characterized by a computationally tractable likelihood. This reduction follows two general design principles: balance of stochasticity and data availability, and replacement of unobservable discrete choices with differentiable approximations. Then, our protocol proceeds by maximizing the likelihood of the latent variables via a gradient-based expectation maximization algorithm. We demonstrate our protocol by applying it to an ABM of the housing market, in which agents with different incomes bid higher prices to live in high-income neighborhoods. We show that the obtained model allows accurate estimates of the latent variables, while preserving the general behavior of the ABM. We also show that our estimates can be used for out-of-sample forecasting. Our protocol can be seen as an alternative to black-box data assimilation methods, one that forces the modeler to lay bare the assumptions of the model, to think about the inferential process, and to spot potential identification problems.
    Secure Distributed/Federated Learning: Prediction-Privacy Trade-Off for Multi-Agent System. (arXiv:2205.04855v1 [cs.MA])
    Decentralized learning is an efficient emerging paradigm for boosting the computing capability of multiple bounded computing agents. In the big data era, when performing inference within the distributed and federated learning (DL and FL) frameworks, the central server needs to process a large amount of data while relying on various agents to perform multiple distributed training tasks. Given the decentralized computing topology, privacy has become a first-class concern. Moreover, the limited information processing capability of the agents calls for a sophisticated \textit{privacy-preserving decentralization} that ensures efficient computation. To this end, we study the \textit{privacy-aware server-to-multi-agent assignment} problem subject to the information processing constraints associated with each agent, while maintaining privacy and ensuring that agents receive informative messages about a global terminal through the distributed private federated learning (DPFL) approach. To find a decentralized scheme for a two-agent system, we formulate an optimization problem that balances privacy and accuracy, taking into account the quality-of-compression constraints associated with each agent. We propose an iterative algorithm that converges by alternating over self-consistent equations. We also numerically evaluate the proposed solution to show the privacy-prediction trade-off and demonstrate the efficacy of the novel approach in ensuring privacy in DL and FL.
    On the Verge of Solving Rocket League using Deep Reinforcement Learning and Sim-to-sim Transfer. (arXiv:2205.05061v1 [cs.LG])
    Autonomously trained agents that are supposed to play video games reasonably well rely either on fast simulation speeds or heavy parallelization across thousands of machines running concurrently. This work explores a third way that is established in robotics, namely sim-to-real transfer, or if the game is considered a simulation itself, sim-to-sim transfer. In the case of Rocket League, we demonstrate that single behaviors of goalies and strikers can be successfully learned using Deep Reinforcement Learning in the simulation environment and transferred back to the original game. Although the implemented training simulation is to some extent inaccurate, the goalkeeping agent saves nearly 100% of its faced shots once transferred, while the striking agent scores in about 75% of cases. Therefore, the trained agent is robust enough and able to generalize to the target domain of Rocket League.
    Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path. (arXiv:2106.02073v4 [cs.LG] UPDATED)
    The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSE-trained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide, in a Google Colab notebook, PyTorch code for reproducing MSE-NC and CE-NC: at https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.
    Concave Utility Reinforcement Learning with Zero-Constraint Violations. (arXiv:2109.05439v2 [cs.LG] UPDATED)
    We consider the problem of tabular infinite-horizon concave utility reinforcement learning (CURL) with convex constraints. Various learning applications with constraints, such as robotics, do not allow for policies that can violate constraints. To this end, we propose a model-based learning algorithm that achieves zero constraint violations. To obtain this result, we assume that the concave objective and the convex constraints have a solution interior to the set of feasible occupation measures. We then solve a tighter optimization problem to ensure that the constraints are never violated despite the imprecise model knowledge and model stochasticity. We also propose a novel Bellman-error-based analysis for tabular infinite-horizon setups, which allows the analysis of stochastic policies. Combining the Bellman-error-based analysis and the tighter optimization problem, for $T$ interactions with the environment, we obtain a regret guarantee for the objective that grows as $\tilde{O}(1/\sqrt{T})$, excluding other factors.
    Search-Based Testing of Reinforcement Learning. (arXiv:2205.04887v1 [cs.LG])
    Evaluation of deep reinforcement learning (RL) is inherently challenging. Especially the opaqueness of learned policies and the stochastic nature of both agents and environments make testing the behavior of deep RL agents difficult. We present a search-based testing framework that enables a wide range of novel analysis capabilities for evaluating the safety and performance of deep RL agents. For safety testing, our framework utilizes a search algorithm that searches for a reference trace that solves the RL task. The backtracking states of the search, called boundary states, pose safety-critical situations. We create safety test-suites that evaluate how well the RL agent escapes safety-critical situations near these boundary states. For robust performance testing, we create a diverse set of traces via fuzz testing. These fuzz traces are used to bring the agent into a wide variety of potentially unknown states from which the average performance of the agent is compared to the average performance of the fuzz traces. We apply our search-based testing approach on RL for Nintendo's Super Mario Bros.
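    A hedged sketch of the boundary-state idea: a depth-first search for a reference trace that records every state from which it has to backtrack; all callables (actions, step, is_goal, is_dead) are assumed interfaces of this sketch rather than the framework's API.
        def search_reference_trace(state, actions, step, is_goal, is_dead,
                                   boundary, depth=0, max_depth=200):
            """Sketch: depth-first search for a trace solving the task.
            States where every action fails and the search backtracks are
            recorded as boundary (safety-critical) states."""
            if is_goal(state):
                return [state]
            if is_dead(state) or depth >= max_depth:
                return None
            for a in actions(state):
                trace = search_reference_trace(step(state, a), actions, step,
                                               is_goal, is_dead, boundary,
                                               depth + 1, max_depth)
                if trace is not None:
                    return [state] + trace
            boundary.append(state)   # every action failed here -> backtrack
            return None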
    Domain Generalization: A Survey. (arXiv:2103.02503v5 [cs.LG] UPDATED)
    Generalization to out-of-distribution (OOD) data is a capability natural to humans yet challenging for machines to reproduce. This is because most learning algorithms strongly rely on the i.i.d.~assumption on source/target data, which is often violated in practice due to domain shift. Domain generalization (DG) aims to achieve OOD generalization by using only source data for model learning. Over the last ten years, research in DG has made great progress, leading to a broad spectrum of methodologies, e.g., those based on domain alignment, meta-learning, data augmentation, or ensemble learning, to name a few; DG has also been studied in various application areas including computer vision, speech recognition, natural language processing, medical imaging, and reinforcement learning. In this paper, for the first time a comprehensive literature review in DG is provided to summarize the developments over the past decade. Specifically, we first cover the background by formally defining DG and relating it to other relevant fields like domain adaptation and transfer learning. Then, we conduct a thorough review into existing methods and theories. Finally, we conclude this survey with insights and discussions on future research directions.
    Online AutoML: An adaptive AutoML framework for online learning. (arXiv:2201.09750v2 [cs.LG] UPDATED)
    Automated Machine Learning (AutoML) has been used successfully in settings where the learning task is assumed to be static. In many real-world scenarios, however, the data distribution will evolve over time, and it is yet to be shown whether AutoML techniques can effectively design online pipelines in dynamic environments. This study aims to automate pipeline design for online learning while continuously adapting to data drift. For this purpose, we design an adaptive Online Automated Machine Learning (OAML) system, searching the complete pipeline configuration space of online learners, including preprocessing algorithms and ensembling techniques. This system combines the inherent adaptation capabilities of online learners with the fast automated pipeline (re)optimization capabilities of AutoML. Focusing on optimization techniques that can adapt to evolving objectives, we evaluate asynchronous genetic programming and asynchronous successive halving to optimize these pipelines continually. We experiment on real and artificial data streams with varying types of concept drift to test the performance and adaptation capabilities of the proposed system. The results confirm the utility of OAML over popular online learning algorithms and underscore the benefits of continuous pipeline redesign in the presence of data drift.
    Cluster-based Input Weight Initialization for Echo State Networks. (arXiv:2103.04710v3 [cs.LG] CROSS LISTED)
    Echo State Networks (ESNs) are a special type of recurrent neural network (RNN), in which the input and recurrent connections are traditionally generated randomly, and only the output weights are trained. Despite the recent success of ESNs in various tasks of audio, image and radar recognition, we postulate that a purely random initialization is not the ideal way of initializing ESNs. The aim of this work is to propose an unsupervised initialization of the input connections using the $K$-Means algorithm on the training data. We show that for a large variety of datasets this initialization performs equivalently to or better than a randomly initialized ESN while needing significantly fewer reservoir neurons. Furthermore, we discuss that this approach provides the opportunity to estimate a suitable size of the reservoir based on prior knowledge about the data.
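    A minimal sketch of the initialization: run $K$-Means on the training inputs and use the cluster centers as the rows of the input weight matrix. The row normalization and scaling conventions here are assumptions of this sketch.
        import numpy as np
        from sklearn.cluster import KMeans

        def kmeans_input_weights(X, n_reservoir, scale=1.0, seed=0):
            """X: (n_samples, n_inputs) training inputs.
            Returns an (n_reservoir, n_inputs) input weight matrix."""
            km = KMeans(n_clusters=n_reservoir, n_init=10, random_state=seed).fit(X)
            W_in = km.cluster_centers_
            # Normalize each row so every reservoir neuron gets comparable drive.
            norms = np.linalg.norm(W_in, axis=1, keepdims=True) + 1e-12
            return scale * W_in / norms

        X = np.random.randn(1000, 5)
        W_in = kmeans_input_weights(X, n_reservoir=50)
        print(W_in.shape)  # (50, 5)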
    Turtle Score -- Similarity Based Developer Analyzer. (arXiv:2205.04876v1 [stat.ML])
    In day-to-day life, a highly demanding task for IT companies is to find the right candidates who fit the company's culture. This research aims to comprehend, analyze and automatically produce convincing outcomes for finding candidates who fit a company well. Data is collected and examined for each employee working in the IT domain, focusing on their performance measures across various categories, which brings versatility and a wide view of focus. On this data, learner analysis is performed using machine learning algorithms to obtain learner similarity and developer similarity, in order to recruit people with similar working patterns. It has been shown that the efficiency and capability of a particular worker increase when working with a person of similar personality. Therefore, this will serve as a useful tool for recruiters who aim to recruit people with high productivity. The designed model is intended to render the best possible outcome, with high accuracy and a reliable recommendation score.
    A Wasserstein distance approach for concentration of empirical risk estimates. (arXiv:1902.10709v4 [math.ST] UPDATED)
    This paper presents a unified approach based on Wasserstein distance to derive concentration bounds for empirical estimates for two broad classes of risk measures defined in the paper. The classes of risk measures introduced include as special cases well known risk measures from the finance literature such as conditional value at risk (CVaR), optimized certainty equivalent risk, spectral risk measures, utility-based shortfall risk, cumulative prospect theory (CPT) value, rank dependent expected utility and distorted risk measures. Two estimation schemes are considered, one for each class of risk measures. One estimation scheme involves applying the risk measure to the empirical distribution function formed from a collection of i.i.d. samples of the random variable (r.v.), while the second scheme involves applying the same procedure to a truncated sample. The bounds provided apply to three popular classes of distributions, namely sub-Gaussian, sub-exponential and heavy-tailed distributions. The bounds are derived by first relating the estimation error to the Wasserstein distance between the true and empirical distributions, and then using recent concentration bounds for the latter. Previous concentration bounds are available only for specific risk measures such as CVaR and CPT-value. The bounds derived in this paper are shown to either match or improve upon previous bounds in cases where they are available. The usefulness of the bounds is illustrated through an algorithm and the corresponding regret bound for a stochastic bandit problem involving a general risk measure from each of the two classes introduced in the paper.
    Don't Throw it Away! The Utility of Unlabeled Data in Fair Decision Making. (arXiv:2205.04790v1 [stat.ML])
    Decision making algorithms, in practice, are often trained on data that exhibits a variety of biases. Decision-makers often aim to take decisions based on some ground-truth target that is assumed or expected to be unbiased, i.e., equally distributed across socially salient groups. In many practical settings, the ground-truth cannot be directly observed, and instead, we have to rely on a biased proxy measure of the ground-truth, i.e., biased labels, in the data. In addition, data is often selectively labeled, i.e., even the biased labels are only observed for a small fraction of the data that received a positive decision. To overcome label and selection biases, recent work proposes to learn stochastic, exploring decision policies via i) online training of new policies at each time-step and ii) enforcing fairness as a constraint on performance. However, the existing approach uses only labeled data, disregarding a large amount of unlabeled data, and thereby suffers from high instability and variance in the learned decision policies at different times. In this paper, we propose a novel method based on a variational autoencoder for practical fair decision-making. Our method learns an unbiased data representation leveraging both labeled and unlabeled data and uses the representations to learn a policy in an online process. Using synthetic data, we empirically validate that our method converges to the optimal (fair) policy according to the ground-truth with low variance. In real-world experiments, we further show that our training approach not only offers a more stable learning process but also yields policies with higher fairness as well as utility than previous approaches.
    Disentangling A Single MR Modality. (arXiv:2205.04982v1 [eess.IV])
    Disentangling anatomical and contrast information from medical images has gained attention recently, demonstrating benefits for various image analysis tasks. Current methods learn disentangled representations using either paired multi-modal images with the same underlying anatomy or auxiliary labels (e.g., manual delineations) to provide inductive bias for disentanglement. However, these requirements could significantly increase the time and cost in data collection and limit the applicability of these methods when such data are not available. Moreover, these methods generally do not guarantee disentanglement. In this paper, we present a novel framework that learns theoretically and practically superior disentanglement from single modality magnetic resonance images. Moreover, we propose a new information-based metric to quantitatively evaluate disentanglement. Comparisons over existing disentangling methods demonstrate that the proposed method achieves superior performance in both disentanglement and cross-domain image-to-image translation tasks.
    On Recurrent Neural Networks for learning-based control: recent results and ideas for future developments. (arXiv:2111.13557v2 [eess.SY] UPDATED)
    This paper aims to discuss and analyze the potential of Recurrent Neural Networks (RNNs) in control design applications. The main families of RNNs are considered, namely Neural Nonlinear AutoRegressive eXogenous (NNARX), Echo State Networks (ESN), Long Short Term Memory (LSTM), and Gated Recurrent Units (GRU). The goal is twofold. Firstly, to survey recent results concerning the training of RNNs that enjoy Input-to-State Stability (ISS) and Incremental Input-to-State Stability ($\delta$ISS) guarantees. Secondly, to discuss the issues that still hinder the widespread use of RNNs for control, namely their robustness, verifiability, and interpretability. The former properties are related to the so-called generalization capabilities of the networks, i.e., their consistency with the underlying real plants, even in the presence of unseen or perturbed input trajectories. The latter is instead related to the possibility of providing a clear formal connection between the RNN model and the plant. In this context, we illustrate how ISS and $\delta$ISS represent a significant step towards the robustness and verifiability of RNN models, while the requirement of interpretability paves the way to the use of physics-based networks. The design of model predictive controllers with an RNN as the plant model is also briefly discussed. Lastly, some of the main topics of the paper are illustrated on a simulated chemical system.
    Spike-based computational models of bio-inspired memories in the hippocampal CA3 region on SpiNNaker. (arXiv:2205.04782v1 [cs.NE])
    The human brain is the most powerful and efficient machine in existence today, surpassing in many ways the capabilities of modern computers. Currently, lines of research in neuromorphic engineering are trying to develop hardware that mimics the functioning of the brain to acquire these superior capabilities. One of the areas still under development is the design of bio-inspired memories, where the hippocampus plays an important role. This region of the brain acts as a short-term memory with the ability to store associations of information from different sensory streams in the brain and recall them later. This is possible thanks to the recurrent collateral network architecture that constitutes CA3, the main sub-region of the hippocampus. In this work, we developed two spike-based computational models of fully functional hippocampal bio-inspired memories for the storage and recall of complex patterns implemented with spiking neural networks on the SpiNNaker hardware platform. These models present different levels of biological abstraction, with the first model having a constant oscillatory activity closer to the biological model, and the second one having an energy-efficient regulated activity, which, although it is still bio-inspired, opts for a more functional approach. Different experiments were performed for each of the models, in order to test their learning/recalling capabilities. A comprehensive comparison between the functionality and the biological plausibility of the presented models was carried out, showing their strengths and weaknesses. The two models, which are publicly available for researchers, could pave the way for future spike-based implementations and applications.
    Gradient flows on graphons: existence, convergence, continuity equations. (arXiv:2111.09459v2 [math.PR] UPDATED)
    Wasserstein gradient flows on probability measures have found a host of applications in various optimization problems. They typically arise as the continuum limit of exchangeable particle systems evolving by some mean-field interaction involving a gradient-type potential. However, in many problems, such as in multi-layer neural networks, the so-called particles are edge weights on large graphs whose nodes are exchangeable. Such large graphs are known to converge to continuum limits called graphons as their sizes grow to infinity. We show that the Euclidean gradient flow of a suitable function of the edge weights converges to a novel continuum limit given by a curve on the space of graphons that can be appropriately described as a gradient flow or, more technically, a curve of maximal slope. Several natural functions on graphons, such as homomorphism functions and the scalar entropy, are covered by our set-up, and the examples have been worked out in detail.
    Classification and mapping of low-statured 'shrubland' cover types in post-agricultural landscapes of the US Northeast. (arXiv:2205.05047v1 [cs.CV])
    Context: Novel plant communities reshape landscapes and pose challenges for land cover classification and mapping that can constrain research and stewardship efforts. In the US Northeast, emergence of low-statured woody vegetation, or 'shrublands', instead of secondary forests in post-agricultural landscapes is well-documented by field studies, but poorly understood from a landscape perspective, which limits the ability to systematically study and manage these lands. Objectives: To address gaps in classification/mapping of low-statured cover types where they have been historically rare, we developed models to predict 'shrubland' distributions at 30m resolution across New York State (NYS), using machine learning and model ensembling techniques to integrate remote sensing of structural (airborne LIDAR) and optical (satellite imagery) properties of vegetation cover. We first classified a 1m canopy height model (CHM), derived from a "patchwork" of available LIDAR coverages, to define shrubland presence/absence. Next, these non-contiguous maps were used to train a model ensemble based on temporally-segmented imagery to predict 'shrubland' probability for the entire study landscape (NYS). Results: Approximately 2.5% of the CHM coverage area was classified as shrubland. Models using Landsat predictors trained on the classified CHM were effective at identifying shrubland (test set AUC=0.893, real-world AUC=0.904), in discriminating between shrub/young forest and other cover classes, and produced qualitatively sensible maps, even when extending beyond the original training data. Conclusions: After ground-truthing, we expect these shrubland maps and models will have many research and stewardship applications including wildlife conservation, invasive species mitigation and natural climate solutions.
    Long-term stability and generalization of observationally-constrained stochastic data-driven models for geophysical turbulence. (arXiv:2205.04601v1 [cs.LG])
    Recent years have seen a surge in interest in building deep learning-based fully data-driven models for weather prediction. Such deep learning models if trained on observations can mitigate certain biases in current state-of-the-art weather models, some of which stem from inaccurate representation of subgrid-scale processes. However, these data-driven models, being over-parameterized, require a lot of training data which may not be available from reanalysis (observational data) products. Moreover, an accurate, noise-free, initial condition to start forecasting with a data-driven weather model is not available in realistic scenarios. Finally, deterministic data-driven forecasting models suffer from issues with long-term stability and unphysical climate drift, which makes these data-driven models unsuitable for computing climate statistics. Given these challenges, previous studies have tried to pre-train deep learning-based weather forecasting models on a large amount of imperfect long-term climate model simulations and then re-train them on available observational data. In this paper, we propose a convolutional variational autoencoder-based stochastic data-driven model that is pre-trained on an imperfect climate model simulation from a 2-layer quasi-geostrophic flow and re-trained, using transfer learning, on a small number of noisy observations from a perfect simulation. This re-trained model then performs stochastic forecasting with a noisy initial condition sampled from the perfect simulation. We show that our ensemble-based stochastic data-driven model outperforms a baseline deterministic encoder-decoder-based convolutional model in terms of short-term skills while remaining stable for long-term climate simulations yielding accurate climatology.
    Hyperparameter optimization of hybrid quantum neural networks for car classification. (arXiv:2205.04878v1 [quant-ph])
    Image recognition is one of the primary applications of machine learning algorithms. Nevertheless, machine learning models used in modern image recognition systems consist of millions of parameters that usually require significant computational time to be adjusted. Moreover, adjustment of model hyperparameters leads to additional overhead. Because of this, new developments in machine learning models and hyperparameter optimization techniques are required. This paper presents a quantum-inspired hyperparameter optimization technique and a hybrid quantum-classical machine learning model for supervised learning. We benchmark our hyperparameter optimization method over standard black-box objective functions and observe performance improvements in the form of reduced expected run times and fitness in response to the growth in the size of the search space. We test our approaches in a car image classification task, and demonstrate a full-scale implementation of the hybrid quantum neural network model with the tensor train hyperparameter optimization. Our tests show a qualitative and quantitative advantage over the corresponding standard classical tabular grid search approach used with a deep neural network ResNet34. A classification accuracy of 0.97 was obtained by the hybrid model after 18 iterations, whereas the classical model achieved an accuracy of 0.92 after 75 iterations.
    PyRCN: A Toolbox for Exploration and Application of Reservoir Computing Networks. (arXiv:2103.04807v3 [cs.LG] UPDATED)
    Reservoir Computing Networks (RCNs) belong to a group of machine learning techniques that project the input space non-linearly into a high-dimensional feature space, where the underlying task can be solved linearly. Popular variants of RCNs are capable of solving complex tasks equivalently to widely used deep neural networks, but with a substantially simpler training paradigm based on linear regression. In this paper, we show how to uniformly describe RCNs with small and clearly defined building blocks, and we introduce the Python toolbox PyRCN (Python Reservoir Computing Networks) for optimizing, training and analyzing RCNs on arbitrarily large datasets. The tool is based on widely-used scientific packages and complies with the scikit-learn interface specification. It provides a platform for educational and exploratory analyses of RCNs, as well as a framework to apply RCNs on complex tasks including sequence processing. With a small number of building blocks, the framework allows the implementation of numerous different RCN architectures. We provide code examples on how to set up RCNs for time series prediction and for sequence classification tasks. PyRCN is around ten times faster than reference toolboxes on a benchmark task while requiring substantially less boilerplate code.
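    Since PyRCN follows the scikit-learn interface specification, setting up an RCN for one-step-ahead time-series prediction can look roughly like the sketch below; the class name ESNRegressor and its defaults are assumptions based on the paper's description and should be checked against the current PyRCN documentation.
        # Hedged sketch of a scikit-learn-style PyRCN workflow.
        import numpy as np
        from pyrcn.echo_state_network import ESNRegressor  # assumed import path

        X = np.sin(np.linspace(0, 20, 500)).reshape(-1, 1)
        y = np.roll(X, -1, axis=0)      # one-step-ahead prediction target

        esn = ESNRegressor()            # random reservoir, trained linear readout
        esn.fit(X[:400], y[:400])
        pred = esn.predict(X[400:])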
    Optimizing over an ensemble of neural networks. (arXiv:2112.07007v2 [cs.LG] UPDATED)
    We study optimization problems where the objective function is modeled through feedforward neural networks with rectified linear unit (ReLU) activation. Recent literature has explored the use of a single neural network to model either uncertain or complex elements within an objective function. However, it is well known that ensembles of neural networks produce more stable predictions and have better generalizability than models with single neural networks, which motivates the investigation of ensembles of neural networks rather than single neural networks in decision-making pipelines. We study how to incorporate a neural network ensemble as the objective function of an optimization model and explore computational approaches for the ensuing problem. We present a mixed-integer linear program based on existing popular big-M formulations for optimizing over a single neural network. We develop a two-phase approach for our model that combines preprocessing procedures to tighten bounds for critical neurons in the neural networks with a Lagrangian relaxation-based branch-and-bound approach. Experimental evaluations of our solution methods suggest that using ensembles of neural networks yields more stable and higher quality solutions, compared to single neural networks, and that our optimization algorithm outperforms (the adaption of) a state-of-the-art approach in terms of computational time and optimality gaps.
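    For reference, such formulations build on the standard big-M encoding of a single ReLU neuron. With pre-activation $w^\top x + b$, valid bounds $L \le w^\top x + b \le U$ (assuming $L < 0 < U$), a binary indicator $z \in \{0,1\}$, and output $y = \max(0, w^\top x + b)$, one imposes
        $$y \ge w^\top x + b, \qquad y \le w^\top x + b - L(1 - z), \qquad y \le Uz, \qquad y \ge 0.$$
    Setting $z = 1$ forces $y = w^\top x + b$, while $z = 0$ forces $y = 0$. Tightening the constants $L$ and $U$ for critical neurons, as the preprocessing phase above does, strengthens the linear relaxation of the resulting mixed-integer program.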
    DeepAuditor: Distributed Online Intrusion Detection System for IoT devices via Power Side-channel Auditing. (arXiv:2106.12753v3 [cs.CR] UPDATED)
    As the number of IoT devices has increased rapidly, IoT botnets have exploited the vulnerabilities of IoT devices. However, it is still challenging to detect the initial intrusion on IoT devices prior to massive attacks. Recent studies have utilized power side-channel information to identify this intrusion behavior on IoT devices but still lack accurate models for real-time, ubiquitous botnet detection. We propose the first online intrusion detection system, called DeepAuditor, for IoT devices via power auditing. To develop the real-time system, we propose a lightweight power auditing device called Power Auditor. We also design a distributed CNN classifier for online inference in a laboratory setting. In order to protect against data leakage and reduce networking redundancy, we then propose a privacy-preserving inference protocol based on Packed Homomorphic Encryption and a sliding-window protocol in our system. The classification accuracy and processing time were measured, and the proposed classifier outperformed a baseline classifier, especially against unseen patterns. We also demonstrate that the distributed CNN design is secure against any of the distributed components. Overall, the measurements demonstrate the feasibility of our real-time distributed system for intrusion detection on IoT devices.
    Adaptation Strategies for Automated Machine Learning on Evolving Data. (arXiv:2006.06480v3 [cs.LG] UPDATED)
    Automated Machine Learning (AutoML) systems have been shown to efficiently build good models for new datasets. However, it is often not clear how well they can adapt when the data evolves over time. The main goal of this study is to understand the effect of data stream challenges such as concept drift on the performance of AutoML methods, and which adaptation strategies can be employed to make them more robust. To that end, we propose 6 concept drift adaptation strategies and evaluate their effectiveness on different AutoML approaches. We do this for a variety of AutoML approaches for building machine learning pipelines, including those that leverage Bayesian optimization, genetic programming, and random search with automated stacking. These are evaluated empirically on real-world and synthetic data streams with different types of concept drift. Based on this analysis, we propose ways to develop more sophisticated and robust AutoML techniques.
    GRU-TV: Time- and velocity-aware GRU for patient representation on multivariate clinical time-series data. (arXiv:2205.04892v1 [cs.LG])
    Electronic health records (EHRs) provide a rich repository to track a patient's health status. EHRs seek to fully document the patient's physiological status, and include data that are high-dimensional, heterogeneous, and multimodal. The significant differences in the sampling frequencies of clinical variables can result in high missing rates and uneven time intervals between adjacent records in the multivariate clinical time-series data extracted from EHRs. Current studies using clinical time-series data for patient characterization view the patient's physiological status as a discrete process described by sporadically collected values, whereas the dynamics of a patient's physiological status are time-continuous. In addition, the recurrent neural network (RNN) models widely used for patient representation learning lack a perception of time intervals and velocity, which limits their ability to represent the physiological status of the patient. In this paper, we propose an improved gated recurrent unit (GRU), namely the time- and velocity-aware GRU (GRU-TV), for patient representation learning from clinical multivariate time-series data in a time-continuous manner. In the proposed GRU-TV, neural ordinary differential equations (ODEs) and a velocity perception mechanism are used to perceive, respectively, the time interval between records in the time-series data and the rate of change of the patient's physiological status. Experimental results on two real-world clinical EHR datasets (PhysioNet2012, MIMIC-III) show that GRU-TV achieves state-of-the-art performance on computer-aided diagnosis (CAD) tasks and is more advantageous in processing sampled data.
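    The flavor of a time-aware recurrent update can be sketched as follows; this fragment uses a GRU-D-style exponential decay of the hidden state with elapsed time, which is a simpler stand-in for GRU-TV's ODE and velocity mechanisms, and all names are assumptions of this sketch.
        import torch
        import torch.nn as nn

        class TimeAwareGRUCell(nn.Module):
            """Sketch: GRU cell whose hidden state decays with the elapsed
            time since the previous record before the standard update."""
            def __init__(self, input_size, hidden_size):
                super().__init__()
                self.cell = nn.GRUCell(input_size, hidden_size)
                self.decay = nn.Linear(1, hidden_size)

            def forward(self, x, h, dt):
                # dt: (batch, 1) elapsed time since the last observation.
                gamma = torch.exp(-torch.relu(self.decay(dt)))  # in (0, 1]
                h = gamma * h          # older state -> weaker influence
                return self.cell(x, h)

        cell = TimeAwareGRUCell(input_size=12, hidden_size=32)
        h = torch.zeros(4, 32)
        h = cell(torch.randn(4, 12), h, dt=torch.rand(4, 1))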
    Learning Combinatorial Node Labeling Algorithms. (arXiv:2106.03594v3 [cs.LG] UPDATED)
    We present a novel neural architecture to solve graph optimization problems where the solution consists of arbitrary node labels, allowing us to solve hard problems like graph coloring. We train our model using reinforcement learning, specifically policy gradients, which gives us both a greedy and a probabilistic policy. Our architecture builds on a graph attention network and uses several inductive biases to improve solution quality. Our learned deterministic heuristics for graph coloring give better solutions than classical degree-based greedy heuristics and only take seconds to apply to graphs with tens of thousands of vertices. Moreover, our probabilistic policies outperform all greedy state-of-the-art coloring baselines and a machine learning baseline. Finally, we show that our approach also generalizes to other problems by evaluating it on minimum vertex cover and outperforming two greedy heuristics.
    Accelerated functional brain aging in major depressive disorder: evidence from a large scale fMRI analysis of Chinese participants. (arXiv:2205.04871v1 [q-bio.NC])
    Major depressive disorder (MDD) is one of the most common mental health conditions that has been intensively investigated for its association with brain atrophy and mortality. Recent studies reveal that the deviation between the predicted and the chronological age can be a marker of accelerated brain aging to characterize MDD. However, current conclusions are usually drawn based on structural MRI information collected from Caucasian participants. The universality of this biomarker needs to be further validated by subjects with different ethnic/racial backgrounds and by different types of data. Here we make use of the REST-meta-MDD, a large scale resting-state fMRI dataset collected from multiple cohort participants in China. We develop a stacking machine learning model based on 1101 healthy controls, which estimates a subject's chronological age from fMRI with promising accuracy. The trained model is then applied to 1276 MDD patients from 24 sites. We observe that MDD patients exhibit a $+4.43$ years ($\text{$p$} < 0.0001$, $\text{Cohen's $d$} = 0.35$, $\text{95\% CI}:1.86 - 3.91$) higher brain-predicted age difference (brain-PAD) compared to controls. In the MDD subgroup, we observe a statistically significant $+2.09$ years ($\text{$p$} < 0.05$, $\text{Cohen's $d$} = 0.134483$) brain-PAD in antidepressant users compared to medication-free patients. The statistical relationship observed is further checked by three different machine learning algorithms. The positive brain-PAD observed in participants in China confirms the presence of accelerated brain aging in MDD patients. The utilization of functional brain connectivity for age estimation verifies existing findings from a new dimension.
    Pediatric Automatic Sleep Staging: A comparative study of state-of-the-art deep learning methods. (arXiv:2108.10211v3 [eess.SP] UPDATED)
    Background: Despite the tremendous progress recently made towards automatic sleep staging in adults, it is currently unknown if the most advanced algorithms generalize to the pediatric population, which displays distinctive characteristics in overnight polysomnography (PSG). Methods: To answer the question, in this work, we conduct a large-scale comparative study on the state-of-the-art deep learning methods for pediatric automatic sleep staging. Six different deep neural networks with diverging features are adopted to evaluate a sample of more than 1,200 children across a wide spectrum of obstructive sleep apnea (OSA) severity. Results: Our experimental results show that the individual performance of automated pediatric sleep stagers when evaluated on new subjects is equivalent to the expert-level one reported on adults. Combining the six stagers into ensemble models further boosts the staging accuracy, reaching an overall accuracy of 88.8%, a Cohen's kappa of 0.852, and a macro F1-score of 85.8%. At the same time, the ensemble models lead to reduced predictive uncertainty. The results also show that the studied algorithms and their ensembles are robust to concept drift when the training and test data were recorded seven months apart and after clinical intervention. Conclusion: However, we show that the improvements in the staging performance are not necessarily clinically significant although the ensemble models lead to more favorable clinical measures than the six standalone models. Significance: Detailed analyses further demonstrate "almost perfect" agreement between the automatic stagers to one another and their similar patterns on the staging errors, suggesting little room for improvement.
    Differentially Private Learning with Adaptive Clipping. (arXiv:1905.03871v5 [cs.LG] UPDATED)
    Existing approaches for training neural networks with user-level differential privacy (e.g., DP Federated Averaging) in federated learning (FL) settings involve bounding the contribution of each user's model update by clipping it to some constant value. However, there is no good a priori setting of the clipping norm across tasks and learning settings: the update norm distribution depends on the model architecture and loss, the amount of data on each device, the client learning rate, and possibly various other parameters. We propose a method wherein, instead of a fixed clipping norm, one clips to a value at a specified quantile of the update norm distribution, where the value at the quantile is itself estimated online, with differential privacy. The method tracks the quantile closely, uses a negligible amount of privacy budget, is compatible with other federated learning technologies such as compression and secure aggregation, and has a straightforward joint DP analysis with DP-FedAvg. Experiments demonstrate that adaptive clipping to the median update norm works well across a range of realistic federated learning tasks, sometimes outperforming even the best fixed clip chosen in hindsight, and without the need to tune any clipping hyperparameter.
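    A hedged sketch of one round of quantile-based adaptive clipping: clip each update to the current norm bound, then update the bound geometrically toward the target quantile of the update-norm distribution. The DP noise added to both the aggregate and the clipped-fraction estimate in the actual method is omitted here for clarity, and the step size is an assumption.
        import numpy as np

        def adaptive_clip_round(updates, C, gamma=0.5, eta=0.2):
            """Clip updates to norm C, then move C toward the gamma-quantile
            of the update norms via a geometric update (DP noise omitted)."""
            clipped, below = [], 0
            for u in updates:
                norm = np.linalg.norm(u)
                below += float(norm <= C)
                clipped.append(u * min(1.0, C / (norm + 1e-12)))
            b = below / len(updates)           # fraction not clipped
            C_new = C * np.exp(-eta * (b - gamma))
            return np.mean(clipped, axis=0), C_new

        C = 1.0
        for _ in range(3):
            updates = [np.random.randn(10) for _ in range(100)]
            avg, C = adaptive_clip_round(updates, C)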
    Adjusted Expected Improvement for Cumulative Regret Minimization in Noisy Bayesian Optimization. (arXiv:2205.04901v1 [cs.LG])
    The expected improvement (EI) is one of the most popular acquisition functions for Bayesian optimization (BO) and has demonstrated good empirical performance in many applications for the minimization of simple regret. However, under the evaluation metric of cumulative regret, the performance of EI may not be competitive, and its existing theoretical regret upper bound still has room for improvement. To adapt EI for better performance under cumulative regret, we introduce a novel quantity called the evaluation cost, which is compared against the acquisition function, and with this, develop the expected improvement-cost (EIC) algorithm. In each iteration of EIC, a new point with the largest acquisition function value is sampled only if that value exceeds its evaluation cost. If no point meets this criterion, the current best point is resampled. This evaluation cost quantifies the potential downside of sampling a point, which is important under the cumulative regret metric, as the objective function value in every iteration affects the performance measure. We further establish in theory a near-optimal regret upper bound of EIC for the squared-exponential covariance kernel under mild regularity conditions, and perform experiments to illustrate the improvement of EIC over several popular BO algorithms.
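    In pseudocode form, one EIC iteration reduces to the comparison described above; acq and cost are assumed callables standing in for the acquisition function and the evaluation cost.
        import numpy as np

        def eic_step(acq, cost, candidates, best_x):
            """Sketch of one EIC iteration: sample the candidate maximizing
            the acquisition only if its value exceeds its evaluation cost;
            otherwise resample the current best point."""
            values = np.array([acq(x) for x in candidates])
            i = int(np.argmax(values))
            if values[i] > cost(candidates[i]):
                return candidates[i]   # explore the new point
            return best_x              # resample the incumbent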
    White-box Testing of NLP models with Mask Neuron Coverage. (arXiv:2205.05050v1 [cs.CL])
    Recent literature has seen growing interest in using black-box strategies like CheckList for testing the behavior of NLP models. Research on white-box testing has developed a number of methods for evaluating how thoroughly the internal behavior of deep models is tested, but they are not applicable to NLP models. We propose a set of white-box testing methods that are customized for transformer-based NLP models. These include Mask Neuron Coverage (MNCOVER), which measures how thoroughly the attention layers in models are exercised during testing. We show that MNCOVER can refine test suites generated by CheckList by substantially reducing their size, by more than 60\% on average, while retaining failing tests -- thereby concentrating the fault detection power of the test suite. Further, we show how MNCOVER can be used to guide CheckList input generation, evaluate alternative NLP testing methods, and drive data augmentation to improve accuracy.
    Deep learning based Chinese text sentiment mining and stock market correlation research. (arXiv:2205.04743v1 [q-fin.CP])
    We explore how to crawl financial forum data, such as stock bars, and combine them with deep learning models for sentiment analysis. In this paper, we train the BERT model on the financial corpus and use it to predict the SZSE Component Index; through a maximum information coefficient comparison study, we find that the sentiment features obtained by applying the BERT model to the financial corpus reflect the fluctuations in the stock market and effectively help improve prediction accuracy. Meanwhile, this paper combines deep learning with financial texts to further explore the mechanism by which investor sentiment affects the stock market, which will be beneficial for national regulators and policy departments in developing more reasonable policy guidelines for maintaining the stability of the stock market.
    Control Prefixes for Parameter-Efficient Text Generation. (arXiv:2110.08329v2 [cs.CL] UPDATED)
    Prefix-tuning is a powerful lightweight technique for adapting a large pre-trained language model to a downstream application. However, it uses the same dataset-level tuned prompt for all examples in the dataset. We extend this idea and propose a dynamic method, Control Prefixes, which allows for the inclusion of conditional input-dependent information, combining the benefits of prompt tuning and controlled generation. The method incorporates attribute-level learnable representations into different layers of a pre-trained transformer, allowing for the generated text to be guided in a particular direction. We provide a systematic evaluation of the technique and apply it to five datasets from the GEM benchmark for natural language generation (NLG). Although the aim is to develop a parameter-efficient model, we show Control Prefixes can even outperform full fine-tuning methods. We present state-of-the-art results on several data-to-text datasets, including WebNLG.  ( 2 min )
    Measure and Improve Robustness in NLP Models: A Survey. (arXiv:2112.08313v2 [cs.CL] UPDATED)
    As NLP models have achieved state-of-the-art performance on benchmarks and gained wide application, it has become increasingly important to ensure the safe deployment of these models in the real world, e.g., making sure the models are robust against unseen or challenging scenarios. Despite robustness being an increasingly studied topic, it has been explored separately in applications like vision and NLP, with various definitions, evaluations, and mitigation strategies in multiple lines of research. In this paper, we aim to provide a unifying survey of how to define, measure and improve robustness in NLP. We first connect multiple definitions of robustness, then unify various lines of work on identifying robustness failures and evaluating models' robustness. Correspondingly, we present mitigation strategies that are data-driven, model-driven, and inductive-prior-based, with a more systematic view of how to effectively improve robustness in NLP models. Finally, we conclude by outlining open challenges and future directions to motivate further research in this area.  ( 2 min )
    Large Neighborhood Search based on Neural Construction Heuristics. (arXiv:2205.00772v2 [cs.LG] UPDATED)
    We propose a Large Neighborhood Search (LNS) approach utilizing a learned construction heuristic based on neural networks as repair operator to solve the vehicle routing problem with time windows (VRPTW). Our method uses graph neural networks to encode the problem and auto-regressively decodes a solution and is trained with reinforcement learning on the construction task without requiring any labels for supervision. The neural repair operator is combined with a local search routine, heuristic destruction operators and a selection procedure applied to a small population to arrive at a sophisticated solution approach. The key idea is to use the learned model to re-construct the partially destructed solution and to introduce randomness via the destruction heuristics (or the stochastic policy itself) to effectively explore a large neighborhood.  ( 2 min )
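    The surrounding LNS loop can be sketched as follows, with the learned construction policy appearing only through the repair callable; the operator set, acceptance rule, and all names are assumptions of this sketch (the paper additionally uses a local search routine and a small population).
        import random

        def large_neighborhood_search(solution, destroy_ops, repair, objective,
                                      iters=1000):
            """Sketch of the LNS loop around a learned repair operator:
            destroy part of the incumbent, re-construct it with the neural
            policy, and keep improvements."""
            best, best_cost = solution, objective(solution)
            for _ in range(iters):
                destroy = random.choice(destroy_ops)
                partial = destroy(best)        # remove some routes/customers
                candidate = repair(partial)    # auto-regressive re-construction
                cost = objective(candidate)
                if cost < best_cost:
                    best, best_cost = candidate, cost
            return best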
    Explainable Deep Learning Methods in Medical Diagnosis: A Survey. (arXiv:2205.04766v1 [eess.IV])
    The remarkable success of deep learning has prompted interest in its application to medical diagnosis. Even though state-of-the-art deep learning models have achieved human-level accuracy on the classification of different types of medical data, these models are hardly adopted in clinical workflows, mainly due to their lack of interpretability. The black-box nature of deep learning models has raised the need for devising strategies to explain the decision process of these models, leading to the creation of the topic of eXplainable Artificial Intelligence (XAI). In this context, we provide a thorough survey of XAI applied to medical diagnosis, including visual, textual, and example-based explanation methods. Moreover, this work reviews the existing medical imaging datasets and the existing metrics for evaluating the quality of the explanations. Complementary to most existing surveys, we include a performance comparison among a set of report-generation-based methods. Finally, the major challenges in applying XAI to medical imaging are also discussed.  ( 2 min )
    Are Quantum Computers Practical Yet? A Case for Feature Selection in Recommender Systems using Tensor Networks. (arXiv:2205.04490v1 [cs.IR])
    Collaborative filtering models generally perform better than content-based filtering models and do not require careful feature engineering. However, in the cold-start scenario collaborative information may be scarce or even unavailable, whereas the content information may be abundant, but also noisy and expensive to acquire. Thus, selection of particular features that improve cold-start recommendations becomes an important and non-trivial task. In the recent approach by Nembrini et al., the feature selection is driven by the correlational compatibility between collaborative and content-based models. The problem is formulated as a Quadratic Unconstrained Binary Optimization (QUBO) which, due to its NP-hard complexity, is solved using Quantum Annealing on a quantum computer provided by D-Wave. Inspired by the reported results, we contest the idea that current quantum annealers are superior for this problem and instead focus on classical algorithms. In particular, we tackle QUBO via TTOpt, a recently proposed black-box optimizer based on tensor networks and multilinear algebra. We show the computational feasibility of this method for large problems with thousands of features, and empirically demonstrate that the solutions found are comparable to the ones obtained with D-Wave across all examined datasets.  ( 2 min )
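    For reference, the QUBO in question has the standard form
        $$\min_{x \in \{0,1\}^n} \; x^\top Q x,$$
    where $x_i = 1$ indicates that content feature $i$ is selected and the matrix $Q$ encodes the compatibility scores between the collaborative and content-based models; any black-box optimizer over binary vectors, such as TTOpt, can be applied to this objective.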
    Unsupervised Belief Representation Learning with Information-Theoretic Variational Graph Auto-Encoders. (arXiv:2110.00210v5 [cs.SI] UPDATED)
    This paper develops a novel unsupervised algorithm for belief representation learning in polarized networks that (i) uncovers the latent dimensions of the underlying belief space and (ii) jointly embeds users and content items (that they interact with) into that space in a manner that facilitates a number of downstream tasks, such as stance detection, stance prediction, and ideology mapping. Inspired by total correlation in information theory, we propose the Information-Theoretic Variational Graph Auto-Encoder (InfoVGAE) that learns to project both users and content items (e.g., posts that represent user views) into an appropriate disentangled latent space. To better disentangle latent variables in that space, we develop a total correlation regularization module and a Proportional-Integral (PI) control module, and adopt a rectified Gaussian distribution to ensure orthogonality. The latent representation of users and content can then be used to quantify their ideological leaning and detect/predict their stances on issues. We evaluate the performance of the proposed InfoVGAE on three real-world datasets, of which two are collected from Twitter and one from U.S. Congress voting records. The evaluation results show that our model outperforms state-of-the-art unsupervised models, reducing user clustering errors by 10.5% and achieving 12.1% higher F1 scores for stance separation of content items. In addition, InfoVGAE produces results comparable to supervised models. We also discuss its performance on stance prediction and user ranking within ideological groups.  ( 3 min )
    Theory of Quantum Generative Learning Models with Maximum Mean Discrepancy. (arXiv:2205.04730v1 [quant-ph])
    The intrinsically probabilistic nature of quantum mechanics has motivated efforts to design quantum generative learning models (QGLMs) with computational advantages over classical ones. To date, two prototypical QGLMs are quantum circuit Born machines (QCBMs) and quantum generative adversarial networks (QGANs), which approximate the target distribution in explicit and implicit ways, respectively. Despite the empirical achievements, the fundamental theory of these models remains largely obscure. To narrow this knowledge gap, here we explore the learnability of QCBMs and QGANs from the perspective of generalization when their loss is specified to be the maximum mean discrepancy. In particular, we first analyze the generalization ability of QCBMs and identify their superiority when the quantum devices can directly access the target distribution and quantum kernels are employed. Next, we prove how the generalization error bound of QGANs depends on the employed Ansatz, the number of qudits, and the input states. This bound can be further employed to seek potential quantum advantages in Hamiltonian learning tasks. Numerical results for QGLMs approximating quantum states, a Gaussian distribution, and ground states of parameterized Hamiltonians accord with the theoretical analysis. Our work opens the avenue for quantitatively understanding the power of quantum generative learning models.  ( 2 min )
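    For reference, the loss in question is the squared maximum mean discrepancy between model samples and target samples; a minimal classical sketch with a Gaussian kernel (the paper also considers quantum kernels):

        import numpy as np

        def mmd2(X, Y, sigma=1.0):
            # Biased estimate of the squared MMD between samples X (n, d)
            # and Y (m, d) under a Gaussian kernel with bandwidth sigma.
            def k(A, B):
                d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                return np.exp(-d / (2 * sigma ** 2))
            return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()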
    Matrix and graph representations of vine copula structures. (arXiv:2205.04783v1 [stat.ML])
    Vine copulas can efficiently model a large portion of probability distributions. This paper focuses on a more thorough understanding of their structures. We build on well-known existing constructions to represent vine copulas with graphs as well as matrices. The graph representations include the regular, cherry and chordal graph sequence structures, which we show to be equivalent. Importantly, we also show that when a perfect elimination ordering of a vine structure is given, it can always be uniquely represented with a matrix. O. M. N\'apoles has shown a way to represent vine structures in a matrix, and we turn this previous approach into an algorithm, while also showing a new method for constructing such a matrix through cherry tree sequences. Lastly, we prove that these two matrix-building algorithms are equivalent if the same perfect elimination ordering is used.  ( 2 min )
    Modeling Regime Shifts in Multiple Time Series. (arXiv:2109.09692v3 [cs.LG] UPDATED)
    We investigate the problem of discovering and modeling regime shifts in an ecosystem comprising multiple time series known as co-evolving time series. Regime shifts refer to the changing behaviors exhibited by series at different time intervals. Learning these changing behaviors is a key step toward time series forecasting. While advances have been made, existing methods suffer from one or more of the following shortcomings: (1) failure to take relationships between time series into consideration for discovering regimes in multiple time series; (2) lack of an effective approach that models time-dependent behaviors exhibited by series; (3) difficulties in handling data discontinuities which may be informative. Most of the existing methods are unable to handle all of these three issues in a unified framework. This, therefore, motivates our effort to devise a principled approach for modeling interactions and time-dependency in co-evolving time series. Specifically, we model an ecosystem of multiple time series by summarizing the heavy ensemble of time series into a lighter and more meaningful structure called a \textit{mapping grid}. By using the mapping grid, our model first learns time series behavioral dependencies through a dynamic network representation, then learns the regime transition mechanism via a full time-dependent Cox regression model. The originality of our approach lies in modeling interactions between time series in regime identification and in modeling time-dependent regime transition probabilities, usually assumed to be static in existing work.  ( 2 min )
    DeepTag: A General Framework for Fiducial Marker Design and Detection. (arXiv:2105.13731v2 [cs.CV] UPDATED)
    A fiducial marker system usually consists of markers, a detection algorithm, and a coding system. The appearance of markers and the detection robustness are generally limited by the existing detection algorithms, which are hand-crafted with traditional low-level image processing techniques. Furthermore, a sophisticated coding system is required to overcome the shortcomings of both markers and detection algorithms. To improve the flexibility and robustness in various applications, we propose a general deep learning based framework, DeepTag, for fiducial marker design and detection. DeepTag not only supports detection of a wide variety of existing marker families, but also makes it possible to design new marker families with customized local patterns. Moreover, we propose an effective procedure to synthesize training data on the fly without manual annotations. Thus, DeepTag can easily adapt to existing and newly-designed marker families. To validate DeepTag and existing methods, besides existing datasets, we further collect a new large and challenging dataset where markers are placed at different view distances and angles. Experiments show that DeepTag well supports different marker families and greatly outperforms the existing methods in terms of both detection robustness and pose accuracy. Both code and dataset are available at https://herohuyongtao.github.io/research/publications/deep-tag/.  ( 2 min )
    THOR: Threshold-Based Ranking Loss for Ordinal Regression. (arXiv:2205.04864v1 [cs.LG])
    In this work, we present a regression-based algorithm for supervised classification of instances into ordinal categories. In contrast to previous methods, the decision boundaries between categories are predefined, and the algorithm learns to project the input examples onto their appropriate scores according to these predefined boundaries. This is achieved by adding a novel threshold-based pairwise loss function that aims at minimizing the regression error, which in turn minimizes the Mean Absolute Error (MAE) measure. We implemented our proposed architecture-agnostic method using a CNN framework for feature extraction. Experimental results on five real-world benchmarks demonstrate that the proposed algorithm achieves the best MAE results compared to state-of-the-art ordinal regression algorithms.  ( 2 min )
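    A rough sketch of the threshold-based idea, assuming predefined category boundaries in a tensor thresholds of length K+1 for K categories; this hinge-style form illustrates the principle and is not the paper's exact loss:

        import torch

        def threshold_loss(scores, labels, thresholds):
            # Penalize predicted scores that fall outside the predefined
            # boundaries of their true ordinal category.
            lo = thresholds[labels]      # lower boundary of the true category
            hi = thresholds[labels + 1]  # upper boundary of the true category
            return (torch.relu(lo - scores) + torch.relu(scores - hi)).mean()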
    Designing a Recurrent Neural Network to Learn a Motion Planner for High-Dimensional Inputs. (arXiv:2205.04799v1 [cs.RO])
    The use of machine learning in the self-driving industry has boosted a number of recent advancements. In particular, the use of large deep learning models in the perception and prediction stack has proved quite successful, but there is still little literature on the use of machine learning in the planning stack. The current state of the art in the planning stack often relies on fast constrained optimization or rule-based approaches. Both of these techniques fail to address a significant number of fundamental problems that would allow the vehicle to operate more like a human driver. In this paper, we attempt to design a basic deep learning system to approach this problem. Furthermore, the main underlying goal of this paper is to demonstrate the potential uses of machine learning in the planning stack for autonomous vehicles (AV) and to provide a baseline for ongoing and future research.  ( 2 min )
    Knowledge Augmented Machine Learning with Applications in Autonomous Driving: A Survey. (arXiv:2205.04712v1 [cs.LG])
    The existence of representative datasets is a prerequisite of many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcome the limitations of purely data-driven approaches, and eventually to increase the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories integration, extraction and conformity. Special attention is given to applications in the field of autonomous driving.  ( 2 min )
    Secure and Private Source Coding with Private Key and Decoder Side Information. (arXiv:2205.05068v1 [cs.IT])
    The problem of secure source coding with multiple terminals is extended by considering a remote source whose noisy measurements are the correlated random variables used for secure source reconstruction. The main additions to the problem include 1) all terminals noncausally observe a noisy measurement of the remote source; 2) a private key is available to all legitimate terminals; 3) the public communication link between the encoder and decoder is rate-limited; 4) the secrecy leakage to the eavesdropper is measured with respect to the encoder input, whereas the privacy leakage is measured with respect to the remote source. Exact rate regions are characterized for a lossy source coding problem with a private key, remote source, and decoder side information under security, privacy, communication, and distortion constraints. By replacing the distortion constraint with a reliability constraint, we obtain the exact rate region also for the lossless case. Furthermore, the lossy rate region for scalar discrete-time Gaussian sources and measurement channels is established.  ( 2 min )
    Explainable Data Imputation using Constraints. (arXiv:2205.04731v1 [cs.AI])
    Data values in a dataset can be missing or anomalous due to mishandling or human error. Analysing data with missing values can create bias and affect the inferences. Several analysis methods, such as principal component analysis or singular value decomposition, require complete data. Many approaches impute only numeric data; some do not consider dependencies among attributes, while others require human intervention and domain knowledge. We present a new algorithm for data imputation based on the values of different data types and their association constraints in the data, which are not handled by any current system. We show experimental results using different metrics comparing our algorithm with state-of-the-art imputation techniques. Our algorithm not only imputes the missing values but also generates human-readable explanations describing the significance of the attributes used for every imputation.  ( 2 min )
    Sensible AI: Re-imagining Interpretability and Explainability using Sensemaking Theory. (arXiv:2205.05057v1 [cs.HC])
    Understanding how ML models work is a prerequisite for responsibly designing, deploying, and using ML-based systems. With interpretability approaches, ML can now offer explanations for its outputs to aid human understanding. Though these approaches rely on guidelines for how humans explain things to each other, they ultimately solve for improving the artifact -- an explanation. In this paper, we propose an alternate framework for interpretability grounded in Weick's sensemaking theory, which focuses on who the explanation is intended for. Recent work has advocated for the importance of understanding stakeholders' needs -- we build on this by providing concrete properties (e.g., identity, social context, environmental cues, etc.) that shape human understanding. We use an application of sensemaking in organizations as a template for discussing design guidelines for Sensible AI, AI that factors in the nuances of human cognition when trying to explain itself.  ( 2 min )
    ALLSH: Active Learning Guided by Local Sensitivity and Hardness. (arXiv:2205.04980v1 [cs.CL])
    Active learning, which effectively collects informative unlabeled data for annotation, reduces the demand for labeled data. In this work, we propose to retrieve unlabeled samples with a local-sensitivity- and hardness-aware acquisition function. The proposed method generates data copies through local perturbations and selects data points whose predictive likelihoods diverge the most from their copies. We further empower our acquisition function by injecting the selected worst-case perturbation. Our method achieves consistent gains over commonly used active learning strategies in various classification tasks. Furthermore, we observe consistent improvements over the baselines in the study of prompt selection in prompt-based few-shot learning. These experiments demonstrate that our acquisition, guided by local sensitivity and hardness, can be effective and beneficial for many NLP tasks.  ( 2 min )
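    A minimal sketch of such a sensitivity score, with perturb as a hypothetical local perturbation (e.g., token dropout or paraphrasing); unlabeled examples with the highest scores would be selected for annotation:

        import torch
        import torch.nn.functional as F

        def sensitivity_score(model, x, perturb, n_copies=4):
            # Average KL divergence between the prediction on x and the
            # predictions on locally perturbed copies of x.
            log_p = F.log_softmax(model(x), dim=-1)
            divs = [F.kl_div(log_p, F.softmax(model(perturb(x)), dim=-1),
                             reduction="batchmean") for _ in range(n_copies)]
            return torch.stack(divs).mean()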
    Tensor-based Collaborative Filtering With Smooth Ratings Scale. (arXiv:2205.05070v1 [cs.IR])
    Conventional collaborative filtering techniques do not take into account the effect of discrepancies in users' rating perception. Some users may rarely give 5 stars to items, while others almost always assign 5 stars to the chosen item. Even if they have had experience with the same items, this systematic discrepancy in their evaluation styles will lead to systematic errors in the recommender system's ability to extract the right patterns from data. To mitigate this problem, we introduce a ratings similarity matrix which represents the dependency between different rating values at the population level. Hence, if on average correlations between ratings exist, it is possible to improve the quality of proposed recommendations by offsetting the effect of users' ratings being shifted down or up.  ( 2 min )
    Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation. (arXiv:2102.07301v2 [cs.LG] UPDATED)
    We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $\sqrt{D}$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.  ( 2 min )
    Reconstruction Enhanced Multi-View Contrastive Learning for Anomaly Detection on Attributed Networks. (arXiv:2205.04816v1 [cs.LG])
    Detecting abnormal nodes in attributed networks is of great importance in many real applications, such as financial fraud detection and cyber security. This task is challenging due to both the complex interactions between the anomalous nodes and their counterparts and their inconsistency in terms of attributes. This paper proposes a self-supervised learning framework that jointly optimizes a multi-view contrastive learning-based module and an attribute reconstruction-based module to more accurately detect anomalies on attributed networks. Specifically, two contrastive learning views are first established, which allow the model to better encode rich local and global information related to the abnormality. Motivated by the attribute consistency principle between neighboring nodes, a masked autoencoder-based reconstruction module is also introduced to identify the nodes with large reconstruction errors, which are then regarded as anomalies. Finally, the two complementary modules are integrated for more accurate detection of anomalous nodes. Extensive experiments conducted on five benchmark datasets show that our model outperforms current state-of-the-art models.  ( 2 min )
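    A sketch of the final fusion step under this design: a per-node reconstruction error from the masked autoencoder is combined with the contrastive score; the mixing weight alpha is a hypothetical free parameter, not taken from the paper:

        import numpy as np

        def anomaly_scores(contrastive, X, X_hat, alpha=0.5):
            # Fuse the contrastive-view score with the per-node attribute
            # reconstruction error (autoencoder output X_hat vs. attributes X).
            recon = np.linalg.norm(X - X_hat, axis=1)
            return alpha * np.asarray(contrastive) + (1 - alpha) * recon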
    Universal Caching. (arXiv:2205.04860v1 [cs.IT])
    In the learning literature, the performance of an online policy is commonly measured in terms of the static regret metric, which compares the cumulative loss of an online policy to that of an optimal benchmark in hindsight. In the definition of static regret, the benchmark policy remains fixed throughout the time horizon. Naturally, the resulting regret bounds become loose in non-stationary settings where fixed benchmarks often suffer from poor performance. In this paper, we investigate a stronger notion of regret minimization in the context of an online caching problem. In particular, we allow the action of the offline benchmark at any round to be decided by a finite state predictor containing arbitrarily many states. Using ideas from the universal prediction literature in information theory, we propose an efficient online caching policy with an adaptive sub-linear regret bound. To the best of our knowledge, this is the first data-dependent regret bound known for the universal caching problem. We establish this result by combining a recently-proposed online caching policy with an incremental parsing algorithm, e.g., Lempel-Ziv '78. Our methods also yield a simpler learning-theoretic proof of the improved regret bound as opposed to the more involved and problem-specific combinatorial arguments used in the earlier works.  ( 2 min )
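    For concreteness, the incremental parsing step splits the request sequence into the shortest phrases not seen before, and the growing phrase dictionary serves as the finite-state predictor; a minimal sketch of Lempel-Ziv '78 parsing:

        def lz78_phrases(seq):
            # Split seq into the shortest phrases not previously seen.
            seen, phrase, phrases = set(), "", []
            for symbol in seq:
                phrase += symbol
                if phrase not in seen:
                    seen.add(phrase)
                    phrases.append(phrase)
                    phrase = ""
            if phrase:
                phrases.append(phrase)  # possibly incomplete final phrase
            return phrases

        # lz78_phrases("aababcabcd") -> ['a', 'ab', 'abc', 'abcd']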
    Weakly-supervised segmentation of referring expressions. (arXiv:2205.04725v1 [cs.CV])
    Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions. In this work we address image segmentation from referring expressions, a problem that has so far only been addressed in a fully-supervised setting. A fully-supervised setup, however, requires pixel-wise supervision and is hard to scale given the expense of manual annotation. We therefore introduce a new task of weakly-supervised image segmentation from referring expressions and propose Text grounded semantic SEGmentation (TSEG) that learns segmentation masks directly from image-level referring expressions without pixel-level annotations. Our transformer-based method computes patch-text similarities and guides the classification objective during training with a new multi-label patch assignment mechanism. The resulting visual grounding model segments image regions corresponding to given natural language expressions. Our approach TSEG demonstrates promising results for weakly-supervised referring expression segmentation on the challenging PhraseCut and RefCOCO datasets. TSEG also shows competitive performance when evaluated in a zero-shot setting for semantic segmentation on Pascal VOC.  ( 2 min )
    Flow Completion Network: Inferring the Fluid Dynamics from Incomplete Flow Information using Graph Neural Networks. (arXiv:2205.04739v1 [physics.flu-dyn])
    This paper introduces a novel neural network -- the flow completion network (FCN) -- to infer the fluid dynamics, including the flow field and the force acting on the body, from incomplete data, based on a graph convolution attention network. The FCN is composed of several graph convolution layers and spatial attention layers. It is designed to infer the velocity field and the vortex force contribution of the flow field when combined with the vortex force map (VFM) method. Compared with other neural networks adopted in fluid dynamics, the FCN is capable of dealing with both structured and unstructured data. The performance of the proposed FCN is assessed using computational fluid dynamics (CFD) data on the flow field around a circular cylinder. The force coefficients predicted by our model are validated against those obtained directly from CFD. Moreover, it is shown that our model effectively utilizes the existing flow field information and the gradient information simultaneously, giving a better performance than traditional CNN-based and DNN-based models.  ( 2 min )
    AI training resources for GLAM: a snapshot. (arXiv:2205.04738v1 [cs.LG])
    We take a snapshot of current resources available for teaching and learning AI with a focus on the Galleries, Libraries, Archives and Museums (GLAM) community. The review was carried out in 2021 and 2022. The review provides an overview of material we identified as being relevant, offers a description of this material and makes recommendations for future work in this area.  ( 2 min )
    A spatial-temporal short-term traffic flow prediction model based on dynamical-learning graph convolution mechanism. (arXiv:2205.04762v1 [cs.LG])
    Short-term traffic flow prediction is a vital branch of the Intelligent Traffic System (ITS) and plays an important role in traffic management. Graph convolution networks (GCNs) are widely used in traffic prediction models to better deal with the graphical structure of road network data. However, the influence weights among different road sections are usually distinct in real life and hard to analyze manually. The traditional GCN mechanism, relying on a manually set adjacency matrix, is unable to learn such spatial patterns dynamically during training. To deal with this drawback, this paper proposes a novel location graph convolutional network (Location-GCN). Location-GCN solves this problem by adding a new learnable matrix to the GCN mechanism, using the absolute value of this matrix to represent the distinct influence levels among different nodes. Then, long short-term memory (LSTM) is employed in the proposed traffic prediction model. Moreover, trigonometric function encoding is used in this study to enable the short-term input sequence to convey long-term periodical information. Ultimately, the proposed model is compared with the baseline models and evaluated on two real-world traffic flow datasets. The results show our model is more accurate and robust on both datasets than other representative traffic prediction models.  ( 2 min )
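    A minimal sketch of the core Location-GCN idea: replace the fixed adjacency with a learnable matrix whose absolute value encodes the influence levels between road sections (shapes and initialization here are assumptions, not taken from the paper):

        import torch
        import torch.nn as nn

        class LocationGCNLayer(nn.Module):
            def __init__(self, n_nodes, in_dim, out_dim):
                super().__init__()
                # Learnable influence matrix replacing the manually set adjacency.
                self.influence = nn.Parameter(0.01 * torch.randn(n_nodes, n_nodes))
                self.linear = nn.Linear(in_dim, out_dim)

            def forward(self, x):              # x: (batch, n_nodes, in_dim)
                A = torch.abs(self.influence)  # influence levels are non-negative
                return torch.relu(A @ self.linear(x))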
    A Verification Framework for Certifying Learning-Based Safety-Critical Aviation Systems. (arXiv:2205.04590v1 [eess.SY])
    We present a safety verification framework for design-time and run-time assurance of learning-based components in aviation systems. Our proposed framework integrates two novel methodologies. From the design-time assurance perspective, we propose offline mixed-fidelity verification tools that incorporate knowledge from different levels of granularity in simulated environments. From the run-time assurance perspective, we propose reachability- and statistics-based online monitoring and safety guards for a learning-based decision-making model to complement the offline verification methods. This framework is designed to be loosely coupled among modules, allowing the individual modules to be developed using independent methodologies and techniques, under varying circumstances and with different tool access. The proposed framework offers feasible solutions for meeting system safety requirements at different stages throughout the system development and deployment cycle, enabling the continuous learning and assessment of the system product.  ( 2 min )
    On Causality in Domain Adaptation and Semi-Supervised Learning: an Information-Theoretic Analysis. (arXiv:2205.04641v1 [cs.LG])
    The establishment of the link between causality and unsupervised domain adaptation (UDA)/semi-supervised learning (SSL) has led to methodological advances in these learning problems in recent years. However, a formal theory that explains the role of causality in the generalization performance of UDA/SSL is still lacking. In this paper, we consider the UDA/SSL setting where we access $m$ labeled source instances and $n$ unlabeled target instances as training data under a parametric probabilistic model. We study the learning performance (e.g., excess risk) of prediction in the target domain. Specifically, we distinguish two scenarios: the learning problem is called causal learning if the feature is the cause and the label is the effect, and is called anti-causal learning otherwise. We show that in causal learning, the excess risk depends on the size of the source sample at a rate of $O(1/m)$ only if the labelling distribution between the source and target domains remains unchanged. In anti-causal learning, we show that the unlabeled data dominate the performance at a rate of typically $O(1/n)$. Our analysis is based on the notion of potential outcome random variables and information theory. These results bring out the relationship between the data sample size and the hardness of the learning problem with different causal mechanisms.  ( 2 min )
    Surreal-GAN:Semi-Supervised Representation Learning via GAN for uncovering heterogeneous disease-related imaging patterns. (arXiv:2205.04523v1 [cs.LG])
    A plethora of machine learning methods have been applied to imaging data, enabling the construction of clinically relevant imaging signatures of neurological and neuropsychiatric diseases. Oftentimes, such methods do not explicitly model the heterogeneity of disease effects, or approach it via nonlinear models that are not interpretable. Moreover, unsupervised methods may parse heterogeneity that is driven by nuisance confounding factors affecting brain structure or function, rather than heterogeneity relevant to the pathology of interest. On the other hand, semi-supervised clustering methods seek to derive a dichotomous subtype membership, ignoring the fact that disease heterogeneity extends spatially and temporally along a continuum. To address these limitations, we propose a novel method, termed Surreal-GAN (Semi-SUpeRvised ReprEsentAtion Learning via GAN). Using cross-sectional imaging data, Surreal-GAN dissects underlying disease-related heterogeneity under the principle of semi-supervised clustering (cluster mappings from normal control to patient), provides a continuous dimensional representation, and infers the disease severity of patients along each dimension at the individual level. The model first learns a transformation function from the normal control (CN) domain to the patient (PT) domain, with latent variables controlling transformation directions. An inverse mapping function, together with regularization on function continuity, pattern orthogonality, and monotonicity, is also imposed to ensure that the transformation function captures meaningful imaging patterns with clinical significance. We first validated the model through extensive semi-synthetic experiments, and then demonstrated its potential in capturing biologically plausible imaging patterns in Alzheimer's disease (AD).  ( 2 min )
    Affective Medical Estimation and Decision Making via Visualized Learning and Deep Learning. (arXiv:2205.04599v1 [cs.LG])
    Sophisticated machine learning (ML) techniques have yielded promising results, especially in medical applications, where they have been investigated for different tasks to enhance the decision-making process. Since visualization is such an effective tool for human comprehension, memorization, and judgment, we present a first-of-its-kind estimation approach, which we refer to as Visualized Learning for Machine Learning (VL4ML), that not only serves to assist physicians and clinicians in making reasoned medical decisions, but also visualizes the uncertainty that could cast doubt on the appropriate classification or prediction. As a proof of concept, and to demonstrate the general nature of this visualized estimation approach, five different case studies are examined for different types of tasks including classification, regression, and longitudinal prediction. A survey with more than 100 individuals is also conducted to assess users' feedback on this visualized estimation method. The experiments and the survey demonstrate the practical merits of VL4ML, which include: (1) appreciating clinical/medical estimations visually; (2) getting closer to patients' preferences; (3) improving doctor-patient communication; and (4) visualizing the uncertainty introduced by the black-box effect of the deployed ML algorithm. All source code is shared via a GitHub repository.  ( 2 min )
    Statistical Guarantees for Approximate Stationary Points of Simple Neural Networks. (arXiv:2205.04491v1 [cs.LG])
    Since statistical guarantees for neural networks are usually restricted to global optima of intricate objective functions, it is not clear whether these theories really explain the performances of actual outputs of neural-network pipelines. The goal of this paper is, therefore, to bring statistical theory closer to practice. We develop statistical guarantees for simple neural networks that coincide up to logarithmic factors with the global optima but apply to stationary points and the points nearby. These results support the common notion that neural networks do not necessarily need to be optimized globally from a mathematical perspective. More generally, despite being limited to simple neural networks for now, our theories make a step forward in describing the practical properties of neural networks in mathematical terms.  ( 2 min )
    Differentiable Electron Microscopy Simulation: Methods and Applications for Visualization. (arXiv:2205.04464v1 [q-bio.QM])
    We propose a new microscopy simulation system that can depict atomistic models in a micrograph visual style, similar to the results of physical electron microscopy imaging. This system is scalable, able to represent simulations of electron microscopy of tens of viral particles, and synthesizes the image faster than previous methods. On top of that, the simulator is differentiable, in both its deterministic and stochastic stages that form the signal and noise representations in the micrograph. This notable property makes it possible to solve inverse problems by means of optimization and thus allows for the generation of microscopy simulations using parameter settings estimated from real data. We demonstrate this learning capability through two applications: (1) estimating the parameters of the modulation transfer function defining the detector properties of the simulated and real micrographs, and (2) denoising the real data based on parameters trained from the simulated examples. While current simulators do not support any parameter estimation due to their forward design, we show that the results obtained using estimated parameters are very similar to the results of real micrographs. Additionally, we evaluate the denoising capabilities of our approach and show that the results improve over state-of-the-art methods. Denoised micrographs exhibit less noise in the tilt-series tomography reconstructions, ultimately reducing the visual dominance of noise in direct volume rendering of microscopy tomograms.  ( 2 min )
    Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures. (arXiv:2205.04713v1 [cs.LG])
    With the advent of ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not take into account the more complex and tiered computing infrastructure that includes edge devices, local hubs, edge datacenters, and cloud datacenters. On the other hand, recent machine learning efforts have provided viable solutions for model compression, pruning and quantization for heterogeneous environments; for a machine learning model, we may now easily find or even generate a series of models with different tradeoffs between accuracy and efficiency. We design and implement JellyBean, a framework for serving and optimizing machine learning inference workflows on heterogeneous infrastructures. Given service-level objectives (e.g., throughput, accuracy), JellyBean automatically selects the most cost-efficient models that meet the accuracy target and decides how to deploy them across different tiers of infrastructure. Evaluations show that JellyBean reduces the total serving cost of visual question answering by up to 58%, and of vehicle tracking from the NVIDIA AI City Challenge by up to 36%, compared with state-of-the-art model selection and worker assignment solutions. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) by up to 5x in serving costs.  ( 2 min )
    DNS based In-Browser Cryptojacking Detection. (arXiv:2205.04685v1 [cs.CR])
    The metadata aspect of Domain Names (DNs) enables us to perform a behavioral study of DNs and detect whether a DN is involved in in-browser cryptojacking. Thus, we are motivated to study different temporal and behavioral aspects of DNs involved in cryptojacking. We use temporal features such as query frequency and query burst, graph-based features such as degree and diameter, and non-temporal features such as string-based ones to detect whether a DN is suspected of being involved in in-browser cryptojacking. We then use these features to train Machine Learning (ML) algorithms over different temporal granularities, such as 2-hour datasets and the complete dataset. Our results show that the Decision Tree classifier performs best, with 59.5% recall on cryptojacked DNs, while for unsupervised learning, K-Means with K=2 performs best. Similarity analysis of the features reveals minimal divergence between the cryptojacking DNs and other already known malicious DNs. It also reveals the need for improvements in the feature sets of state-of-the-art methods to improve their accuracy in detecting in-browser cryptojacking. As an added analysis, our signature-based approach identifies that none of the Indian Government websites were involved in cryptojacking during October-December 2021. However, based on resource utilization, we identify 10 DNs with different properties than the others.  ( 2 min )
    Stabilized Doubly Robust Learning for Recommendation on Data Missing Not at Random. (arXiv:2205.04701v1 [cs.LG])
    In recommender systems, users always choose their favorite items to rate, which results in data missing not at random and poses a great challenge for unbiased evaluation and learning of prediction models. Currently, the doubly robust (DR) method and its variants have been widely studied and demonstrate superior performance. However, we show that DR methods are unstable under extremely small propensities and rely on extrapolation, resulting in sub-optimal performance. In this paper, we propose a stabilized doubly robust (SDR) estimator to address the above limitations while retaining double robustness. Theoretical analysis shows that SDR has bounded bias, variance, and generalization error under inaccurate imputed errors and arbitrarily small propensities. In addition, we propose a novel learning approach for SDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions. Extensive experiments show that our approach significantly outperforms existing methods.  ( 2 min )
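    For context, the vanilla DR estimator that SDR stabilizes averages imputed errors corrected by inverse-propensity-weighted residuals on the observed entries; a minimal sketch, with propensity clipping as a crude stand-in for the paper's stabilization:

        import numpy as np

        def dr_estimate(e_hat, e_obs, observed, propensity, clip=1e-3):
            # Doubly robust estimate of the average prediction error over all
            # user-item pairs: e_hat holds imputed errors, e_obs observed errors
            # (zero where unobserved), observed is a 0/1 mask.
            p = np.clip(propensity, clip, 1.0)  # guard against tiny propensities
            return float(np.mean(e_hat + observed * (e_obs - e_hat) / p))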
    Improving genetic risk prediction across diverse population by disentangling ancestry representations. (arXiv:2205.04673v1 [cs.LG])
    Risk prediction models using genetic data have seen increasing traction in genomics. However, most of the polygenic risk models were developed using data from participants with similar (mostly European) ancestry. This can lead to biases in the risk predictors, resulting in poor generalization when applied to minority populations and admixed individuals such as African Americans. To address this bias, which is largely due to the prediction models being confounded by the underlying population structure, we propose a novel deep-learning framework that leverages data from diverse populations and disentangles ancestry from the phenotype-relevant information in its representation. The ancestry-disentangled representation can be used to build risk predictors that perform better across minority populations. We applied the proposed method to the analysis of Alzheimer's disease genetics. Compared with standard linear and nonlinear risk prediction methods, the proposed method substantially improves risk prediction in minority populations, particularly for admixed individuals.  ( 2 min )
    Real-time Forecasting of Time Series in Financial Markets Using Sequentially Trained Many-to-one LSTMs. (arXiv:2205.04678v1 [cs.LG])
    Financial markets are highly complex and volatile; thus, learning about such markets for the sake of making predictions is vital for making early alerts about crashes and subsequent recoveries. People have been using learning tools from diverse fields such as financial mathematics and machine learning in the attempt to make trustworthy predictions on such markets. However, the accuracy of such techniques was not adequate until artificial neural network (ANN) frameworks were developed. Moreover, making accurate real-time predictions of financial time series is highly sensitive to the ANN architecture in use and the procedure for training it. Long short-term memory (LSTM) is a member of the recurrent neural network family which has been widely utilized for time series predictions. Specifically, we train two LSTMs with a known length, say $T$ time steps, of previous data and predict only one time step ahead. At each iteration, while one LSTM is employed to find the best number of epochs, the second LSTM is trained only for the best number of epochs to make predictions. We treat the current prediction as part of the training set for the next prediction and retrain the same LSTM. While classic training schemes accumulate more error the further the predictions are from the training period, our approach is capable of maintaining superior accuracy as training continues through the testing period. The forecasting accuracy of our approach is validated using three time series from each of three diverse financial markets: stock, cryptocurrency, and commodity. The results are compared with those of an extended Kalman filter, an autoregressive model, and an autoregressive integrated moving average model.  ( 2 min )
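    A minimal sketch of this walk-forward scheme; make_lstm() is a hypothetical factory returning a fresh Keras-style many-to-one model, and the epoch search performed by the first LSTM is elided:

        import numpy as np

        def walk_forward(make_lstm, series, T, epochs, horizon):
            history = list(series)
            preds = []
            for _ in range(horizon):
                # Train a fresh model on all length-T windows of the history.
                X = np.array([history[i:i + T] for i in range(len(history) - T)])
                y = np.array(history[T:])
                model = make_lstm()
                model.fit(X[..., None], y, epochs=epochs, verbose=0)
                window = np.array(history[-T:])[None, :, None]
                yhat = float(model.predict(window, verbose=0).ravel()[0])
                preds.append(yhat)
                history.append(yhat)  # the prediction joins the training data
            return preds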
    SuMe: A Dataset Towards Summarizing Biomedical Mechanisms. (arXiv:2205.04652v1 [cs.CL])
    Can language models read biomedical texts and explain the biomedical mechanisms discussed? In this work we introduce a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms behind how one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences that present relevant supporting statements regarding such relationships, associated experimental evidence, and a concluding sentence that summarizes the mechanism underlying the relationship. We leverage this structure and create a summarization task, where the input is a collection of sentences and the main entities in an abstract, and the output includes the relationship and a sentence that summarizes the mechanism. Using a small amount of manually labeled mechanism sentences, we train a mechanism sentence classifier to filter a large biomedical abstract collection and create a summarization dataset with 22k instances. We also introduce conclusion sentence generation as a pretraining task with 611k instances. We benchmark the performance of large bio-domain language models. We find that while the pretraining task helps improve performance, the best model produces acceptable mechanism outputs in only 32% of the instances, which shows that the task presents significant challenges in biomedical language understanding and summarization.  ( 2 min )
    On some studies of Fraud Detection Pipeline and related issues from the scope of Ensemble Learning and Graph-based Learning. (arXiv:2205.04626v1 [cs.LG])
    The UK anti-fraud charity Fraud Advisory Panel (FAP), in their review of 2016, estimates the business costs of fraud at 144 billion, and its individual counterpart at 9.7 billion. Banking, insurance, manufacturing, and government are the industries most commonly affected by fraud activities. Designing an efficient fraud detection system could avoid these losses; however, building such a system is challenging due to many difficult problems, e.g., imbalanced data and computing costs. Over the last three decades there has been a variety of research on fraud detection, but no agreement on the best approach to building a fraud detection system. In this thesis, we aim to answer questions such as: i) how to build a simplified and effective fraud detection system that is not only easy to implement but also provides reliable results (our proposed Fraud Detection Pipeline is a potential backbone of such a system and is easy to extend or upgrade); ii) when to update the models in our system (while keeping the accuracy stable) in order to reduce the cost of the updating process; iii) how to deal with extreme imbalance in big-data classification problems such as fraud detection, which lies at the intersection of two difficult problems; and iv) how to apply graph-based semi-supervised learning to detect fraudulent transactions.  ( 2 min )
    Risk Aversion In Learning Algorithms and an Application To Recommendation Systems. (arXiv:2205.04619v1 [cs.LG])
    Consider a bandit learning environment. We demonstrate that popular learning algorithms such as Upper Confidence Bound (UCB) and $\varepsilon$-Greedy exhibit risk aversion: when presented with two arms of the same expectation but different variance, the algorithms tend not to choose the riskier, i.e. higher variance, arm. We prove that $\varepsilon$-Greedy chooses the risky arm with probability tending to $0$ when faced with a deterministic and a Rademacher-distributed arm. We show experimentally that UCB also shows risk-averse behavior, and that risk aversion is present persistently in early rounds of learning even if the riskier arm has a slightly higher expectation. We calibrate our model to a recommendation system and show that algorithmic risk aversion can decrease consumer surplus and increase homogeneity. We discuss several extensions to other bandit algorithms and reinforcement learning, and investigate the impacts of algorithmic risk aversion for decision theory.  ( 2 min )
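    The deterministic-versus-Rademacher experiment is easy to reproduce; a minimal simulation sketch (the tie-breaking rule and horizon are arbitrary choices):

        import random

        def risky_pull_rate(T=10000, eps=0.1):
            # Arm 0 pays 0 deterministically; arm 1 pays +1/-1 (Rademacher).
            # Both arms have mean 0; track how often eps-greedy picks arm 1.
            means, counts, risky = [0.0, 0.0], [0, 0], 0
            for _ in range(T):
                explore = random.random() < eps
                arm = random.randrange(2) if explore else int(means[1] > means[0])
                reward = 0.0 if arm == 0 else random.choice([-1.0, 1.0])
                counts[arm] += 1
                means[arm] += (reward - means[arm]) / counts[arm]
                risky += (arm == 1)
            return risky / T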
    Image2Gif: Generating Continuous Realistic Animations with Warping NODEs. (arXiv:2205.04519v1 [cs.CV])
    Generating smooth animations from a limited number of sequential observations has a number of applications in vision. For example, it can be used to increase the number of frames per second, or to generate a new trajectory based only on the first and last frames, e.g., the motion of facial expressions. Despite the discreteness of the observed data (frames), the problem of generating a new trajectory is a continuous problem. In addition, to be perceptually realistic, the domain of an image should not change drastically through the trajectory of changes. In this paper, we propose a new framework, Warping Neural ODE, for generating a smooth animation (video frame interpolation) in a continuous manner, given two ("farther apart") frames denoting the start and the end of the animation. The key feature of our framework is utilizing the continuous spatial transformation of the image based on the vector field, derived from a system of differential equations. This allows us to achieve the smoothness and the realism of an animation with infinitely small time steps between the frames. We show the application of our work to generating an animation given two frames, in different training settings, including Generative Adversarial Network (GAN) and with $L_2$ loss.  ( 2 min )
    How Does Frequency Bias Affect the Robustness of Neural Image Classifiers against Common Corruption and Adversarial Perturbations?. (arXiv:2205.04533v1 [cs.LG])
    Model robustness is vital for the reliable deployment of machine learning models in real-world applications. Recent studies have shown that data augmentation can result in models over-relying on features in the low-frequency domain, sacrificing performance against low-frequency corruptions and highlighting a connection between frequency and robustness. Here, we take one step further to more directly study the frequency bias of a model through the lens of its Jacobians and its implication for model robustness. To achieve this, we propose Jacobian frequency regularization for models' Jacobians to have a larger ratio of low-frequency components. Through experiments on four image datasets, we show that biasing classifiers towards low (high)-frequency components can bring performance gains against high (low)-frequency corruption and adversarial perturbation, albeit with a tradeoff in performance for low (high)-frequency corruption. Our approach elucidates a more direct connection between the frequency bias and robustness of deep learning models.  ( 2 min )
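    A rough sketch of one way to realize such a regularizer: measure how much of the spectral mass of the input gradient (a one-output proxy for the full Jacobian) falls outside a low-frequency mask; the scalarization and mask are assumptions, not the paper's exact recipe:

        import torch

        def jacobian_freq_penalty(model, x, low_freq_mask):
            # Fraction of the input-gradient spectrum outside the low-frequency
            # region; minimizing it biases the model toward low frequencies.
            x = x.clone().requires_grad_(True)
            out = model(x).logsumexp(dim=-1).sum()  # scalar proxy for the outputs
            grad, = torch.autograd.grad(out, x, create_graph=True)
            spectrum = torch.fft.fft2(grad).abs()
            return 1.0 - (spectrum * low_freq_mask).sum() / spectrum.sum()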
    Nightly Automobile Claims Prediction from Telematics-Derived Features: A Multilevel Approach. (arXiv:2205.04616v1 [cs.LG])
    In recent years it has become possible to collect GPS data from drivers and to incorporate this data into automobile insurance pricing for the driver. This data is continuously collected and processed nightly into metadata consisting of mileage and time summaries of each discrete trip taken, and a set of behavioral scores describing attributes of the trip (e.g., driver fatigue or driver distraction). We examine whether this data can be used to identify periods of increased risk by successfully classifying trips that occur immediately before a trip in which there is an incident leading to a claim for that driver. Identification of periods of increased risk for a driver is valuable because it creates an opportunity for intervention and, potentially, avoidance of a claim. We examine the metadata for each trip a driver takes and train a classifier to predict whether \textit{the following trip} is one in which a claim occurs for that driver. By achieving an area under the receiver operating characteristic curve above 0.6, we show that it is possible to predict claims in advance. Additionally, we compare the predictive power, as measured by the area under the receiver operating characteristic curve, of XGBoost classifiers trained to predict whether a driver will have a claim using exposure features, such as driven miles, with those trained using behavioral features, such as a computed speed score.  ( 2 min )
    A Song of (Dis)agreement: Evaluating the Evaluation of Explainable Artificial Intelligence in Natural Language Processing. (arXiv:2205.04559v1 [cs.CL])
    There has been significant debate in the NLP community about whether or not attention weights can be used as an explanation - a mechanism for interpreting how important each input token is for a particular prediction. The validity of "attention as explanation" has so far been evaluated by computing the rank correlation between attention-based explanations and existing feature attribution explanations using LSTM-based models. In our work, we (i) compare the rank correlation between five more recent feature attribution methods and two attention-based methods, on two types of NLP tasks, and (ii) extend this analysis to also include transformer-based models. We find that attention-based explanations do not correlate strongly with any recent feature attribution methods, regardless of the model or task. Furthermore, we find that none of the tested explanations correlate strongly with one another for the transformer-based model, leading us to question the underlying assumption that we should measure the validity of attention-based explanations based on how well they correlate with existing feature attribution explanation methods. After conducting experiments on five datasets using two different models, we argue that the community should stop using rank correlation as an evaluation metric for attention-based explanations. We suggest that researchers and practitioners should instead test various explanation methods and employ a human-in-the-loop process to determine if the explanations align with human intuition for the particular use case at hand.  ( 2 min )
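    The evaluation practice under scrutiny reduces to a per-example Spearman correlation between the importance rankings produced by two explanation methods; a minimal sketch:

        from scipy.stats import spearmanr

        def mean_rank_agreement(attr_a, attr_b):
            # attr_a, attr_b: per-example lists of token importance scores
            # produced by two explanation methods for the same inputs.
            rhos = [spearmanr(a, b).correlation for a, b in zip(attr_a, attr_b)]
            return sum(rhos) / len(rhos)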
    Sentence-level Privacy for Document Embeddings. (arXiv:2205.04605v1 [cs.LG])
    User language data can contain highly sensitive personal content. As such, it is imperative to offer users a strong and interpretable privacy guarantee when learning from their data. In this work, we propose SentDP: pure local differential privacy at the sentence level for a single user document. We propose a novel technique, DeepCandidate, that combines concepts from robust statistics and language modeling to produce high-dimensional, general-purpose $\epsilon$-SentDP document embeddings. This guarantees that any single sentence in a document can be substituted with any other sentence while keeping the embedding $\epsilon$-indistinguishable. Our experiments indicate that these private document embeddings are useful for downstream tasks like sentiment analysis and topic classification and even outperform baseline methods with weaker guarantees like word-level Metric DP.  ( 2 min )
    Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to Infer Hardware Performances. (arXiv:2205.04586v1 [cs.LG])
    Calculating the most efficient schedule of work in a neural network compiler is a difficult task. There are many parameters to be accounted for that can positively or adversely affect that schedule depending on their configuration: how work is shared between distributed targets, the subdivision of tensors to fit in memory, toggling the enablement of optimizations, etc. Traditionally, neural network compilers determine how to set these values by building a graph of choices and choosing the path with minimal 'cost'. These choices and their corresponding costs are usually determined by an algorithm crafted by engineers with deep knowledge of the target platform. However, when the number of options available to a compiler is large, it is very difficult to ensure that these models consistently produce an optimal schedule for all scenarios, whilst still completing compilation in an acceptable timeframe. This paper presents 'VPUNN' -- a neural network-based cost model trained on low-level task profiling that consistently outperforms the state-of-the-art cost modeling in Intel's line of VPU processors.  ( 2 min )
    KEMP: Keyframe-Based Hierarchical End-to-End Deep Model for Long-Term Trajectory Prediction. (arXiv:2205.04624v1 [cs.CV])
    Predicting future trajectories of road agents is a critical task for autonomous driving. Recent goal-based trajectory prediction methods, such as DenseTNT and PECNet, have shown good performance on prediction tasks on public datasets. However, they usually require complicated goal-selection algorithms and optimization. In this work, we propose KEMP, a hierarchical end-to-end deep learning framework for trajectory prediction. At the core of our framework is keyframe-based trajectory prediction, where keyframes are representative states that trace out the general direction of the trajectory. KEMP first predicts keyframes conditioned on the road context, and then fills in intermediate states conditioned on the keyframes and the road context. Under our general framework, goal-conditioned methods are special cases in which the number of keyframes equals one. Unlike goal-conditioned methods, our keyframe predictor is learned automatically and does not require hand-crafted goal-selection algorithms. We evaluate our model on public benchmarks and our model ranked 1st on the Waymo Open Motion Dataset Leaderboard (as of September 1, 2021).  ( 2 min )
    Calibrating for Class Weights by Modeling Machine Learning. (arXiv:2205.04613v1 [cs.LG])
    A much studied issue is the extent to which the confidence scores provided by machine learning algorithms are calibrated to ground truth probabilities. Our starting point is that calibration is seemingly incompatible with class weighting, a technique often employed when one class is less common (class imbalance) or with the hope of achieving some external objective (cost-sensitive learning). We provide a model-based explanation for this incompatibility and use our anthropomorphic model to generate a simple method of recovering likelihoods from an algorithm that is miscalibrated due to class weighting. We validate this approach in the binary pneumonia detection task of Rajpurkar, Irvin, Zhu, et al. (2017).  ( 2 min )
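    The textbook prior-shift correction illustrates the kind of recovery described: if the positive class was up-weighted by a factor w during training, a reported score s can be mapped back to an approximate unweighted probability (a sketch consistent with, but not necessarily identical to, the paper's method):

        def unweight_score(s, w):
            # Invert s = w*p / (w*p + (1 - p)), the posterior a model tends
            # to learn when the positive class is up-weighted by factor w.
            return s / (s + w * (1.0 - s))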
    Towards Intersectionality in Machine Learning: Including More Identities, Handling Underrepresentation, and Performing Evaluation. (arXiv:2205.04610v1 [cs.LG])
    Research in machine learning fairness has historically considered a single binary demographic attribute; however, the reality is of course far more complicated. In this work, we grapple with questions that arise along three stages of the machine learning pipeline when incorporating intersectionality as multiple demographic attributes: (1) which demographic attributes to include as dataset labels, (2) how to handle the progressively smaller size of subgroups during model training, and (3) how to move beyond existing evaluation metrics when benchmarking model fairness for more subgroups. For each question, we provide thorough empirical evaluation on tabular datasets derived from the US Census, and present constructive recommendations for the machine learning community. First, we advocate for supplementing domain knowledge with empirical validation when choosing which demographic attribute labels to train on, while always evaluating on the full set of demographic attributes. Second, we warn against using data imbalance techniques without considering their normative implications and suggest an alternative using the structure in the data. Third, we introduce new evaluation metrics which are more appropriate for the intersectional setting. Overall, we provide substantive suggestions on three necessary (albeit not sufficient!) considerations when incorporating intersectionality into machine learning.  ( 2 min )
    Rethinking Fairness: An Interdisciplinary Survey of Critiques of Hegemonic ML Fairness Approaches. (arXiv:2205.04460v1 [cs.LG])
    This survey article assesses and compares existing critiques of current fairness-enhancing technical interventions into machine learning (ML) that draw from a range of non-computing disciplines, including philosophy, feminist studies, critical race and ethnic studies, legal studies, anthropology, and science and technology studies. It bridges epistemic divides in order to offer an interdisciplinary understanding of the possibilities and limits of hegemonic computational approaches to ML fairness for producing just outcomes for society's most marginalized. The article is organized according to nine major themes of critique wherein these different fields intersect: 1) how "fairness" in AI fairness research gets defined; 2) how problems for AI systems to address get formulated; 3) the impacts of abstraction on how AI tools function and its propensity to lead to technological solutionism; 4) how racial classification operates within AI fairness research; 5) the use of AI fairness measures to avoid regulation and engage in ethics washing; 6) an absence of participatory design and democratic deliberation in AI fairness considerations; 7) data collection practices that entrench "bias," are non-consensual, and lack transparency; 8) the predatory inclusion of marginalized groups into AI systems; and 9) a lack of engagement with AI's long-term social and ethical outcomes. Drawing from these critiques, the article concludes by imagining future ML fairness research directions that actively disrupt entrenched power dynamics and structural injustices in society.  ( 2 min )
    A Probabilistic Generative Model of Free Categories. (arXiv:2205.04545v1 [cs.AI])
    Applied category theory has recently developed libraries for computing with morphisms in interesting categories, while machine learning has developed ways of learning programs in interesting languages. Taking the analogy between categories and languages seriously, this paper defines a probabilistic generative model of morphisms in free monoidal categories over domain-specific generating objects and morphisms. The paper shows how acyclic directed wiring diagrams can model specifications for morphisms, which the model can use to generate morphisms. Amortized variational inference in the generative model then enables learning of parameters (by maximum likelihood) and inference of latent variables (by Bayesian inversion). A concrete experiment shows that the free category prior achieves competitive reconstruction performance on the Omniglot dataset.  ( 2 min )
    Towards a multi-stakeholder value-based assessment framework for algorithmic systems. (arXiv:2205.04525v1 [cs.LG])
    In an effort to regulate Machine Learning (ML)-driven systems, current auditing processes mostly focus on detecting harmful algorithmic biases. While these strategies have proven to be impactful, some values outlined in documents dealing with ethics in ML-driven systems are still underrepresented in auditing processes. Such unaddressed values mainly deal with contextual factors that cannot be easily quantified. In this paper, we develop a value-based assessment framework that is not limited to bias auditing and that covers prominent ethical principles for algorithmic systems. Our framework presents a circular arrangement of values with two bipolar dimensions that make common motivations and potential tensions explicit. In order to operationalize these high-level principles, values are then broken down into specific criteria and their manifestations. However, some of these value-specific criteria are mutually exclusive and require negotiation. As opposed to some other auditing frameworks that merely rely on ML researchers' and practitioners' input, we argue that it is necessary to include stakeholders that present diverse standpoints to systematically negotiate and consolidate value and criteria tensions. To that end, we map stakeholders with different insight needs, and assign tailored means for communicating value manifestations to them. We, therefore, contribute to current ML auditing practices with an assessment framework that visualizes closeness and tensions between values and we give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.  ( 2 min )
    Selectively Contextual Bandits. (arXiv:2205.04528v1 [cs.LG])
    Contextual bandits are widely used in industrial personalization systems. These online learning frameworks learn a treatment assignment policy in the presence of treatment effects that vary with the observed contextual features of the users. While personalization creates a rich user experience that reflects individual interests, there are benefits of a shared experience across a community that enable participation in the zeitgeist. Such benefits are emergent through network effects and are not captured in the regret metrics typically employed in evaluating bandits. To balance these needs, we propose a new online learning algorithm that preserves the benefits of personalization while increasing the commonality in treatments across users. Our approach selectively interpolates between a contextual bandit algorithm and a context-free multi-armed bandit, and leverages the contextual information for a treatment decision only if it promises significant gains. Apart from helping users of personalization systems balance their experience between the individualized and the shared, simplifying the treatment assignment policy by making it selectively reliant on the context can help improve the rate of learning in some cases. We evaluate our approach in a classification setting using public datasets and show the benefits of the hybrid policy.  ( 2 min )
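    The selective rule can be sketched as follows; the reward estimates, the threshold tau, and the decision logic here are illustrative assumptions rather than the paper's exact algorithm:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def select_action(ctx_scores, global_scores, tau):
        """Use the contextual policy only when its predicted gain over the
        context-free choice exceeds tau (a hedged sketch of the idea).

        ctx_scores: per-arm reward estimates given this user's context.
        global_scores: per-arm reward estimates pooled over all users.
        """
        a_ctx = int(np.argmax(ctx_scores))
        a_glob = int(np.argmax(global_scores))
        # Predicted gain, under the contextual model, from personalizing.
        gain = ctx_scores[a_ctx] - ctx_scores[a_glob]
        return a_ctx if gain > tau else a_glob

    ctx = rng.normal(size=5)
    glob = rng.normal(size=5)
    print(select_action(ctx, glob, tau=0.5))
    ```

    A larger tau pushes more users onto the shared, context-free treatment, trading personalization for commonality.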
    PinnerFormer: Sequence Modeling for User Representation at Pinterest. (arXiv:2205.04507v1 [cs.LG])
    Sequential models have become increasingly popular in powering personalized recommendation systems over the past several years. These approaches traditionally model a user's actions on a website as a sequence to predict the user's next action. While theoretically simplistic, these models are quite challenging to deploy in production, commonly requiring streaming infrastructure to reflect the latest user activity and potentially managing mutable data for encoding a user's hidden state. Here we introduce PinnerFormer, a user representation trained to predict a user's future long-term engagement using a sequential model of a user's recent actions. Unlike prior approaches, we adapt our modeling to a batch infrastructure via our new dense all-action loss, modeling long-term future actions instead of next action prediction. We show that by doing so, we significantly close the gap between batch user embeddings that are generated once a day and real-time user embeddings generated whenever a user takes an action. We describe our design decisions via extensive offline experimentation and ablations and validate the efficacy of our approach in A/B experiments showing substantial improvements in Pinterest's user retention and engagement when comparing PinnerFormer against our previous user representation. PinnerFormer is deployed in production as of Fall 2021.  ( 2 min )
    Insights into the origin of halo mass profiles from machine learning. (arXiv:2205.04474v1 [astro-ph.CO])
    The mass distribution of dark matter haloes is the result of the hierarchical growth of initial density perturbations through mass accretion and mergers. We use an interpretable machine-learning framework to provide physical insights into the origin of the spherically-averaged mass profile of dark matter haloes. We train a gradient-boosted-trees algorithm to predict the final mass profiles of cluster-sized haloes, and measure the importance of the different inputs provided to the algorithm. We find two primary scales in the initial conditions (ICs) that impact the final mass profile: the density at approximately the scale of the haloes' Lagrangian patch $R_L$ ($R\sim 0.7\, R_L$) and that in the large-scale environment ($R\sim 1.7~R_L$). The model also identifies three primary time-scales in the halo assembly history that affect the final profile: (i) the formation time of the virialized, collapsed material inside the halo, (ii) the dynamical time, which captures the dynamically unrelaxed, infalling component of the halo over its first orbit, (iii) a third, most recent time-scale, which captures the impact on the outer profile of recent massive merger events. While the inner profile retains memory of the ICs, this information alone is insufficient to yield accurate predictions for the outer profile. As we add information about the haloes' mass accretion history, we find a significant improvement in the predicted profiles at all radii. Our machine-learning framework provides novel insights into the role of the ICs and the mass assembly history in determining the final mass profile of cluster-sized haloes.  ( 2 min )
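    A hedged sketch of the analysis pipeline: fit a gradient-boosted-trees regressor on IC and assembly-history features, then read off feature importances. The feature names and the toy target below are hypothetical stand-ins, not the paper's data:

    ```python
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(1)
    # Stand-ins for the paper's inputs: IC densities at two smoothing scales
    # plus assembly time-scales (all names here are hypothetical).
    names = ["delta_0.7RL", "delta_1.7RL", "t_form", "t_dyn", "t_merger", "spin"]
    X = rng.normal(size=(500, 6))
    # Toy "profile amplitude" target driven by two of the features.
    y = X[:, 1] + 0.5 * X[:, 4] + 0.1 * rng.normal(size=500)

    model = GradientBoostingRegressor(n_estimators=200, max_depth=3).fit(X, y)
    for name, imp in sorted(zip(names, model.feature_importances_),
                            key=lambda t: -t[1]):
        print(f"{name:12s} {imp:.3f}")
    ```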
    OTFPF: Optimal Transport-Based Feature Pyramid Fusion Network for Brain Age Estimation with 3D Overlapped ConvNeXt. (arXiv:2205.04684v1 [cs.CV])
    The chronological age of a healthy brain can be predicted by deep neural networks from T1-weighted magnetic resonance images (T1 MRIs), and the predicted brain age could serve as an effective biomarker for detecting aging-related diseases or disorders. In this paper, we propose an end-to-end neural network architecture, referred to as the optimal transport based feature pyramid fusion (OTFPF) network, for brain age estimation with T1 MRIs. The OTFPF consists of three types of modules: the Optimal Transport based Feature Pyramid Fusion (OTFPF) module, the 3D overlapped ConvNeXt (3D OL-ConvNeXt) module and the fusion module. These modules strengthen the OTFPF network's understanding of each brain's semi-multimodal and multi-level feature pyramid information, and significantly improve its estimation performance. Compared with recent state-of-the-art models, the proposed OTFPF converges faster and performs better. Experiments with 11,728 MRIs aged 3-97 years show that the OTFPF network provides accurate brain age estimation, yielding a mean absolute error (MAE) of 2.097, a Pearson's correlation coefficient (PCC) of 0.993 and a Spearman's rank correlation coefficient (SRCC) of 0.989 between the estimated and chronological ages. Extensive quantitative and ablation experiments demonstrate the superiority and rationality of the OTFPF network. The code and implementation details will be released on GitHub: https://github.com/ZJU-Brain/OTFPF after the final decision.  ( 2 min )
    Variational Inference MPC using Normalizing Flows and Out-of-Distribution Projection. (arXiv:2205.04667v1 [cs.RO])
    We propose a Model Predictive Control (MPC) method for collision-free navigation that uses amortized variational inference to approximate the distribution of optimal control sequences by training a normalizing flow conditioned on the start, goal and environment. This representation allows us to learn a distribution that accounts for both the dynamics of the robot and complex obstacle geometries. We can then sample from this distribution to produce control sequences which are likely to be both goal-directed and collision-free as part of our proposed FlowMPPI sampling-based MPC method. However, when deploying this method, the robot may encounter an out-of-distribution (OOD) environment, i.e. one which is radically different from those used in training. In such cases, the learned flow cannot be trusted to produce low-cost control sequences. To generalize our method to OOD environments we also present an approach that performs projection on the representation of the environment as part of the MPC process. This projection changes the environment representation to be more in-distribution while also optimizing trajectory quality in the true environment. Our simulation results on a 2D double-integrator and a 3D 12DoF underactuated quadrotor suggest that FlowMPPI with projection outperforms state-of-the-art MPC baselines on both in-distribution and OOD environments, including OOD environments generated from real-world data.  ( 2 min )
    Crypto Pump and Dump via Deep Learning Techniques. (arXiv:2205.04646v1 [cs.LG])
    Despite the fact that cryptocurrencies themselves have experienced an astonishing rate of adoption over the last decade, cryptocurrency fraud detection is a heavily under-researched problem area. Of all fraudulent activity regarding cryptocurrencies, pump and dump schemes are some of the most common. Though some studies have been done on these kinds of scams in the stock market, the lack of labelled stock data and the volatility unique to the cryptocurrency space constrain the applicability of stock market studies to this problem domain. Furthermore, the only work done in this space thus far has been either statistical in nature or concerned with classical machine learning models such as random forests. We propose the novel application of two existing neural network architectures to this problem domain and show that deep learning solutions can significantly outperform all other existing pump and dump detection methods for cryptocurrencies.  ( 2 min )
    A 14uJ/Decision Keyword Spotting Accelerator with In-SRAM-Computing and On Chip Learning for Customization. (arXiv:2205.04665v1 [cs.AR])
    Keyword spotting has gained popularity as a natural way to interact with consumer devices in recent years. However, because of its always-on nature and the variety of speech, it necessitates a low-power design as well as user customization. This paper describes a low-power, energy-efficient keyword spotting accelerator with SRAM-based in-memory computing (IMC) and on-chip learning for user customization. However, IMC is constrained by macro size, limited precision, and non-ideal effects. To address these issues, this paper proposes bias compensation and fine-tuning using an IMC-aware model design. Furthermore, because learning on low-precision edge devices results in zero error and gradient values due to quantization, this paper proposes error scaling and small-gradient accumulation to achieve the same accuracy as ideal model training. The simulation results show that we can recover the accuracy from 51.08\% to 89.76\% with compensation and fine-tuning, and further improve it to 96.71\% with user customization. The chip implementation can successfully run the model with only 14 $\mu J$ per decision. Compared to state-of-the-art works, the presented design has higher energy efficiency with additional on-chip model customization capabilities for higher accuracy.  ( 2 min )
    An Edge-Cloud Integrated Framework for Flexible and Dynamic Stream Analytics. (arXiv:2205.04622v1 [cs.DC])
    With the popularity of the Internet of Things (IoT), edge computing and cloud computing, more and more stream analytics applications are being developed, including real-time trend prediction and object detection on top of IoT sensing data. One popular type of stream analytics is time series or sequence data prediction and forecasting based on recurrent neural network (RNN) deep learning models. Different from traditional analytics, which assumes that the data to be processed are available ahead of time and will not change, stream analytics deals with data that are generated continuously and whose trend/distribution could change (aka concept drift), which causes prediction/forecasting accuracy to drop over time. Another challenge is to find the best resource provisioning for stream analytics to achieve good overall latency. In this paper, we study how to best leverage edge and cloud resources to achieve better accuracy and latency for RNN-based stream analytics. We propose a novel edge-cloud integrated framework for hybrid stream analytics that supports low-latency inference on the edge and high-capacity training in the cloud. We study the flexible deployment of our hybrid learning framework, namely edge-centric, cloud-centric and edge-cloud integrated. Further, our hybrid learning framework can dynamically combine inference results from an RNN model pre-trained on historical data and another RNN model re-trained periodically on the most recent data. Using real-world and simulated stream datasets, our experiments show the proposed edge-cloud deployment is the best among all three deployment types in terms of latency. For accuracy, the experiments show our dynamic learning approach performs the best among all learning approaches for all three concept drift scenarios.  ( 2 min )
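    The dynamic combination of the two RNNs can be sketched as an inverse-error weighting of their predictions; this simple rule is an illustrative assumption and may differ from the paper's combiner:

    ```python
    import numpy as np

    def combine(pred_static, pred_recent, err_static, err_recent, eps=1e-8):
        """Blend a model pre-trained on history with one re-trained on
        recent data by weighting each prediction with the inverse of its
        recent rolling error (one simple choice, shown for illustration).
        """
        w_s = 1.0 / (np.mean(err_static) + eps)
        w_r = 1.0 / (np.mean(err_recent) + eps)
        return (w_s * np.asarray(pred_static) + w_r * np.asarray(pred_recent)) / (w_s + w_r)

    # The re-trained model has been more accurate lately, so it dominates.
    print(combine([1.0], [1.2], err_static=[0.5, 0.4], err_recent=[0.1, 0.2]))
    ```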
    Robust Learning of Parsimonious Deep Neural Networks. (arXiv:2205.04650v1 [cs.LG])
    We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Thus, the computational cost of subsequent training iterations, besides that of inference, is considerably reduced. Our method, based on variational inference principles, learns the posterior distribution of Bernoulli random variables multiplying the units/filters similarly to adaptive dropout. We derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection in a way that the Bernoulli parameters practically converge to either 0 or 1, establishing a deterministic final network. Our algorithm is robust in the sense that it achieves consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. We provide an analysis of its convergence properties establishing theoretical and practical pruning conditions. We evaluate the proposed algorithm on the MNIST data set and commonly used fully connected and convolutional LeNet architectures. The simulations show that our method achieves pruning levels on par with state-of-the-art methods for structured pruning, while maintaining better test accuracy and, more importantly, in a manner robust with respect to network initialization and initial size.  ( 2 min )
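    A minimal PyTorch sketch of unit-level gating: each output unit is multiplied by a learnable gate, with a penalty pushing gate probabilities toward 0 or 1. This is a deterministic relaxation of the paper's Bernoulli variables and omits the variational hyper-prior:

    ```python
    import torch
    import torch.nn as nn

    class GatedLinear(nn.Module):
        """Linear layer whose output units are scaled by learnable gates.

        A simplified stand-in for the paper's Bernoulli multipliers: the
        penalty below is zero only when every gate is deterministic, so
        training drives units toward "kept" (1) or "prunable" (0).
        """
        def __init__(self, d_in, d_out):
            super().__init__()
            self.lin = nn.Linear(d_in, d_out)
            self.gate_logits = nn.Parameter(torch.zeros(d_out))

        def forward(self, x):
            return self.lin(x) * torch.sigmoid(self.gate_logits)

        def gate_penalty(self):
            p = torch.sigmoid(self.gate_logits)
            return (p * (1 - p)).sum()   # vanishes at p in {0, 1}

    layer = GatedLinear(10, 32)
    loss = layer(torch.randn(4, 10)).pow(2).mean() + 1e-2 * layer.gate_penalty()
    loss.backward()  # gates receive gradients alongside the weights
    ```

    Units whose gates converge to 0 can be removed after (or even during) training, which is where the compute savings come from.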
    Real-Time Wearable Gait Phase Segmentation For Running And Walking. (arXiv:2205.04668v1 [cs.LG])
    Previous gait phase detection methods that cast the problem as a convolutional neural network (CNN) based classification task require cumbersome manual setting of time delays or heavily overlapped sliding windows to accurately classify each phase under different test cases, which is not suitable for streaming Inertial-Measurement-Unit (IMU) sensor data and fails to adapt to different scenarios. This paper presents segmentation-based gait phase detection with only a single six-axis IMU sensor, which can easily adapt to both walking and running at various speeds. The proposed segmentation uses a CNN with a gait-phase-aware receptive field setting and an IMU-oriented processing order, which fits IMU sampling rates up to 1000 Hz for high accuracy and down to 20 Hz for real-time computation. On 20 Hz data, the proposed model achieves an average error of 8.86 ms in swing time and 9.12 ms in stance time, with 96.44\% gait phase detection accuracy and 99.97\% stride detection accuracy. Its real-time implementation on a mobile phone takes only 36 ms per second of sensor data.  ( 2 min )
    Deep Gait Tracking With Inertial Measurement Unit. (arXiv:2205.04666v1 [cs.LG])
    This paper presents convolutional neural network based foot motion tracking with only six-axis Inertial-Measurement-Unit (IMU) sensor data. The presented approach can adapt to various walking conditions by adopting differential and window-based inputs. The training data are further augmented by sliding and random window sampling on the IMU sensor data to increase data diversity for better performance. The proposed approach fuses the predictions of the three-dimensional output into one model. The fused model achieves average errors of 2.30±2.23 cm in the X-axis, 0.91±0.95 cm in the Y-axis and 0.58±0.52 cm in the Z-axis.  ( 2 min )
  • Open

    Flexible variable selection in the presence of missing data. (arXiv:2202.12989v2 [stat.ME] UPDATED)
    In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose several nonparametric variable selection algorithms combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithms that achieve control of commonly used error rates. Through simulations, we show that our proposals have good operating characteristics and result in panels with higher classification performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed methods to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.  ( 2 min )
    Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path. (arXiv:2106.02073v4 [cs.LG] UPDATED)
    The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSE-trained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide, in a Google Colab notebook, PyTorch code for reproducing MSE-NC and CE-NC at https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.  ( 3 min )
    Nested conformal prediction and quantile out-of-bag ensemble methods. (arXiv:1910.10562v4 [stat.ME] UPDATED)
    Conformal prediction is a popular tool for providing valid prediction sets for classification and regression problems, without relying on any distributional assumptions on the data. While the traditional description of conformal prediction starts with a nonconformity score, we provide an alternate (but equivalent) view that starts with a sequence of nested sets and calibrates them to find a valid prediction set. The nested framework subsumes all nonconformity scores, including recent proposals based on quantile regression and density estimation. While these ideas were originally derived based on sample splitting, our framework seamlessly extends them to other aggregation schemes like cross-conformal, jackknife+ and out-of-bag methods. We use the framework to derive a new algorithm (QOOB, pronounced cube) that combines four ideas: quantile regression, cross-conformalization, ensemble methods and out-of-bag predictions. We develop a computationally efficient implementation of cross-conformal, that is also used by QOOB. In a detailed numerical investigation, QOOB performs either the best or close to the best on all simulated and real datasets. Code for QOOB is available at https://github.com/aigen/QOOB.  ( 2 min )
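    A minimal sketch of the split-conformal special case with quantile regression (QOOB itself adds cross-conformalization and out-of-bag ensembling); the toy data and model choice here are assumptions:

    ```python
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(600, 1))
    y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=600)
    X_tr, y_tr, X_cal, y_cal = X[:400], y[:400], X[400:], y[400:]

    alpha = 0.1  # target miscoverage
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_tr, y_tr)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_tr, y_tr)

    # Nonconformity score: how far y falls outside the nested quantile band.
    s = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
    q = np.quantile(s, np.ceil((1 - alpha) * (len(s) + 1)) / len(s))

    # Calibrated interval at a new point: inflate the band by q.
    x0 = np.array([[0.5]])
    print(lo.predict(x0) - q, hi.predict(x0) + q)
    ```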
    Adaptation Strategies for Automated Machine Learning on Evolving Data. (arXiv:2006.06480v3 [cs.LG] UPDATED)
    Automated Machine Learning (AutoML) systems have been shown to efficiently build good models for new datasets. However, it is often not clear how well they can adapt when the data evolves over time. The main goal of this study is to understand the effect of data stream challenges such as concept drift on the performance of AutoML methods, and which adaptation strategies can be employed to make them more robust. To that end, we propose 6 concept drift adaptation strategies and evaluate their effectiveness on different AutoML approaches. We do this for a variety of AutoML approaches for building machine learning pipelines, including those that leverage Bayesian optimization, genetic programming, and random search with automated stacking. These are evaluated empirically on real-world and synthetic data streams with different types of concept drift. Based on this analysis, we propose ways to develop more sophisticated and robust AutoML techniques.  ( 2 min )
    Mean Estimation from One-Bit Measurements. (arXiv:1901.03403v4 [cs.IT] UPDATED)
    We consider the problem of estimating the mean of a symmetric log-concave distribution under the constraint that only a single bit per sample from this distribution is available to the estimator. We study the mean squared error as a function of the sample size (and hence the number of bits). We consider three settings: first, a centralized setting, where an encoder may release $n$ bits given a sample of size $n$, and for which there is no asymptotic penalty for quantization; second, an adaptive setting in which each bit is a function of the current observation and previously recorded bits, where we show that the optimal relative efficiency compared to the sample mean is precisely the efficiency of the median; lastly, we show that in a distributed setting where each bit is only a function of a local sample, no estimator can achieve optimal efficiency uniformly over the parameter space. We additionally complement our results in the adaptive setting by showing that \emph{one} round of adaptivity is sufficient to achieve optimal mean-square error.  ( 2 min )
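    In the adaptive setting, a single comparison bit per sample can drive a Robbins-Monro recursion toward the median (which equals the mean under symmetry). A minimal sketch, with the 1/i step size as an illustrative choice rather than the paper's optimal scheme:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.laplace(loc=2.0, size=100_000)  # symmetric log-concave, mean 2

    theta = 0.0
    for i, xi in enumerate(x, start=1):
        bit = 1.0 if xi > theta else -1.0   # the single transmitted bit
        theta += bit / i                    # stochastic approximation of the median
    print(theta)  # converges (slowly, with this step size) toward 2.0
    ```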
    A High Throughput Generative Vector Autoregression Model for Stochastic Synapses. (arXiv:2205.05053v1 [cs.NE])
    By imitating the synaptic connectivity and plasticity of the brain, emerging electronic nanodevices offer new opportunities as the building blocks of neuromorphic systems. One challenge for large-scale simulations of computational architectures based on emerging devices is to accurately capture device response, hysteresis, noise, and the covariance structure in the temporal domain as well as between the different device parameters. We address this challenge with a high throughput generative model for synaptic arrays that is based on a recently available type of electrical measurement data for resistive memory cells. We map this real world data onto a vector autoregressive stochastic process to accurately reproduce the device parameters and their cross-correlation structure. While closely matching the measured data, our model is still very fast; we provide parallelized implementations for both CPUs and GPUs and demonstrate array sizes above one billion cells and throughputs exceeding one hundred million weight updates per second, above the pixel rate of a 30 frames/s 4K video stream.  ( 2 min )
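    The core sampling step can be sketched as a VAR(1) recursion with cross-correlated innovations, vectorized over cells; the transition matrix and noise covariance below are hypothetical placeholders for values fitted to measurement data:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d = 3                        # number of device parameters per cell
    A = 0.9 * np.eye(d)          # placeholder VAR(1) transition matrix
    L = np.linalg.cholesky(np.array([[1.0, 0.3, 0.1],
                                     [0.3, 1.0, 0.2],
                                     [0.1, 0.2, 1.0]]))  # correlated noise

    n_cells, n_steps = 10_000, 100
    x = np.zeros((n_cells, d))
    for _ in range(n_steps):     # one matrix op updates every cell at once
        noise = rng.normal(size=(n_cells, d)) @ L.T
        x = x @ A.T + noise
    print(x.mean(axis=0), np.corrcoef(x.T)[0, 1])
    ```

    Because the recursion is a single matrix multiply per step, it parallelizes trivially, which is what makes billion-cell arrays feasible.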
    Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting. (arXiv:2010.04456v6 [stat.ML] UPDATED)
    Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists in decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model, no more, no less. This not only provides the existence and uniqueness for this decomposition, but also ensures interpretability and benefits generalization. Experiments made on three important use cases, each representative of a different family of phenomena, i.e. reaction-diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters. Code is available at https://github.com/yuan-yin/APHYNITY .  ( 3 min )
    A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data. (arXiv:2201.12020v2 [stat.ML] UPDATED)
    This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or follows non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. This paper shows that this problem reduces to the estimation of a mixture of Angular Gaussian distributions under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, which may differ from one sample to another). In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.  ( 2 min )
    Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation. (arXiv:2102.07301v2 [cs.LG] UPDATED)
    We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $\sqrt{D}$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.  ( 2 min )
    Modeling Regime Shifts in Multiple Time Series. (arXiv:2109.09692v3 [cs.LG] UPDATED)
    We investigate the problem of discovering and modeling regime shifts in an ecosystem comprising multiple time series known as co-evolving time series. Regime shifts refer to the changing behaviors exhibited by series at different time intervals. Learning these changing behaviors is a key step toward time series forecasting. While advances have been made, existing methods suffer from one or more of the following shortcomings: (1) failure to take relationships between time series into consideration for discovering regimes in multiple time series; (2) lack of an effective approach that models time-dependent behaviors exhibited by series; (3) difficulties in handling data discontinuities which may be informative. Most of the existing methods are unable to handle all of these three issues in a unified framework. This, therefore, motivates our effort to devise a principled approach for modeling interactions and time-dependency in co-evolving time series. Specifically, we model an ecosystem of multiple time series by summarizing the heavy ensemble of time series into a lighter and more meaningful structure called a \textit{mapping grid}. By using the mapping grid, our model first learns time series behavioral dependencies through a dynamic network representation, then learns the regime transition mechanism via a full time-dependent Cox regression model. The originality of our approach lies in modeling interactions between time series in regime identification and in modeling time-dependent regime transition probabilities, usually assumed to be static in existing work.  ( 2 min )
    On the power of conditional independence testing under model-X. (arXiv:2005.05506v4 [math.ST] UPDATED)
    For testing conditional independence (CI) of a response Y and a predictor X given covariates Z, the recently introduced model-X (MX) framework has been the subject of active methodological research, especially in the context of MX knockoffs and their successful application to genome-wide association studies. In this paper, we study the power of MX CI tests, yielding quantitative explanations for empirically observed phenomena and novel insights to guide the design of MX methodology. We show that any valid MX CI test must also be valid conditionally on Y and Z; this conditioning allows us to reformulate the problem as testing a point null hypothesis involving the conditional distribution of X. The Neyman-Pearson lemma then implies that the conditional randomization test (CRT) based on a likelihood statistic is the most powerful MX CI test against a point alternative. We also obtain a related optimality result for MX knockoffs. Switching to an asymptotic framework with arbitrarily growing covariate dimension, we derive an expression for the limiting power of the CRT against local semiparametric alternatives in terms of the prediction error of the machine learning algorithm on which its test statistic is based. Finally, we exhibit a resampling-free test with uniform asymptotic Type-I error control under the assumption that only the first two moments of X given Z are known, a significant relaxation of the MX assumption.  ( 2 min )
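    A minimal sketch of the conditional randomization test (CRT) when X | Z is known exactly, using absolute correlation as the test statistic (the paper shows a likelihood statistic is most powerful; the Gaussian model here is an assumption):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, K = 300, 500
    Z = rng.normal(size=n)
    X = 0.8 * Z + 0.6 * rng.normal(size=n)  # model-X: X | Z ~ N(0.8 Z, 0.36)
    Y = 0.5 * Z + rng.normal(size=n)        # H0 holds: Y independent of X given Z

    def stat(x, y):
        return abs(np.corrcoef(x, y)[0, 1])

    # Resample X from its known conditional law and recompute the statistic.
    t_obs = stat(X, Y)
    t_null = [stat(0.8 * Z + 0.6 * rng.normal(size=n), Y) for _ in range(K)]
    p = (1 + sum(t >= t_obs for t in t_null)) / (K + 1)
    print(p)  # approximately uniform under H0
    ```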
    Gradient flows on graphons: existence, convergence, continuity equations. (arXiv:2111.09459v2 [math.PR] UPDATED)
    Wasserstein gradient flows on probability measures have found a host of applications in various optimization problems. They typically arise as the continuum limit of exchangeable particle systems evolving by some mean-field interaction involving a gradient-type potential. However, in many problems, such as in multi-layer neural networks, the so-called particles are edge weights on large graphs whose nodes are exchangeable. Such large graphs are known to converge to continuum limits called graphons as their size grows to infinity. We show that the Euclidean gradient flow of a suitable function of the edge-weights converges to a novel continuum limit given by a curve on the space of graphons that can be appropriately described as a gradient flow or, more technically, a curve of maximal slope. Several natural functions on graphons, such as homomorphism functions and the scalar entropy, are covered by our set-up, and the examples have been worked out in detail.  ( 2 min )
    Representing Hierarchical Structure by Using Cone Embedding. (arXiv:2102.08014v2 [cs.AI] UPDATED)
    Graph embedding is becoming an important method with applications in various areas, including social networks and knowledge graph completion. In particular, Poincar\'e embedding has been proposed to capture the hierarchical structure of graphs, and its effectiveness has been reported. However, most of the existing methods have isometric mappings in the embedding space, and the choice of the origin point can be arbitrary. This is not desirable when the distance from the origin is used as an indicator of hierarchy, as in the case of Poincar\'e embedding. In this paper, we propose cone embedding, an embedding method in a metric cone, which solves these problems and brings further benefits: 1) we provide an indicator of hierarchical information that is both geometrically and intuitively natural to interpret, and 2) we can extract the hierarchical structure from the graph embedding output of other methods by learning additional one-dimensional parameters.  ( 2 min )
    Community Detection with a Subsampled Semidefinite Program. (arXiv:2102.01419v3 [math.OC] UPDATED)
    Semidefinite programming is an important tool to tackle several problems in data science and signal processing, including clustering and community detection. However, semidefinite programs are often slow in practice, so speed up techniques such as sketching are often considered. In the context of community detection in the stochastic block model, Mixon and Xie \cite{mixon2020sketching} have recently proposed a sketching framework in which a semidefinite program is solved only on a subsampled subgraph of the network, giving rise to significant computational savings. In this short paper, we provide a positive answer to a conjecture of Mixon and Xie about the statistical limits of this technique for the stochastic block model with two balanced communities.  ( 2 min )
    A Wasserstein distance approach for concentration of empirical risk estimates. (arXiv:1902.10709v4 [math.ST] UPDATED)
    This paper presents a unified approach based on Wasserstein distance to derive concentration bounds for empirical estimates for two broad classes of risk measures defined in the paper. The classes of risk measures introduced include as special cases well known risk measures from the finance literature such as conditional value at risk (CVaR), optimized certainty equivalent risk, spectral risk measures, utility-based shortfall risk, cumulative prospect theory (CPT) value, rank dependent expected utility and distorted risk measures. Two estimation schemes are considered, one for each class of risk measures. One estimation scheme involves applying the risk measure to the empirical distribution function formed from a collection of i.i.d. samples of the random variable (r.v.), while the second scheme involves applying the same procedure to a truncated sample. The bounds provided apply to three popular classes of distributions, namely sub-Gaussian, sub-exponential and heavy-tailed distributions. The bounds are derived by first relating the estimation error to the Wasserstein distance between the true and empirical distributions, and then using recent concentration bounds for the latter. Previous concentration bounds are available only for specific risk measures such as CVaR and CPT-value. The bounds derived in this paper are shown to either match or improve upon previous bounds in cases where they are available. The usefulness of the bounds is illustrated through an algorithm and the corresponding regret bound for a stochastic bandit problem involving a general risk measure from each of the two classes introduced in the paper.  ( 2 min )
    Differentially Private Learning with Adaptive Clipping. (arXiv:1905.03871v5 [cs.LG] UPDATED)
    Existing approaches for training neural networks with user-level differential privacy (e.g., DP Federated Averaging) in federated learning (FL) settings involve bounding the contribution of each user's model update by clipping it to some constant value. However, there is no good a priori setting of the clipping norm across tasks and learning settings: the update norm distribution depends on the model architecture and loss, the amount of data on each device, the client learning rate, and possibly various other parameters. We propose a method wherein instead of a fixed clipping norm, one clips to a value at a specified quantile of the update norm distribution, where the value at the quantile is itself estimated online, with differential privacy. The method tracks the quantile closely, uses a negligible amount of privacy budget, is compatible with other federated learning technologies such as compression and secure aggregation, and has a straightforward joint DP analysis with DP-FedAvg. Experiments demonstrate that adaptive clipping to the median update norm works well across a range of realistic federated learning tasks, sometimes outperforming even the best fixed clip chosen in hindsight, and without the need to tune any clipping hyperparameter.  ( 2 min )
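    The quantile-tracking update can be sketched as a geometric recursion on the clip norm: shrink it when too many updates fall below it, grow it otherwise. The simulation below omits the DP noise that the actual method adds to the clipped-fraction estimate:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    C, gamma, eta = 1.0, 0.5, 0.2  # clip norm, target quantile, learning rate

    for _ in range(200):
        # Per-user update norms for one round (lognormal is illustrative).
        norms = rng.lognormal(mean=1.0, sigma=0.5, size=64)
        b = np.mean(norms <= C)         # fraction of updates left unclipped
        # Geometric update toward the gamma-quantile; in the private method
        # b would be noised before this step.
        C *= np.exp(-eta * (b - gamma))

    # C should sit near the true gamma-quantile of the norm distribution.
    print(C, np.quantile(rng.lognormal(1.0, 0.5, 100_000), gamma))
    ```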
    Random Forests for Change Point Detection. (arXiv:2205.04997v1 [stat.ME])
    We propose a novel multivariate nonparametric multiple change point detection method using classifiers. We construct a classifier log-likelihood ratio that uses class probability predictions to compare different change point configurations. We propose a computationally feasible search method that is particularly well suited for random forests, denoted by changeforest. However, the method can be paired with any classifier that yields class probability predictions, which we illustrate by also using a k-nearest neighbor classifier. We provide theoretical results motivating our choices. In a large simulation study, our proposed changeforest method achieves improved empirical performance compared to existing multivariate nonparametric change point detection methods. An efficient implementation of our method is made available for R, Python, and Rust users in the changeforest software package.  ( 2 min )
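    The classifier log-likelihood ratio for one candidate split can be sketched as below; changeforest additionally searches over splits and segments, so this simplified statistic is an illustration rather than the packaged implementation:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)
    # Mean shift at t = 100 in a 3-dimensional series.
    X = np.vstack([rng.normal(0.0, 1, size=(100, 3)),
                   rng.normal(0.8, 1, size=(100, 3))])

    def split_gain(X, s):
        """Classifier log-likelihood ratio for candidate split s: label
        points by side of the split, fit a classifier, and score how much
        its class probabilities beat the label prior."""
        y = (np.arange(len(X)) >= s).astype(int)
        p = cross_val_predict(RandomForestClassifier(n_estimators=100),
                              X, y, cv=5, method="predict_proba")[:, 1]
        p = np.clip(p, 1e-6, 1 - 1e-6)
        prior = y.mean()
        return np.sum(y * np.log(p / prior)
                      + (1 - y) * np.log((1 - p) / (1 - prior)))

    print(split_gain(X, 100), split_gain(X, 50))  # true split scores higher
    ```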
    Moving Beyond Sub-Gaussianity in High-Dimensional Statistics: Applications in Covariance Estimation and Linear Regression. (arXiv:1804.02605v4 [math.ST] UPDATED)
    Concentration inequalities form an essential toolkit in the study of high dimensional (HD) statistical methods. Most of the relevant statistics literature in this regard is based on sub-Gaussian or sub-exponential tail assumptions. In this paper, we first bring together various probabilistic inequalities for sums of independent random variables under much more general exponential type (namely sub-Weibull) tail assumptions. These results extract a part sub-Gaussian tail behavior in finite samples, matching the asymptotics governed by the central limit theorem, and are compactly represented in terms of a new Orlicz quasi-norm - the Generalized Bernstein-Orlicz norm - that typifies such tail behaviors. We illustrate the usefulness of these inequalities through the analysis of four fundamental problems in HD statistics. In the first two problems, we study the rate of convergence of the sample covariance matrix in terms of the maximum elementwise norm and the maximum k-sub-matrix operator norm which are key quantities of interest in bootstrap, HD covariance matrix estimation and HD inference. The third example concerns the restricted eigenvalue condition, required in HD linear regression, which we verify for all sub-Weibull random vectors through a unified analysis, and also prove a more general result related to restricted strong convexity in the process. In the final example, we consider the Lasso estimator for linear regression and establish its rate of convergence under much weaker than usual tail assumptions (on the errors as well as the covariates), while also allowing for misspecified models and both fixed and random design. To our knowledge, these are the first such results for Lasso obtained in this generality. The common feature in all our results over all the examples is that the convergence rates under most exponential tails match the usual ones under sub-Gaussian assumptions.  ( 3 min )
    Stabilized Doubly Robust Learning for Recommendation on Data Missing Not at Random. (arXiv:2205.04701v1 [cs.LG])
    In recommender systems, users always choose favorite items to rate, which results in data missing not at random and poses a great challenge for unbiased evaluation and learning of prediction models. Currently, the doubly robust (DR) method and its variants have been widely studied and demonstrate superior performance. However, we show that DR methods are unstable to extremely small propensities and rely on extrapolations, resulting in sub-optimal performances. In this paper, we propose a stabilized doubly robust (SDR) estimator to address the above limitations while retaining double robustness. Theoretical analysis shows that SDR has bounded bias, variance and generalization error bound under inaccurate imputed errors and arbitrarily small propensities. In addition, we propose a novel learning approach for SDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions. Extensive experiments show that our approach significantly outperforms the existing methods.  ( 2 min )
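    The plain DR estimator, with propensity clipping shown as one simple stabilizer (a stand-in for SDR's actual construction), can be sketched as:

    ```python
    import numpy as np

    def dr_estimate(o, r, r_hat, p_hat, p_min=0.05):
        """Doubly robust estimate of the mean prediction error over all
        user-item pairs. o: observation indicator; r: observed errors
        (zero where unobserved); r_hat: imputed errors; p_hat: estimated
        propensities. Flooring p_hat at p_min is one crude stabilizer
        against the extreme-propensity instability the paper describes.
        """
        p = np.maximum(p_hat, p_min)
        return np.mean(r_hat + o * (r - r_hat) / p)

    rng = np.random.default_rng(0)
    n = 10_000
    p_true = rng.uniform(0.01, 0.5, size=n)   # small propensities included
    o = rng.random(n) < p_true
    r = rng.normal(1.0, 0.2, size=n) * o      # true mean error is 1.0
    print(dr_estimate(o, r, r_hat=np.full(n, 0.9), p_hat=p_true))
    ```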
    Fixed-point iterations for several dissimilarity measure barycenters in the Gaussian case. (arXiv:2205.04806v1 [stat.CO])
    In target tracking and sensor fusion contexts it is not unusual to deal with a large number of Gaussian densities that encode the available information (multiple hypotheses), as in applications where many sensors, affected by clutter or multimodal noise, take measurements on the same scene. In such cases reduction procedures must be implemented, with the purpose of limiting the computational load. In some situations it is required to fuse all available information into a single hypothesis, and this is usually done by computing the barycenter of the set. However, such computation strongly depends on the chosen dissimilarity measure, and most often it must be performed making use of numerical methods, since in very few cases the barycenter can be computed analytically. Some issues, like the constraint that the covariance must be symmetric and positive definite, make the numerical computation of the barycenter of a set of Gaussians hard. In this work, Fixed-Point Iterations (FPI) are presented for the computation of barycenters according to several dissimilarity measures, making up a useful toolbox for the fusion/reduction of Gaussian sets in applications where specific dissimilarity measures are required.  ( 2 min )
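    For the 2-Wasserstein dissimilarity, the barycenter covariance of a set of Gaussians satisfies a well-known fixed-point equation; a minimal sketch follows (the paper's toolbox covers several other dissimilarity measures, whose iterations differ):

    ```python
    import numpy as np
    from numpy.linalg import inv
    from scipy.linalg import sqrtm

    def w2_barycenter_cov(covs, weights, n_iter=50):
        """Fixed-point iteration for the covariance of the 2-Wasserstein
        barycenter of Gaussians (the barycenter mean is just the weighted
        average of the means, so only the covariance needs iteration)."""
        S = np.mean(covs, axis=0)  # any SPD initializer works
        for _ in range(n_iter):
            R = sqrtm(S)
            T = sum(w * sqrtm(R @ C @ R) for w, C in zip(weights, covs))
            Ri = inv(R)
            # np.real strips tiny imaginary residue from sqrtm.
            S = np.real(Ri @ T @ T @ Ri)
        return S

    covs = [np.diag([1.0, 2.0]), np.diag([3.0, 0.5])]
    print(w2_barycenter_cov(covs, [0.5, 0.5]))
    ```

    Note how the symmetric-positive-definite constraint is handled implicitly: each iterate is built from matrix square roots of SPD matrices, so it stays in the feasible set.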
    A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks. (arXiv:2205.05040v1 [cs.LG])
    In distributed training of deep neural networks or Federated Learning (FL), people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the FL setting is still in its infancy: it remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with nonconvex loss function, non-Lipschitz continuous gradient, and skipping communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape which LSTM was shown to satisfy in previous works and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have $O\left(\frac{1}{N\epsilon^4}\right)$ iteration complexity for finding an $\epsilon$-stationary point, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup. We prove this result by introducing novel analysis techniques of estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence speed in practice and thus validates our theory.  ( 2 min )
    Turtle Score -- Similarity Based Developer Analyzer. (arXiv:2205.04876v1 [stat.ML])
    In day-to-day life, a highly demanding task for IT companies is to find the right candidates who fit the company's culture. This research aims to comprehend, analyze and automatically produce convincing outcomes to find a candidate who fits the company perfectly. Data is examined and collected for each employee who works in the IT domain, focusing on their performance measures. This is done across various categories, which brings versatility and a wide view of focus. Learner analysis is then performed on this data using machine learning algorithms to obtain learner similarity and developer similarity, in order to recruit people with identical working patterns. It has been shown that the efficiency and capability of a particular worker increase when working with a person of a similar personality. This can therefore serve as a useful tool for recruiters who aim to recruit people with high productivity. The designed model renders the best outcome possible, with high accuracy and an immaculate recommendation score.  ( 2 min )
    Don't Throw it Away! The Utility of Unlabeled Data in Fair Decision Making. (arXiv:2205.04790v1 [stat.ML])
    Decision making algorithms, in practice, are often trained on data that exhibits a variety of biases. Decision-makers often aim to take decisions based on some ground-truth target that is assumed or expected to be unbiased, i.e., equally distributed across socially salient groups. In many practical settings, the ground-truth cannot be directly observed, and instead, we have to rely on a biased proxy measure of the ground-truth, i.e., biased labels, in the data. In addition, data is often selectively labeled, i.e., even the biased labels are only observed for a small fraction of the data that received a positive decision. To overcome label and selection biases, recent work proposes to learn stochastic, exploring decision policies via i) online training of new policies at each time-step and ii) enforcing fairness as a constraint on performance. However, the existing approach uses only labeled data, disregarding a large amount of unlabeled data, and thereby suffers from high instability and variance in the learned decision policies at different times. In this paper, we propose a novel method based on a variational autoencoder for practical fair decision-making. Our method learns an unbiased data representation leveraging both labeled and unlabeled data and uses the representations to learn a policy in an online process. Using synthetic data, we empirically validate that our method converges to the optimal (fair) policy according to the ground-truth with low variance. In real-world experiments, we further show that our training approach not only offers a more stable learning process but also yields policies with higher fairness as well as utility than previous approaches.  ( 2 min )
    KEMP: Keyframe-Based Hierarchical End-to-End Deep Model for Long-Term Trajectory Prediction. (arXiv:2205.04624v1 [cs.CV])
    Predicting future trajectories of road agents is a critical task for autonomous driving. Recent goal-based trajectory prediction methods, such as DenseTNT and PECNet, have shown good performance on prediction tasks on public datasets. However, they usually require complicated goal-selection algorithms and optimization. In this work, we propose KEMP, a hierarchical end-to-end deep learning framework for trajectory prediction. At the core of our framework is keyframe-based trajectory prediction, where keyframes are representative states that trace out the general direction of the trajectory. KEMP first predicts keyframes conditioned on the road context, and then fills in intermediate states conditioned on the keyframes and the road context. Under our general framework, goal-conditioned methods are special cases in which the number of keyframes equals one. Unlike goal-conditioned methods, our keyframe predictor is learned automatically and does not require hand-crafted goal-selection algorithms. We evaluate our model on public benchmarks and our model ranked 1st on the Waymo Open Motion Dataset Leaderboard (as of September 1, 2021).  ( 2 min )
    A Probabilistic Generative Model of Free Categories. (arXiv:2205.04545v1 [cs.AI])
    Applied category theory has recently developed libraries for computing with morphisms in interesting categories, while machine learning has developed ways of learning programs in interesting languages. Taking the analogy between categories and languages seriously, this paper defines a probabilistic generative model of morphisms in free monoidal categories over domain-specific generating objects and morphisms. The paper shows how acyclic directed wiring diagrams can model specifications for morphisms, which the model can use to generate morphisms. Amortized variational inference in the generative model then enables learning of parameters (by maximum likelihood) and inference of latent variables (by Bayesian inversion). A concrete experiment shows that the free category prior achieves competitive reconstruction performance on the Omniglot dataset.  ( 2 min )
    Matrix and graph representations of vine copula structures. (arXiv:2205.04783v1 [stat.ML])
    Vine copulas can efficiently model a large portion of probability distributions. This paper focuses on a more thorough understanding of their structures. We build on well-known existing constructions to represent vine copulas with graphs as well as matrices. The graph representations include the regular, cherry and chordal graph sequence structures, which we show to be equivalent. Importantly, we also show that when a perfect elimination ordering of a vine structure is given, it can always be uniquely represented with a matrix. O. M. N\'apoles has shown a way to represent vine structures in a matrix, and we give an algorithmic formulation of this previous approach, while also presenting a new method for constructing such a matrix through cherry tree sequences. Lastly, we prove that these two matrix-building algorithms are equivalent if the same perfect elimination ordering is used.  ( 2 min )
    Real-time Forecasting of Time Series in Financial Markets Using Sequentially Trained Many-to-one LSTMs. (arXiv:2205.04678v1 [cs.LG])
    Financial markets are highly complex and volatile; thus, learning about such markets in order to make predictions is vital for issuing early alerts about crashes and subsequent recoveries. People have used learning tools from diverse fields such as financial mathematics and machine learning in an attempt to make trustworthy predictions on such markets. However, the accuracy of such techniques was not adequate until artificial neural network (ANN) frameworks were developed. Moreover, making accurate real-time predictions of financial time series is highly sensitive to the ANN architecture in use and the procedure used to train it. Long short-term memory (LSTM) is a member of the recurrent neural network family that has been widely used for time series prediction. Specifically, we train two LSTMs with a known length, say $T$ time steps, of previous data and predict only one time step ahead. At each iteration, while one LSTM is employed to find the best number of epochs, the second LSTM is trained only for that best number of epochs to make predictions. We treat the current prediction as part of the training set for the next prediction and retrain the same LSTM. While classic training schemes incur more error as predictions move further into the test period, our approach maintains superior accuracy, since training continues as the method proceeds through the testing period. The forecasting accuracy of our approach is validated using three time series from each of three diverse financial markets: stock, cryptocurrency, and commodity. The results are compared with those of an extended Kalman filter, an autoregressive model, and an autoregressive integrated moving average model.  ( 2 min )
    Entropic CLT for Order Statistics. (arXiv:2205.04621v1 [cs.IT])
    It is well known that central order statistics exhibit a central limit behavior and converge to a Gaussian distribution as the sample size grows. This paper strengthens this known result by establishing an entropic version of the CLT that ensures a stronger mode of convergence using the relative entropy. In particular, an order $O(1/\sqrt{n})$ rate of convergence is established under mild conditions on the parent distribution of the sample generating the order statistics. To prove this result, ancillary results on order statistics are derived, which might be of independent interest.  ( 2 min )
    A Unified Bayesian Framework for Pricing Catastrophe Bond Derivatives. (arXiv:2205.04520v1 [q-fin.PR])
    Catastrophe (CAT) bond markets are incomplete and hence carry uncertainty in instrument pricing. As such, various pricing approaches have been proposed, but none treat the uncertainty in catastrophe occurrences and interest rates in a sufficiently flexible and statistically reliable way within a unifying asset pricing framework. Consequently, little is known empirically about the expected risk-premia of CAT bonds. The primary contribution of this paper is to present a unified Bayesian CAT bond pricing framework based on uncertainty quantification of catastrophes and interest rates. Our framework allows for complex beliefs about catastrophe risks to capture the distinct and common patterns in catastrophe occurrences, and when combined with stochastic interest rates, yields a unified asset pricing approach with informative expected risk premia. Specifically, using a modified collective risk model -- Dirichlet Prior-Hierarchical Bayesian Collective Risk Model (DP-HBCRM) framework -- we model catastrophe risk via a model-based clustering approach. Interest rate risk is modeled as a CIR process under the Bayesian approach. As a consequence of casting CAT pricing models into our framework, we evaluate the price and expected risk premia of various CAT bond contracts corresponding to clusterings of catastrophe risk profiles. Numerical experiments show how these clusters relate CAT bond prices and expected risk premia to claim frequency and loss severity.  ( 2 min )

  • Open

    IsaacGym vs. Brax?
    Is someone able to compare IsaacGym to Brax? I'm particularly interested in how efficient Brax is in comparison to IsaacGym when running on a strong Desktop PC. Some companies might have the resources to waste energy on inefficient CPU computations; I don't. So I'm not really interested in the distributed computing part of Brax, at least as far as simulation on CPUs is concerned. submitted by /u/felixcra [link] [comments]  ( 1 min )
    Question about the OpenAI 5 Dota 2 Bot
    I have been reading about this bot, and it says it learned by playing an insane number of games, equivalent to many years of real time. My question is: how is it possible to play this many games? The only way I had known of previously, when it comes to reinforcement learning, is to completely re-code the game so you don't have to deal with the animations and time that come with actually playing the game, and can speed it up. The only other way I could think of is if they got hundreds of thousands of different accounts to play Dota 2, but that's certainly not what they did. Does anyone know how they were able to speed the playing process up to play so many games? submitted by /u/TheGeniusSkipper [link] [comments]  ( 1 min )
    Callbacks in the __init__ method of a gym environment
    I am working with a multiagent gym environment, and this is the init method: `def __init__(self, world, reset_callback=None, reward_callback=None, observation_callback=None, info_callback=None, done_callback=None, post_step_callback=None, shared_viewer=True, discrete_action=True)`. Can someone explain what those callbacks are? I would like to send the policy's LSTM state to the env at each timestep, and I would like to do so by passing it to the step function along with the action. Does that make sense? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    Books for engineers with ML experience?
    I've been working in ML and AI for a while, so I am familiar with deep neural networks and implementing them in pytorch and tensorflow. But every time I try to work on a reinforcement learning side project I end up with models that don't seem to learn anything useful. Part of it may just be that the reward space and learning process are different from, say, training a conv network to extract features and label images, which is a much easier problem to define. The state spaces, time component, and lack of clear labels just keep tripping me up. I'd love a book I can work through to help me get past these issues, but I don't really want to start with "what is a neural network" etc. submitted by /u/caedin8 [link] [comments]  ( 2 min )
    How to create a done and reward function for drone model?
    I want to implement the DDPG algorithm on a drone in the pybullet-drones simulator. In this simulator, I have the following observations of a drone: X, Y, Z position in WORLD_FRAME (in meters, 3 values); quaternion orientation in WORLD_FRAME (4 values); roll, pitch and yaw angles in WORLD_FRAME (in radians, 3 values); the velocity vector in WORLD_FRAME (in m/s, 3 values); angular velocity in WORLD_FRAME (3 values); motors' speeds (in RPMs, 4 values). I want the drone to follow a circular trajectory, and the episode to be done when the drone hits the ground. So how can I create reward and done functions for this problem? submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
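    A common starting point for questions like this is a dense tracking reward plus a crash termination. The sketch below is only illustrative: the circle radius, height, angular speed, and ground threshold are made-up constants, and pos is the X, Y, Z position from the observation.

    ```python
    import numpy as np

    def reward(pos, t, radius=1.0, height=1.0, omega=0.5):
        # negative distance to a reference point moving along the circle
        target = np.array([radius * np.cos(omega * t),
                           radius * np.sin(omega * t),
                           height])
        return -np.linalg.norm(pos - target)

    def done(pos, z_ground=0.05):
        # terminate the episode when the drone hits the ground
        return pos[2] <= z_ground
    ```

    A small penalty on motor RPMs or angular velocity is a common addition to discourage jerky control.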
    How to utilize existing knowledge while training the agent
    Hi all, I am currently trying to teach my robot-manipulator how to reach a goal position while considering the overall energy consumption. Here, I would like to integrate existing knowledge such as "try to avoid using q1, as it consumes a lot of energy". How could I initialize the training by utilizing this knowledge to boost the training speed? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
    I am training a chess AI using self-play. I want to train the AI, then save the model and make it play against the Stockfish engine. If I set reset timestep to False, will it still continue training the same model when I call model.learn again?
    submitted by /u/madmax_wush [link] [comments]  ( 1 min )
    Help needed finding a "complex" RL environment to test an idea on
    Hi All, I am a master's student in AI, and I have developed a new training method which I have been testing on a "simple" environment implemented in Python/Gym/Stable-baselines3/Pygame. My supervisor has said that I need to test the idea on a more "complex" environment, preferably with a high-dimensional state space (perhaps in the hundreds or thousands). Are there any "complex" environments that would be easy to quickly implement in Python/Gym/Pygame? I am running out of time, so I can't build the environment from scratch. Thanks a lot in advance. submitted by /u/C_BearHill [link] [comments]  ( 1 min )
    Conv3D Gym Env?
    Hi all, I'm working on an RL project in the medical domain, and I want to use 3D grids as the observation for my agent. I have two questions: Does anyone know of any work that's been done using 3D grids and Conv3D? Preferably as a Gym env! I've seen a lot of work with 3D environments, but the observation is usually just an image of the 3D env. Would it be better to use meshes or voxels as the input? Thank you! submitted by /u/leozinho2r [link] [comments]  ( 1 min )
    Evaluation seed
    I train with PPO using 64 envs, each of them having a different seed, and then evaluate on 10 envs with different seeds. Can you please tell me whether it is recommended to evaluate on a sample of the seeds used in training, or whether the evaluation seeds must be completely different? submitted by /u/FaithlessnessSuper46 [link] [comments]  ( 1 min )
  • Open

    Window to the Fourth Dimension | @artificial_artists on Instagram
    submitted by /u/atster11 [link] [comments]
    Blood Bridge
    submitted by /u/Hacknaut [link] [comments]
    Hey everyone! I'm not sure if this is the place for this, but I made an Instagram to post all the art I create on Dream by Wombo. Some of the pieces are really cool and I'm gonna keep posting on there so I figured I'd share the page with all of you in case you're interested! @Artificial_artists
    submitted by /u/atster11 [link] [comments]  ( 1 min )
    Data Science Interview Questions
    Preparing for a data scientist interview is tough, since it is unclear which data science questions you will be asked. An interviewer may surprise you with a set of unexpected questions, regardless of how much job experience you have or what data science credentials you possess. Read more submitted by /u/ridamughal110 [link] [comments]
    Use of AI in the Manufacturing Industry
    With continuous advancements in Artificial Intelligence (AI), manufacturers are racing to apply it in their manufacturing processes to boost product quality, operational efficiency, workforce safety, and more. Read more submitted by /u/ridamughal110 [link] [comments]
    Aiplague - Lost Underwater Garden (4K 60 FPS) Disco Diffusion
    submitted by /u/nalr00n [link] [comments]
    Doing body measurements with AI
    Hi r/artificial, I am developing an app in which I have to take a person's body measurements. I read somewhere that Artificial Intelligence is being used to measure houses, etc. Is it already possible to use AI to take body measurements of people? For example, their arm or chest size? Also, I would like to know what the costs are of developing this feature. Can you guys help me pursue my idea? submitted by /u/notmycupofnft [link] [comments]  ( 1 min )
    AI Dream 44 - Epic Trippy Dream (Dragon Wave) 8x
    submitted by /u/LordPewPew777 [link] [comments]
    New AI Robot Chef R2-D-Chew Learns To 'Taste' To Improve Its Cooking
    submitted by /u/getrich_or_diemining [link] [comments]
    Overview of GraphSage for GNNs
    submitted by /u/aidev2040 [link] [comments]
    Question about Roko's Basilisk (I'm not here to create discussion, it's about a uni essay)
    Sorry if you are tired of this stupid basilisk. I don't believe in it, but I'm doing an essay about it and I have a question: why does it only punish those who have heard about it? Why not punish everyone? I know it has to do with Pascal's wager and the idea that "God doesn't punish those who don't know about him", but I still didn't really understand it. My google-fu is too weak for this one; I searched and didn't find anything. Thank you for taking the time to read this! submitted by /u/Zholotoi [link] [comments]  ( 1 min )
    Ethical Concerns as a Result of AI!
    Machine learning is progressing fast thanks to neural networks, for the following reasons: the number of data banks has exploded; there has been a massive boost in processing power; and machine learning algorithms have vastly improved. Read full article: https://us.sganalytics.com/blog/top-ethical-challenges-in-ai-the-price-of-progress/ submitted by /u/JencyJane [link] [comments]
    BlobGAN enables object manipulation in an image
    submitted by /u/imapurplemango [link] [comments]  ( 1 min )
    How did you learn Ai?
    submitted by /u/SATelite_Media [link] [comments]  ( 1 min )
    Dynasties and Dystopia (made with starryai)
    submitted by /u/Losthel [link] [comments]  ( 1 min )
    Image Classification With TensorFlow.js
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Deepmind's latest AI has better visual understanding by combining a visual model and a language model
    submitted by /u/Zirius_Sadfaces [link] [comments]
    How do I create GAN imagery like Brock Hampton’s visualizers for ROADRUNNER?
    submitted by /u/AMORALESPLATA [link] [comments]
    MIT Researchers Create ‘ExSum’: A Mathematical Framework To Evaluate Explanations Of Machine Learning Models And Quantify How Well People Understand Them
    Machine learning is frequently referred to as a “black box” because the interactions between input and output become increasingly opaque as the model’s complexity grows. People’s understanding of these models is confined to how data is entered and final choices are made, and there is not much clarity in how these models make their predictions. While previous research has been done to determine how accurate the explanations given by these models are, the question of how quickly and reliably individuals grasp these models remains unexplored territory. Interpretability methods are being developed to better comprehend the functioning of black-box models, which is necessary for their reliable deployment. As a stepping stone in this field, researchers from the Computer Science and Artificial Intelligence Laboratory and Microsoft Research have developed a mathematical framework called explanatory summary (ExSum) for evaluating and quantifying how well individuals understand machine learning models. ExSum exposes different flaws in existing practice and aids in developing accurate model knowledge by identifying the model’s easily missed features. Other explanation properties, such as human alignment, robustness, and counterfactual minimality, are also included in the framework. The findings will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics. Continue Reading Paper: https://arxiv.org/pdf/2205.00130.pdf Github: https://yilunzhou.github.io/exsum/ submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    27 audio samples based on Kendrick Lamar's 'Top of the Morning' hook & VQGAN + Clip art
    submitted by /u/gloriousapplecart [link] [comments]
  • Open

    [D] Thoughts on Seldon and their MLOps?
    Context: I am an AI Product Manager at a start-up, trying to clean up its ML architecture (moving from batch inference to real-time at the moment). Just wondering if anyone has worked with Seldon for their MLOps needs before? They claimed their turnaround time could be a week (I am heavily skeptical of this), but they do claim to serve some bigger clients, so just wondering on folks' thoughts! submitted by /u/baxter3851 [link] [comments]  ( 1 min )
    [P] Accelerated Inference with Optimum and Transformers Pipelines
    Hey there 👋 It’s Lewis here from the open-source team at Hugging Face 🤗. I'm excited to share the latest release of our Optimum library, which provides a suite of performance optimization tools to make Transformers run fast on accelerated hardware! This release introduces a new set of inference classes, which resemble the autoclasses from Transformers, but run an ONNX model on ONNX Runtime. These classes are API-compatible with ordinary PyTorch / TensorFlow models, which means you can run inference using the nifty `pipeline()` function from Transformers! Here's a quick example: from transformers import AutoTokenizer, pipeline from optimum.onnxruntime import ORTModelForQuestionAnswering # Load ONNX checkpoint using the ONNX Runtime inference class model = ORTModelForQuestionAnswering.…  ( 1 min )
    [R] RWKV-v2-RNN : A parallelizable RNN with transformer-level LM performance, and without using attention
    Hi guys. I am an independent researcher, and you might know me (BlinkDL) if you are in the EleutherAI discord. I have built an RNN with transformer-level performance, without using attention. Moreover, it supports both sequential & parallel modes in inference and training. So it's combining the best of RNNs and transformers: great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding. https://github.com/BlinkDL/RWKV-LM I am training a L24-D1024 RWKV-v2-RNN LM (430M params) on the Pile (trained for 25B tokens as of now) with promising results, and it might reach GPT-Neo performance within 100B tokens. https://preview.redd.it/aftmbw4b4py81.png?width=852&format=png&auto=webp&s=56cf061dba0e90c79cbec7f37de86921bad3744c All of the trained models will be open-source. Inference is very fast (only matrix-vector multiplications, no matrix-matrix multiplications) even on CPUs, and I believe you can run a 1B-param RWKV-v2-RNN at reasonable speed on your phone. It is inspired by Apple's AFT (https://arxiv.org/abs/2105.14103) with a number of my own tricks, such as: RNNifying it, and using my CUDA kernel to speed up training (https://github.com/BlinkDL/RWKV-CUDA); token-shift (https://github.com/BlinkDL/RWKV-LM#token-shift-time-shift-mixing); SmallInitEmb (https://github.com/BlinkDL/SmallInitEmb), which helps the embedding quality and stabilizes Post-LN (which is what I am using). I also transferred some time-related parameters from a small model to a large model to speed up convergence. Basically, the model learns to focus more on short-distance interactions in early layers and long-distance interactions in later layers. https://preview.redd.it/ibk4ic0b6py81.png?width=865&format=png&auto=webp&s=78e4f794abd0fe25c8af8fd6634836a472e4120a Please feel free to ask questions :) submitted by /u/bo_peng [link] [comments]  ( 3 min )
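    For readers curious about the token-shift trick mentioned above, here is a minimal sketch of the idea as I read the linked repo (not the author's exact code): each position's channels are mixed with the previous position's before the usual projections.

    ```python
    import torch
    import torch.nn as nn

    class TokenShiftMix(nn.Module):
        def __init__(self, d_model):
            super().__init__()
            # pads one step of "past" at the front of the sequence, drops the last step
            self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))
            self.mix = nn.Parameter(torch.full((1, 1, d_model), 0.5))

        def forward(self, x):                # x: (batch, seq, d_model)
            x_prev = self.time_shift(x)      # x_prev[:, t] == x[:, t - 1], zeros at t == 0
            return x * self.mix + x_prev * (1 - self.mix)

    x = torch.randn(2, 16, 64)
    print(TokenShiftMix(64)(x).shape)        # torch.Size([2, 16, 64])
    ```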
    [D] Simple question - Does anyone know the size of BUFF dataset?
    Hi everyone, upon reading [PIFuHD: Multi-Level pixel-aligned implicit function for high-resolution 3d human digitization](https://arxiv.org/abs/2004.00452), I noticed that they used the [BUFF dataset](https://buff.is.tue.mpg.de/index.html), which is open only for academic purposes. To access this data, I have to register an academic e-mail and send some declaration papers. I just want to know how big this dataset is; maybe more than a TB? Does anyone know how big this is? submitted by /u/Frequent-Desk9222 [link] [comments]  ( 1 min )
    [D]How to evaluate complementary datasets for ML models?
    Evaluating ML models is a fundamental task and subfield of machine learning practice. On the other hand, I was not able to find any existing materials, guides, protocols, or papers on how to evaluate/score complementary datasets (resulting in multiple new features, a.k.a. a feature set) when they are added to an existing model (which is then retrained with these new features). This can be described as a with/without comparative approach. Let's say that there is some kind of ensemble model (e.g. catboost, lightGBM) trained on some X_0 and y data, with all the relevant testing metrics for that model type (classification or regression). We receive some new feature sets, X_1, X_2 ... X_n. Our goal is to get an overview of whether [X_0 X_1], [X_0 X_2] .... [X_0 X_n] extended feature mat…  ( 2 min )
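    One simple, hedged way to operationalize this with/without comparison is to score each candidate feature set by the change in a cross-validated metric; the model choice, metric, and toy data below are all assumptions.

    ```python
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X0 = rng.normal(size=(500, 10))              # existing features
    X1 = rng.normal(size=(500, 3))               # candidate feature set
    y = (X0[:, 0] + X1[:, 0] > 0).astype(int)    # toy target

    def cv_auc(X, y):
        return cross_val_score(GradientBoostingClassifier(), X, y,
                               cv=5, scoring="roc_auc").mean()

    base = cv_auc(X0, y)
    extended = cv_auc(np.hstack([X0, X1]), y)
    print(f"baseline={base:.3f}  with X1={extended:.3f}  lift={extended - base:+.3f}")
    ```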
    [P] Coco Image Semantic Segmentation Dataset Generator Command Line Tool
    Hi everyone, just wanted to share a really small (micro) Python command-line utility that interfaces with the `pycoco` library and allows you to generate TIFF masks from the COCO image dataset for training semantic segmentation models (i.e. UNet), with the option to filter by categories. I found it useful for extracting specific images from the COCO dataset for my own semantic segmentation project. The TIFF images contain pixel values that directly represent class labels 0, 1, 2, etc. Hope someone else finds it useful! https://github.com/ralampay/pycocosegmentor submitted by /u/ralampay [link] [comments]  ( 1 min )
    [Research] Industry Analysis: The AI Fairness Toolkits Landscape
    A thoughtful approach to AI ethics is becoming increasingly important for all organizations deriving value from AI. We hope that providing an overview of the top toolkits and resources that exist – starting with fairness and robustness – will help more companies adopt AI responsibly, with ethical principles at the core. https://www.borealisai.com/en/blog/industry-analysis-ai-fairness-toolkits-landscape/ submitted by /u/BorealisAI [link] [comments]  ( 1 min )
    [D] Data Scientist, ML Research Scientist, MLE, and Data Engineer
    I am new to non-academic research, but it seems like data scientists sometimes get unnecessarily shat on. From talking to people and reading job descriptions, DS (especially at a startup) often seems closer to a Research Scientist than to an MLE, and especially a DE. If what you want to do is research and model development, DS seems like the best option next to RS, which seem to be the hardest positions to get. I was initially wary of taking a DS role that might lock me out of being considered for future RS positions, but after getting more exposure I'm doubting that is true. What do you think? I'm curious to hear more opinions, because lots of people/companies seem to have different definitions for these roles. submitted by /u/Althonse [link] [comments]  ( 2 min )
    [R] NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
    submitted by /u/tobyoup [link] [comments]  ( 3 min )
    [P] Cluster and analyze text using both embedding and GPT models (interactive visualizations)
    Hi r/MachineLearning, You have likely seen UMAP visualizations of text embeddings before. I have been interested in imposing more information on such a figure to help in a text analysis flow. So I'd love to share two interactive exhibits with you: 1) Top 10,000 Hacker News articles of all time 2) Top 3,000 posts in Ask HN. I found the following steps produce an interesting result: (1) embed the text (for this example, I'm using the titles of the top-scoring 10K posts on Hacker News); (2) cluster the embeddings (k-means); (3) run UMAP and impose the clusters on the plot; (4) extract keywords from the clusters (using cTFIDF from the awesome BERTopic, which now also supports k-means!). Cluster naming: the last step is work in progress, which is to have a generative language model assign a better name to a cluster (using the extracted keywords). The post is linked below; see the sketch after this post for a minimal version of the pipeline. In addition to finding super interesting discussions on life insights, software career insights, and content recommendations, it also contains a notebook and the dataset, as well as the embeddings of the top 3K posts in Ask HN. Combing For Insight in 10,000 Hacker News Posts With Text Clustering https://txt.cohere.ai/combing-for-insight-in-10-000-hacker-news-posts-with-text-clustering/ Hope you find it useful! I'm fascinated by flows that use both generative and embedding models, as these are, for now, two stable NLP methods in a fast-changing field. Feel free to share any experiences or projects you've done along those lines! submitted by /u/jayalammar [link] [comments]  ( 2 min )
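    Here is the promised sketch of the pipeline, with open-source stand-ins for each step (the post itself uses Cohere embeddings, and cTFIDF from BERTopic for the keyword step):

    ```python
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    import umap

    titles = ["Show HN: my side project", "Ask HN: career advice",
              "Why I left big tech", "Postgres performance tips"] * 25  # stand-in corpus

    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(titles)               # 1) embed
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)  # 2) cluster
    coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(emb)        # 3) project
    # 4) scatter coords colored by labels; keyword extraction (cTFIDF) comes after
    ```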
    [P] New lightweight library for head-gesture detection
    The Nodding Pigeon library provides a pre-trained model and a simple inference API for detecting head gestures in short videos. Under the hood, it uses Google MediaPipe for collecting the landmark features. For ML practitioners, this project is also an example of using generative data from a small base-dataset for model training. Please take a look! :) https://github.com/bhky/nodding-pigeon submitted by /u/xtorch501 [link] [comments]  ( 1 min )
    Reconstructing data [Discussion]
    Hey guys, as a physicist, I learned supervised ML on the fly; I optimize multi-variable functions to reproduce quantum mechanics simulations. I have a quite different problem here: I would like to reconstruct data. I have thousands of lines of data, and each line has several attributes. One of those attributes is missing for some of the data, and it can only take a few possible values. I need to reconstruct it based on the other attributes. I started to dig a bit into the possible methods I could use. Could you give me a few pieces of advice or opinions to get me started? I would greatly appreciate any input. PS: I code in Python submitted by /u/ant_two [link] [comments]  ( 3 min )
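    One hedged starting point for the task described above: treat the missing attribute as a classification target and train on the rows where it is present. The file name and column names below are assumptions.

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("data.csv")                 # assumed: feature columns plus "target"
    features = [c for c in df.columns if c != "target"]
    known = df[df["target"].notna()]
    missing = df[df["target"].isna()]

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(known[features], known["target"])
    df.loc[missing.index, "target"] = clf.predict(missing[features])
    ```

    Holding out some of the complete rows as a test set tells you how much to trust the reconstruction.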
    [R] DALLE-2 paper explained - model architecture, results and image manipulation
    Here is a video explaining the model architecture of DALLE-2: https://youtu.be/Z8E3LxqE49M The paper title is "Hierarchical Text-Conditional Image Generation with CLIP Latents" and the arxiv link to the paper is here: https://arxiv.org/abs/2204.06125 The official website is here: https://openai.com/dall-e-2 Hope it's useful. submitted by /u/Combination-Fun [link] [comments]
    [P] Open-source to speed up deep learning inference by leveraging multiple optimization techniques (deep learning compilers, quantization, half precision, etc)
    Hi everyone, my name is Emile. After spending a long time trying many tools to speed up AI inference, some colleagues and I built an open-source library that bundles the best techniques around into a single interface (today's release features the integration of most of these techniques). Let me know what you think of this OSS! Thank you. The library is called nebullvm; it takes your AI model as input and outputs an optimized version that runs 2-30 times faster on your hardware. Nebullvm tests multiple optimization techniques (deep learning compilers, quantization, sparsity, distillation, and more) to identify the optimal way to execute your AI model on your specific hardware. The library can speed up your model 2 to 10 times without loss of performance (thanks to deep learning compilers, e.g. tensorrt, openvino, mlir, etc.), or up to 20-30 times if you specify that you are willing to trade off a self-defined amount of accuracy/precision to achieve even lower latency and a lighter model (which might be useful for edge devices). This further acceleration is achieved by leveraging techniques that slightly modify the graph of the model to make it lighter, such as quantization, half-precision, distillation, sparsity, etc. (more information about these techniques on the nebullvm github: https://github.com/nebuly-ai/nebullvm). The goal of nebullvm is to help other developers benefit from the most advanced inference optimization techniques without having to spend countless hours understanding, installing, testing and debugging these powerful technologies. I hope you enjoy the project, and let me know if you have any comments or advice, or what acceleration you get on your model/hardware. submitted by /u/emilec___ [link] [comments]  ( 2 min )
    [D] CIKM 2022 - Call for Demonstrations format
    As part of my PhD I have developed a web application in which one can apply different explainability methods to deep learning models and visualize the outcomes. The web application also addresses some specific use-cases which I (currently) think have not been addressed thus far. The application is in an early prototype stage, which is why I was considering submitting a demonstration paper to the CIKM demonstrations track, since it seems to fit pretty well when considering the format of papers that have been accepted as demos in previous years. However, I am not familiar with what the presentation format looks like if a demo paper gets accepted. Is it similar to a poster session where, in the case of a demo, one would stand beside their laptop and demonstrate the application to other attendees as they pass by, or does everyone with an accepted demo paper get a small time slot of 15-20 minutes to present the application and then receive questions? I would appreciate it if anyone who attended previous CIKM demo tracks could explain the format a bit. submitted by /u/purposeless_username [link] [comments]  ( 1 min )
    [D] Is it true that every algorithm to detect deepfakes can be used to generate better deepfakes?
    This position seems somewhat rational, but I seem to remember reading somewhere, a long time ago, that this was in fact not the case, though I can't remember what the argument was based on. Is the game of detecting/creating deepfakes really a cat and mouse game? submitted by /u/kugkfokj [link] [comments]  ( 3 min )
    [P] Deep RL Zoo - A collection of Deep RL algorithms implemented with PyTorch
    Hi guys, I recently created a new repo on Github. It contains a list of RL agents for solving discrete-action-space problems like classic control and Atari games, and includes the most recent algorithms from DeepMind, like Never Give Up and Agent57 (not yet fully tested on Atari games because of a lack of hardware resources). Hope you will find it helpful. https://github.com/michaelnny/deep_rl_zoo The post was originally posted on r/reinforcementlearning a few days ago, but I thought re-posting here might reach more people. Have a good day! submitted by /u/Top_Serve_2348 [link] [comments]  ( 1 min )
  • Open

    Is Data Mesh Fool’s Gold? Creating a Business-centric Data Strategy
    After talking to many customers recently at Dell Technologies World, I am very, very (very!) concerned about how many organizations are putting their Data Strategy success into the hands of Data Meshes. Sorry, but I think the way that IT organizations are thinking about a Data Mesh is fool’s gold. I think the data mesh (along… Read More »Is Data Mesh Fool’s Gold? Creating a Business-centric Data Strategy The post Is Data Mesh Fool’s Gold? Creating a Business-centric Data Strategy appeared first on Data Science Central.  ( 6 min )
    Top Technological HR Skills for the HR Professional
    The Human Resource Management department had a very challenging time due to the pandemic. Globally, the coronavirus pandemic has created a lot of havoc. People were more concerned about health and safety. This pandemic has given rise to new realities like social distancing, online (virtual) work, self-isolation, lockdowns, shutdowns, quarantine, shelter-in-place, and essential businesses. These concepts gave… Read More »Top Technological HR Skills for the HR Professional The post Top Technological HR Skills for the HR Professional appeared first on Data Science Central.  ( 4 min )
  • Open

    Create video subtitles with Amazon Transcribe using this no-code workflow
    Subtitle creation on video content poses challenges no matter how big or small the organization. To address those challenges, Amazon Transcribe has a helpful feature that enables subtitle creation directly within the service. There is no machine learning (ML) or code writing required to get started. This post walks you through setting up a no-code […]  ( 9 min )
  • Open

    Overview of GraphSage for GNNs
    submitted by /u/aidev2040 [link] [comments]
    Image Classification With TensorFlow.js
    submitted by /u/RubiksCodeNMZ [link] [comments]
  • Open

    Here is how an R-Learning agent beats humans
    Human intuition is usually good at dealing with concepts like averages or mean-values, whereas it often performs poorly when it comes to…  ( 9 min )
  • Open

    Creator Karen X. Cheng Brings Keen AI for Design ‘In the NVIDIA Studio’
    The future of content creation is in AI. This week In the NVIDIA Studio, discover how AI-assisted painting is bringing a new level of inspiration to the next generation of artists. The post Creator Karen X. Cheng Brings Keen AI for Design ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Approximating a golden spiral with circular arcs
    The previous post included this image of a logarithm spiral passing through the corners of squares in a sequence of golden rectangles. The portion of the spiral in each square looks like a quarter of a circle. How well would circular arcs approximate the spiral? Very well. Here’s a plot. The circular arc inside the […] Approximating a golden spiral with circular arcs first appeared on John D. Cook.  ( 1 min )

  • Open

    [P] It's settled: AutoArima is a lot(!) faster and more accurate than FB-Prophet. Now you can replace it with just two lines of code without making changes to your pipeline
    We benchmarked on more than 100K series and show that you can improve MAPE forecast accuracy by 17% with 37x less computational time using Nixtla's StatsForecast. That's the difference between paying $10 or $296 on AWS. It’s time to overcome the false prophets. Check Nixtla's FB-Prophet adapter: https://github.com/Nixtla/statsforecast/tree/main/experiments/arima_prophet_adapter https://preview.redd.it/fs3zqm4pcjy81.png?width=1280&format=png&auto=webp&s=326c82f04fae7df8934a434ba03fd5e683eadb99 The two lines you need submitted by /u/fedegarzar [link] [comments]  ( 1 min )
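    For readers wondering what the workflow looks like, here is a hedged sketch of fitting AutoARIMA with StatsForecast; the call signatures are from memory and may differ between versions, so check the linked repo for the exact adapter API.

    ```python
    import numpy as np
    import pandas as pd
    from statsforecast import StatsForecast
    from statsforecast.models import AutoARIMA

    # StatsForecast expects long-format data: one row per (series, timestamp)
    train_df = pd.DataFrame({
        "unique_id": "series_1",
        "ds": pd.date_range("2022-01-01", periods=100, freq="D"),
        "y": np.random.rand(100),                # stand-in values
    })

    sf = StatsForecast(models=[AutoARIMA(season_length=7)], freq="D")
    print(sf.forecast(df=train_df, h=28))        # 28-step-ahead forecasts
    ```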
    [D] What are some good platforms to host new datasets published in ML conferences?
    I recently published an NLP paper which included a new dataset. What are some good platforms to host new datasets (ours is around 5GB)? Preferably platforms which allow adding citation information and have low costs! I am currently checking out Hugging Face, but it seems like there are some size and dataset-type limitations. submitted by /u/shivamag99 [link] [comments]  ( 1 min )
    [D] Looking for the name of a particular regression technique
    This is a method I've read about somewhere, but I can't remember the reference or its name. You have, say, n = 500 points in two dimensions. You draw a line for each pair of points; that is, you have n(n-1)/2 lines. If n is too large, you can sample a few thousand lines. The envelope of these lines is your prediction band. The confidence level depends on the number of lines in your plot. It is entirely model-free and data-driven. It also works if, instead of a line, you use a 2-parameter curve (say you are dealing with logistic regression rather than linear regression). Note that one of the lines will be the best fit according to L1 (that is, the one minimizing the absolute residual error), while the traditional regression line is the best L2 fit (minimizing the squared residual error). The L1 version is more robust against outliers. This generalizes to d dimensions, with lines replaced by hyperplanes or hyper-surfaces. In this case, you need to look at all combinations of d points taken together to determine all the hyperplanes; in practice, you only use a sample. How is the method referred to in the literature? What would be a good reference on the topic? submitted by /u/MLRecipes [link] [comments]  ( 1 min )
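    A small sketch of the construction described above, sampling pairs rather than enumerating all n(n-1)/2 lines, and using a percentile envelope as one concrete choice of band:

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    n = 500
    x = rng.uniform(0, 10, n)
    y = 2 * x + 1 + rng.normal(0, 2, n)

    xs = np.linspace(0, 10, 200)
    lines = []
    for _ in range(2000):                        # sample pairs of points
        i, j = rng.choice(n, 2, replace=False)
        slope = (y[j] - y[i]) / (x[j] - x[i])
        lines.append(slope * (xs - x[i]) + y[i])
    lines = np.array(lines)

    lo, hi = np.percentile(lines, [5, 95], axis=0)
    plt.scatter(x, y, s=5, alpha=0.3)
    plt.fill_between(xs, lo, hi, alpha=0.3)      # envelope of the sampled lines
    plt.show()
    ```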
    [D] Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks
    Hi there, The Gradient has a new article that many of you might find interesting: Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks. The message-passing paradigm has been the “battle horse” of deep learning on graphs for several years, making graph neural networks a big success in a wide range of applications, from particle physics to protein design. From a theoretical viewpoint, it established the link to the Weisfeiler-Lehman hierarchy, allowing one to analyse the expressive power of GNNs. We argue that the “node and edge-centric” mindset of current graph deep learning schemes imposes strong limitations that hinder future progress in the field. As an alternative, we propose physics-inspired “continuous” learning models that open up a new trove of tools from the fields of differential geometry, algebraic topology, and differential equations, so far largely unexplored in graph ML. submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    [N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics
    👋 Hey there! Britney Muller here from Hugging Face. We've got some big news to share! Hugging Face Full Series C Announcement: https://huggingface.co/blog/series-c TechCrunch: https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/ We want to have a positive impact on the AI field. We think the direction of more responsible AI is through openly sharing models, datasets, training procedures, evaluation metrics and working together to solve issues. We believe open source and open science bring trust, robustness, reproducibility, and continuous innovation. With this in mind, we are leading BigScience, a collaborative workshop around the study and creation of very large language models gathering more than 1,000 researchers of all backgrounds and disciplines. We are now training the world's largest open source multilingual language model 🌸 Over 10,000 companies are now using Hugging Face to build technology with machine learning. Their Machine Learning scientists, Data scientists and Machine Learning engineers have saved countless hours while accelerating their machine learning roadmaps with the help of our products and services. ⚠️ But there’s still a huge amount of work left to do. At Hugging Face, we know that Machine Learning has some important limitations and challenges that need to be tackled now like biases, privacy, and energy consumption. With openness, transparency & collaboration, we can foster responsible & inclusive progress, understanding & accountability to mitigate these challenges. Thanks to the new funding, we’ll be doubling down on research, open-source, products and responsible democratization of AI. submitted by /u/Britney-Ramona [link] [comments]  ( 3 min )
    [D] Are there any software engineers that switched into a machine learning role and found it a lot more stressful due to deadlines combined with the uncertainty of research?
    I found that after switching to an ML role, where I am both managing and improving data pipelines while experimenting with different ML models and doing feature engineering, things are a lot more stressful than in a standard SWE job. As a SWE I had tasks, and I knew I was capable of doing all of them given some amount of time, and I just peacefully worked my way through them and had a lot of down time. With ML, I am continually trying new things, having to wait for the model to train and be evaluated, only to come back with incremental to zero improvements, all while still having to meet monthly or quarterly milestones. I don't have the guarantee that I can complete a task given some amount of time. There's so much uncertainty in the ML research process, and the stress is making me want to switch back to SWE. The data pipelining and SWE parts are easy; it's the contemplation of why my model isn't working, and what is the next appropriate step to take to improve it, that is difficult for me. 90% of the time the things I try don't lead to any improvement. To clear things up, I don't have a PhD in statistics or CS, only a BS/MS in CS, so while I studied ML from the CS side, my statistical understanding of ML isn't the best, which I believe may be contributing to this. Anyone else deal with this? Is this a common feeling in this field? submitted by /u/Legitimate_Bison3756 [link] [comments]  ( 10 min )
    [D] Stats person in a team with all computer science colleagues....
    This is something I am really curious to know... Are there folks out there with a predominantly stats background who are working in a machine learning / deep learning team where all colleagues come from computer science degrees? If so, could you please respond with your experience on: - What are the major challenges you are facing, if any? - Do you find the thought processes and approach of your colleagues to data, models, etc. very different from yours? Does it cause any issues in your work? - Do you ever feel overwhelmed with software development practices, installation, code check-in issues, etc.? For slightly better context: by 'predominantly stats' background, I mean folks with primary degrees in stats, with basic training in coding in Python, R, etc., but little or no training in stuff like ML engineering, building pipelines, code check-in/check-out, or developing code for very complex deep learning models with many classes, modules, libraries, etc. I'd really appreciate an in-depth answer. submitted by /u/sol2296 [link] [comments]  ( 2 min )
    Pros/cons of using model on different data without retraining? [D]
    Hi all, I'm looking into a classification problem where each observation is an aggregated geographical area, and I've trained some classification algorithms on it. The response variable is a binary indicator of poverty. My question is: what are the benefits or drawbacks of applying my best model, without re-training, to data at a lower level of aggregation or to data from different time periods? Thanks for your thoughts! submitted by /u/Pickles654 [link] [comments]  ( 1 min )
    [D] What are some ways to improve your *science* for a top-tier ML conference?
    There seemed to be some consensus in the responses to yesterday's post asking about ways to improve your paper for a top-tier ML conference: the primary thing you should do is ensure your research makes a top-tier-worthy contribution. That itself seemed worthy of further exploration, so I thought I would continue the discussion: What traits characterize a high-value research idea or insight? What are good strategies for pursuing one? submitted by /u/EmmyNoetherRing [link] [comments]  ( 2 min )
    [D] [R] Does anyone know how to implement the MostPop algorithm in recommendation systems?
    The MostPop algorithm is basically a baseline that always recommends the most popular items at the time, with metrics calculated from those recommendations. This algorithm keeps popping up in papers, but there is no description of it being used to calculate RMSE or similar errors. Using the knowledge that I currently have, I can calculate metrics such as Hit Rate (how many users have interacted with item i), but is there any way that I can use this algorithm, or a modification of it, to predict what rating would be given to item "i" by user "x"? submitted by /u/GustaMusto [link] [comments]  ( 1 min )
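    A hedged sketch of a MostPop-style baseline: rank items by interaction count and recommend the same top-N to every user; for rating prediction, a popularity-flavored fallback is the item's mean rating. The file and column names are assumptions.

    ```python
    import pandas as pd

    ratings = pd.read_csv("ratings.csv")         # assumed columns: user_id, item_id, rating
    popularity = ratings["item_id"].value_counts()
    item_mean = ratings.groupby("item_id")["rating"].mean()
    global_mean = ratings["rating"].mean()

    def recommend(n=10):
        # same list for every user: the n most-interacted-with items
        return popularity.head(n).index.tolist()

    def predict(user_id, item_id):
        # rating "prediction" that ignores the user entirely
        return item_mean.get(item_id, global_mean)
    ```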
    [D] IJCAI 2022 camera ready review notification.
    Has anyone got the email from the editors after submitting the camera ready ? submitted by /u/random_effective [link] [comments]
    [D] Is there any AI solution to "fill in" images between images?
    For example, I provide a 2D image of a stickman on the left of the frame and a second image of the stickman on the right of the frame (as if he were moving from left to right). Would there be an AI solution that would output an image of this stickman in the middle of the frame? Like an AI solution that guesses that this is a stickman moving from left to right and creates the image in between the 2 images I provided. I'm asking this while thinking about creating an animation using AI, where I just provide the "main" images/drawings and it "fills out" the rest. I googled a lot and couldn't find anything suitable for this. I'd love to hear your recommended libs/github profiles/notebooks that could help me achieve this. I'm a musician and data science practitioner, and I'd love to unite these two worlds in a video clip. submitted by /u/refrigerador82 [link] [comments]  ( 2 min )
    [D] Looking for ideas on NLP project
    Hello, I am trying to use text mining/text processing/NLP to apply certain changes to a big text file. More precisely, I have a big text file in html format (specifically a law). This law is then amended or changed by another law (a much smaller html file). For example, a law can be something like: The ten commandments Thou shalt have no other gods before me Thou shalt not make unto thee any graven image Thou shalt not take the name of the Lord thy God in vain etc... Then a change would be something like: Changes to the Ten commandments In The ten commandments, item 1 is changed to "Thou shalt not murder". Item 2 is removed. In item 3, after the word "vain", a comma is added followed by the sentence "unless the situation calls for it". The output needs to be: The ten commandments Thou shalt not murder Thou shalt not take the name of the Lord thy God in vain, unless the situation calls for it It's a really dumb example, but that's the gist of it. The laws are on average 10 to 40 pages long, while the changes are usually a couple of pages long. Basically, the input would be one giant text file and either one or, if possible, multiple text files that contain changes to be made to the original file. The output would be an amended law with all the changes applied. The language of the text is not English or any other popular language. The changes are not written in a standardized way, and there are also declensions and typos to consider, which is why I would prefer using an ML approach rather than analysing the text by brute force. How would you go about approaching a problem like this? I have some experience with neural nets and machine learning, but very little in NLP, so I am pretty lost. submitted by /u/WaferPlenty677 [link] [comments]  ( 1 min )
    [R] Happy to share my paper and Python code on efficient implementation of incremental proximal-point method for training machine-learning models.
    Paper: https://arxiv.org/abs/2205.01457 Code: https://github.com/alexshtf/inc_prox_pt Models are often trained using variants of the gradient update rule: xₜ₊₁ = xₜ − β∇ƒ(xₜ), where x is the vector of model parameters and ƒ is the cost function associated with the current (mini-batch of) training sample. This rule has another well-known interpretation, the proximal view: xₜ₊₁ = argmin { ƒ(xₜ) + ⟨∇ƒ(xₜ), x−xₜ⟩ + (1/2β) ‖x−xₜ‖² }, meaning "balance between minimizing a linear approximation of ƒ at xₜ and staying close to xₜ". The step-size β determines the balance between the two opposing forces. So what happens if we replace the linear approximation with the cost function itself? We obtain xₜ₊₁ = argmin { ƒ(x) + (1/2β) ‖x−xₜ‖² }. This is called the "proximal operator of ƒ", and algorithms of this sort are called "proximal-point methods". So why not always exploit the cost itself, instead of just its slope? What's the catch? Depending on the complexity of ƒ, the above can be extremely challenging to compute. So why should we bother? Various works, for example https://arxiv.org/pdf/1903.08619.pdf by Asi and Duchi, show that such methods provide better resilience to step-size choice, which may make hyper-parameter tuning much cheaper: just guess a few step-sizes and you will get decent performance. Various extensions and works in online optimization also show nice stability properties, such as resilience to temporal variability of the adversarial cost function. In this paper I developed efficient algorithms for computing the above step for a variety of functions ƒ useful in machine learning, and published a Python library based on PyTorch, which practitioners can use for some of the problem classes described in the paper, and which researchers developing new proximal-point methods can use to numerically demonstrate their brand-new "super proximal-point method with some cool momentum" on interesting problems. submitted by /u/alexsht1 [link] [comments]  ( 3 min )
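    To make these steps concrete, here is a small sketch (mine, not the paper's library) of one incremental proximal-point step for a single linear least-squares term, where the prox subproblem has an exact solution:

    ```python
    import numpy as np

    def prox_step_least_squares(x_t, a, b, beta):
        # argmin_x { 0.5*(a@x - b)**2 + (1/(2*beta))*||x - x_t||**2 }, in closed form
        residual = a @ x_t - b
        return x_t - (beta * residual / (1.0 + beta * (a @ a))) * a

    rng = np.random.default_rng(0)
    x_true = rng.normal(size=5)
    x = np.zeros(5)
    for _ in range(2000):                        # one random sample per step
        a = rng.normal(size=5)
        b = a @ x_true
        x = prox_step_least_squares(x, a, b, beta=1.0)
    print(np.linalg.norm(x - x_true))            # shrinks toward 0
    ```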
    [D] What is the largest / most diverse GAN model currently out there?
    Hi community, I'm currently building a fork of StyleCLIP global directions which allows you to control multiple semantic parameters simultaneously to generate and edit an image with StyleGAN and CLIP in real time. I want to showcase its potential as a design tool. Unfortunately, GAN weights are trained on very domain-specific data (faces, cars, churches). This makes them inferior to modern diffusion models, which I can use to generate whatever comes to mind. Although I know we won't have a GAN-based DALL-E counterpart anytime soon, I would still love to use my system with weights that can output a wide variety of things. So what are the biggest pretrained GAN models out there? Did anyone try to train one on LAION-400M or similar? submitted by /u/_chinatown [link] [comments]  ( 1 min )
    [D] What are some ways to improve your paper for a top-tier ML conference?
    I'm sure this is a really broad question, with no one-size-fits-all answer. I am an undergrad who is very new to this research field and still don't have a good sense of what constitutes a well-written, top-conference-worthy paper, so I would welcome any thoughts. From what I've seen, some of the best-written papers have the following: - A very clear statement of the impact of their contribution, usually near the beginning. This tells the reader why they should care about this work (and the reviewer why they should advocate for acceptance :D). - Usually a compelling teaser image in the second column of the first page (this is also good for Twitter publicity in the future). - Not sure if this is specific to my subfield (NLP), but the main results are enumerated and bolded at the …  ( 5 min )
    Calling all Developers [P]
    Want to help people get away from being frustrated by those pesky menu bots that run you through different options? Looking for an Alexa-like AI voice assistant that can solve problems it has learned before and send the problems it doesn't know to customer service. If interested, comment below; or if you think it's dumb, also let me know. All views welcome here. submitted by /u/Rhiquire [link] [comments]
  • Open

    Classification & Generation of Microscopy Images for Malaria via Artificial Neural Networks in Ghana
    submitted by /u/pasticciociccio [link] [comments]
    A.I. Farms?
    While looking through some DALL-E 2 pictures, I thought to myself: are there companies building AI farms, in a way? Basically taking requests from other companies to train an AI to do a certain thing. That could be a money maker, but as always, someone has probably already thought of that? submitted by /u/Swaggyswaggerson [link] [comments]
    Python AI for files
    So I got this exercise, where I get about 20 files with 5000 values in 6 columns (acceleration X, Y, Z and gyro X, Y, Z). The exercise is to code an AI for a smart watch that can detect whether the user fell down or is just shaking their hand, for example. Each of the 20 files represents either a fall or no fall. I have to code and train the AI in Python, but I have no idea how. I don't have the 20 files yet, but I have to code a template where I can just insert the files, so that the moment I get them I can train the AI. At first I misread the exercise and thought I would get only 6 values per row, and coded the AI like you see in the picture. But then I realized that if you fall, you don't have one specific acceleration and gyro value, but rather an interval of values. Then I remembered that each of the files is a 5-second recording, where a value was recorded every 10 milliseconds and written into the file. But I really don't have any idea how to compare files with 5000 values in 6 columns, or rather how to write an AI which can predict, given such a file, whether it's a fall or not. submitted by /u/Specific-Feeling-666 [link] [comments]  ( 2 min )
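    A hedged template for the setup described above: each file becomes one (5000, 6) training example with a single fall / no-fall label. The file format, paths, and labels file are assumptions, and with only ~20 recordings the model should stay tiny.

    ```python
    import glob
    import numpy as np
    import tensorflow as tf

    files = sorted(glob.glob("data/*.csv"))
    X = np.stack([np.loadtxt(f, delimiter=",") for f in files])  # (n_files, 5000, 6)
    y = np.loadtxt("labels.csv")                                 # one 0/1 label per file

    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(16, 9, activation="relu", input_shape=(5000, 6)),
        tf.keras.layers.MaxPooling1D(4),
        tf.keras.layers.Conv1D(32, 9, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=30, validation_split=0.2)
    ```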
    Here's a repository where I try to keep up with the most interesting research papers of 2022. It is a curated list of the latest breakthroughs in AI and Data Science by release date with a clear video explanation, link to a more in-depth article, and code (if applicable).
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Last Week in AI: New methods to detect deepfakes, AI for smarter farming, AI to detect heart problems from Apple Watch, and more!
    submitted by /u/regalalgorithm [link] [comments]
    This deep learning technique solves one of the tough challenges of robotics
    submitted by /u/bendee983 [link] [comments]
    Researchers at the Imperial College London have shown it is possible to perform artificial intelligence using tiny nanomagnets that interact like neurons in the brain.
    submitted by /u/DutchTechJunkie [link] [comments]  ( 1 min )
    Introduction to Tensorflow.js with Real-World Example
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Google AI Introduces A Method For Automating Inter- And Intra-Operator Parallelism For Distributed Deep Learning
    The rapidly rising size of deep learning models has in recent years swiftly overtaken the memory capacity of single accelerators. Earlier models, such as BERT (with a parameter size of 1GB), could be scaled across accelerators easily by utilizing data parallelism, which duplicates model weights across accelerators while splitting and distributing the training data. Recent huge models, such as GPT-3 (with a parameter size of 175GB), can only be scaled through model-parallel training, which involves partitioning a single model across many machines. While model parallelism allows the training of such large models, it is more challenging to implement since it must be tailored to the target neural networks and computing clusters. Megatron-LM, for example, splits weight matrices by rows or columns and then synchronizes the results across devices, a model-parallelism technique. In another approach, the operators in a neural network are divided into several groups, and the input data is split into micro-batches that are executed in a pipelined manner. Continue Reading Paper: https://arxiv.org/pdf/2201.12023.pdf Code: https://github.com/alpa-projects/alpa https://i.redd.it/6mu4lyihbey81.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Does anybody have at least some idea what this AI is called? 🐶😳😏✨
    submitted by /u/p0goniphaft111 [link] [comments]  ( 1 min )
  • Open

    PPO exploration on hidden layers
    Hi everyone, I have a question about PPO, but it can be extended to any stochastic policy. Imagine I have a DNN in which the first part is an MLP with 2 outputs, plus 1 additional layer (whatever it is). Usually PPO does exploration on the DNN's final output by sampling the stochastic policy. The problem is that for my particular application I need the final layer to be deterministic, without exploration; still, I would like to explore on the 2 MLP output neurons. Does this mess up the PPO loss (log probabilities and entropy)? submitted by /u/Tricky_Ad_853 [link] [comments]  ( 1 min )
    Whats the best RL lib?
    What's the best RL lib, or in general, what's the best approach to writing distributed RL like R2D2, etc.? 1) stable baselines 2) Ray RLlib 3) Other 4) Write from scratch? What do you guys use to implement distributed RL? submitted by /u/Defiant_Sun5579 [link] [comments]  ( 1 min )
    When you have a recurrent policy, should you reset the hidden states at the end of each episode or after each step?
    submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    Using CNNs in Reinforcement Learning
    For the last 18 months, I have been dabbling in DRL for both my MSc thesis and now also my PhD. I have, however, only worked with fixed-size, 1-dimensional state arrays for all of my problems, and have used both baseline algorithms and an adapted version of SAC to better suit my problem's needs. Note that even though I really like the main working principles of DRL and would love to dive deeper into the fundamentals, I do not have the required schooling in either mathematics or computer science to really afford going in that direction. Currently I am working on a problem for which I think either CNNs or GNNs might be very useful, but most of the implementations I can find use the same network structure for the CNNs in the Actor and Critic (& Value) networks without actually sharing the layers. Does this mean that all of these networks train different CNNs for similar purposes? Wouldn't it make sense for the CNN layers to be shared by all of the networks, as in principle they serve the same purpose: to transform the multi-dimensional input into a 1D representation that contains sufficient information for the FC layers? It might be that I understand this completely incorrectly and this is already the case; if it's not, could someone explain to me why you wouldn't want to share these layers between the different networks? submitted by /u/jangroterder [link] [comments]  ( 3 min )
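    For concreteness, here is a minimal PyTorch sketch of the shared-encoder layout the post asks about: one CNN torso computed once, with separate actor and critic heads. The 84x84 input size is an assumption, and whether sharing helps is algorithm-dependent (on-policy methods like PPO often share; some off-policy methods keep separate encoders).

    ```python
    import torch
    import torch.nn as nn

    class SharedEncoderActorCritic(nn.Module):
        def __init__(self, in_channels, n_actions):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Flatten(),
            )
            with torch.no_grad():                 # infer the flattened feature size
                feat = self.encoder(torch.zeros(1, in_channels, 84, 84)).shape[1]
            self.actor = nn.Linear(feat, n_actions)
            self.critic = nn.Linear(feat, 1)

        def forward(self, obs):
            z = self.encoder(obs)                 # computed once, used by both heads
            return self.actor(z), self.critic(z)
    ```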
    Changing the learning rate for DQN agent in stable baselines 3
    Hi guys, I am pretty new to the field of reinforcement learning and I am going through it every day. I made a custom environment with one state and 3 actions. I am trying to solve it with DQN, but it is taking too much time (for example, around 600k timesteps) to get to the max result. So I thought of changing the learning rate. I could change the learning rate for PPO, but it doesn't seem to be working for DQN. It would be really great if someone could help. Thanks in advance. submitted by /u/last_2_brain_cells97 [link] [comments]  ( 1 min )
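    For what it's worth, stable-baselines3 exposes the learning rate for DQN through the `learning_rate` constructor argument, either as a constant or as a schedule over the remaining training progress; a minimal sketch:

    ```python
    import gym
    from stable_baselines3 import DQN

    env = gym.make("CartPole-v1")

    model = DQN("MlpPolicy", env, learning_rate=1e-3)       # constant

    def linear_schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 at the start of training to 0.0 at the end
        return 1e-3 * progress_remaining

    model = DQN("MlpPolicy", env, learning_rate=linear_schedule)
    model.learn(total_timesteps=100_000)
    ```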
  • Open

    Logarithmic spiral
    I’ve seen an image similar to the following many times, but I don’t recall any source going into detail regarding how the spiral is constructed. This post will do just that. The previous post constructed iterated golden rectangles. We start with a golden rectangle and imagine chopping off first the blue, then the green, then […] Logarithmic spiral first appeared on John D. Cook.  ( 2 min )
    Iterated golden rectangles in detail
    I’ve seen the illustration of nesting golden rectangles many times, but I’ve never seen a presentation go into much detail. This post will go into more detail than usual, including Python code. Start with a golden rectangle in landscape mode. We’ll plot our rectangle with the lower left corner at the origin and the upper […] Iterated golden rectangles in detail first appeared on John D. Cook.  ( 2 min )
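The post includes Python; in the same spirit, here is a minimal sketch of the construction (the post's actual code differs): repeatedly cut a square off the current golden rectangle, cycling through the left, top, right, and bottom sides, and what remains is again a golden rectangle.

```python
import matplotlib.pyplot as plt

phi = (1 + 5**0.5) / 2
x, y, w, h = 0.0, 0.0, phi, 1.0  # start with a golden rectangle in landscape mode

fig, ax = plt.subplots()
for i in range(8):
    ax.add_patch(plt.Rectangle((x, y), w, h, fill=False))
    s = min(w, h)  # side of the square to chop off
    d = i % 4      # rotate the cut: left, top, right, bottom
    if d == 0:
        x, w = x + s, w - s
    elif d == 1:
        h -= s
    elif d == 2:
        w -= s
    else:
        y, h = y + s, h - s
ax.set_aspect("equal")
ax.autoscale()
plt.show()
```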
  • Open

    Utilize AWS AI services to automate content moderation and compliance
    The daily volume of third-party and user-generated content (UGC) across industries is increasing exponentially. Startups, social media, gaming, and other industries must ensure their customers are protected, while keeping operational costs down. Businesses in the broadcasting and media industries often find it difficult to efficiently add ratings to content pieces and formats to comply with […]  ( 6 min )
    Content moderation design patterns with AWS managed AI services
User-generated content (UGC) grows exponentially, as do the requirements and the cost to keep content and online communities safe and compliant. Modern web and mobile platforms, from startups to large organizations, fuel businesses and drive user engagement through social features. Online community members expect safe and inclusive experiences where they can freely consume and […]  ( 8 min )
  • Open

    More Freedom on the Freeway: AI Lifts Malaysia’s Toll Barriers
Working as an aerospace engineer in Malaysia, Chee How Lim dreamed of building a startup that could really take off. Today his company, Tapway, is riding a wave of computer vision and AI adoption in Southeast Asia. A call for help in 2019 with video analytics led to the Kuala Lumpur-based company’s biggest project to […] The post More Freedom on the Freeway: AI Lifts Malaysia’s Toll Barriers appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Q&A: Chris Rackauckas on the equations at the heart of practically everything
    Have a question about numerical differential equations? Odds are this CSAIL research affiliate has already addressed it.  ( 6 min )
  • Open

    Should the definition of Digital twins include simulation of complex systems?
In the previous post, we discussed the various definitions of digital twins and saw that there is no shortage of them! But here is a question we discussed in class: should the definition of digital twins include simulation of complex systems? In my opinion, simulation is the raison d’être for digital twins; let me explain (some of the ideas… The post Should the definition of Digital twins include simulation of complex systems? appeared first on Data Science Central.  ( 2 min )
  • Open

    Introduction to Tensorflow.js with Real-World Example
    submitted by /u/RubiksCodeNMZ [link] [comments]
  • Open

    Static Analyzers in Python
    Static analyzers are tools that help you check your code without really running your code. The most basic form of static analyzers is the syntax highlighters in your favorite editors. If you need to compile your code (say, in C++), your compiler, such as LLVM, may also provide some static analyzer functions to warn you […] The post Static Analyzers in Python appeared first on Machine Learning Mastery.  ( 12 min )
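For instance (taking mypy as an assumed example; the article surveys several tools), a static analyzer flags the type error below from the source alone, without executing anything:

```python
# demo.py -- running `mypy demo.py` reports the error without executing the code
def total(prices: list[float]) -> float:
    return sum(prices)

total("12.50")  # mypy: incompatible argument type "str"; expected "list[float]"
```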
  • Open

    Disentangled and Side-aware Unsupervised Domain Adaptation for Cross-dataset Subjective Tinnitus Diagnosis. (arXiv:2205.03230v1 [eess.SP])
EEG-based tinnitus classification is a valuable tool for tinnitus diagnosis, research, and treatments. Most current works are limited to a single dataset where data patterns are similar. However, EEG signals are highly non-stationary, resulting in a model's poor generalization to new users, sessions, or datasets. Thus, designing a model that can generalize to new datasets is beneficial and indispensable. To mitigate distribution discrepancy across datasets, we propose to achieve Disentangled and Side-aware Unsupervised Domain Adaptation (DSUDA) for cross-dataset tinnitus diagnosis. A disentangled auto-encoder is developed to decouple class-irrelevant information from the EEG signals to improve the classifying ability. The side-aware unsupervised domain adaptation module adapts the class-irrelevant information as domain variance to a new dataset and excludes the variance to obtain the class-distill features for the new dataset classification. It also aligns signals of the left and right ears to overcome inherent EEG pattern differences. We compare DSUDA with state-of-the-art methods, and our model achieves significant improvements over competitors regarding comprehensive evaluation criteria. The results demonstrate that our model can successfully generalize to a new dataset and effectively diagnose tinnitus.
    Predicting Loose-Fitting Garment Deformations Using Bone-Driven Motion Networks. (arXiv:2205.01355v2 [cs.GR] UPDATED)
    We present a learning algorithm that uses bone-driven motion networks to predict the deformation of loose-fitting garment meshes at interactive rates. Given a garment, we generate a simulation database and extract virtual bones from simulated mesh sequences using skin decomposition. At runtime, we separately compute low- and high-frequency deformations in a sequential manner. The low-frequency deformations are predicted by transferring body motions to virtual bones' motions, and the high-frequency deformations are estimated leveraging the global information of virtual bones' motions and local information extracted from low-frequency meshes. In addition, our method can estimate garment deformations caused by variations of the simulation parameters (e.g., fabric's bending stiffness) using an RBF kernel ensembling trained networks for different sets of simulation parameters. Through extensive comparisons, we show that our method outperforms state-of-the-art methods in terms of prediction accuracy of mesh deformations by about 20% in RMSE and 10% in Hausdorff distance and STED. The code and data are available at https://github.com/non-void/VirtualBones.
    Solar: $L_0$ solution path averaging for fast and accurate variable selection in high-dimensional data. (arXiv:2007.15707v3 [stat.ML] UPDATED)
    We propose a new variable selection algorithm, subsample-ordered least-angle regression (solar), and its coordinate descent generalization, solar-cd. Solar re-constructs lasso paths using the $L_0$ norm and averages the resulting solution paths across subsamples. Path averaging retains the ranking information of the informative variables while averaging out sensitivity to high dimensionality, improving variable selection stability, efficiency, and accuracy. We prove that: (i) with a high probability, path averaging perfectly separates informative variables from redundant variables on the average $L_0$ path; (ii) solar variable selection is consistent and accurate; and (iii) the probability that solar omits weak signals is controllable for finite sample size. We also demonstrate that: (i) solar yields, with less than $1/3$ of the lasso computation load, substantial improvements over lasso in terms of the sparsity (64-84\% reduction in redundant variable selection) and accuracy of variable selection; (ii) compared with the lasso safe/strong rule and variable screening, solar largely avoids selection of redundant variables and rejection of informative variables in the presence of complicated dependence structures; (iii) the sparsity and stability of solar conserves residual degrees of freedom for data-splitting hypothesis testing, improving the accuracy of post-selection inference on weak signals with limited $n$; (iv) replacing lasso with solar in bootstrap selection (e.g., bolasso or stability selection) produces a multi-layer variable ranking scheme that improves selection sparsity and ranking accuracy with the computation load of only one lasso realization; and (v) given the computation resources, solar bootstrap selection is substantially faster (98\% lower computation time) than the theoretical maximum speedup for parallelized bootstrap lasso (confirmed by Amdahl's law).
    A Framework for Evaluating Post Hoc Feature-Additive Explainers. (arXiv:2106.08376v2 [cs.LG] UPDATED)
    Many applications of data-driven models demand transparency of decisions, especially in health care, criminal justice, and other high-stakes environments. Modern trends in machine learning research have led to algorithms that are increasingly intricate to the degree that they are considered to be black boxes. In an effort to reduce the opacity of decisions, methods have been proposed to construe the inner workings of such models in a human-comprehensible manner. These post hoc techniques are described as being universal explainers - capable of faithfully augmenting decisions with algorithmic insight. Unfortunately, there is little agreement about what constitutes a "good" explanation. Moreover, current methods of explanation evaluation are derived from either subjective or proxy means. In this work, we propose a framework for the evaluation of post hoc explainers on ground truth that is directly derived from the additive structure of a model. We demonstrate the efficacy of the framework in understanding explainers by evaluating popular explainers on thousands of synthetic and several real-world tasks. The framework unveils that explanations may be accurate but misattribute the importance of individual features.
    Efficient and passive learning of networked dynamical systems driven by non-white exogenous inputs. (arXiv:2110.00852v3 [cs.LG] UPDATED)
We consider a networked linear dynamical system with $p$ agents/nodes. We study the problem of learning the underlying graph of interactions/dependencies from observations of the nodal trajectories over a time-interval $T$. We present a regularized non-causal consistent estimator for this problem and analyze its sample complexity over two regimes: (a) where the interval $T$ consists of $n$ i.i.d. observation windows of length $T/n$ (restart and record), and (b) where $T$ is one continuous observation window (consecutive). Using the theory of $M$-estimators, we show that the estimator recovers the underlying interactions, in either regime, in a time-interval that is logarithmic in the system size $p$. To the best of our knowledge, this is the first work to analyze the sample complexity of learning linear dynamical systems \emph{driven by unobserved not-white wide-sense stationary (WSS) inputs}.  ( 2 min )
    Learning Optimal Conformal Classifiers. (arXiv:2110.09192v3 [cs.LG] UPDATED)
Modern deep learning based classifiers show very high accuracy on test data, but this does not provide sufficient guarantees for safe deployment, especially in high-stakes AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's predictions, e.g., its probability estimates, to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training the model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. Compared to standard training, ConfTr reduces the average confidence set size (inefficiency) of state-of-the-art CP methods applied after training. Moreover, it allows us to "shape" the confidence sets predicted at test time, which is difficult for standard CP. In experiments with several datasets, we show that ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.  ( 2 min )
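For reference, the post-hoc step that ConfTr makes differentiable is split conformal prediction; a minimal sketch of that standard recipe (score and quantile choices follow the common formulation, not necessarily the paper's exact variant):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets with ~(1 - alpha) marginal coverage.

    cal_probs:  (n, K) softmax outputs on a held-out calibration set
    cal_labels: (n,) true labels for the calibration set
    test_probs: (m, K) softmax outputs on test points
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]  # nonconformity scores
    level = np.ceil((n + 1) * (1 - alpha)) / n          # finite-sample correction
    q_hat = np.quantile(scores, min(level, 1.0), method="higher")
    return test_probs >= 1.0 - q_hat                    # boolean (m, K) set matrix
```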
    Differentially private training of residual networks with scale normalisation. (arXiv:2203.00324v2 [cs.LG] UPDATED)
The training of neural networks with Differentially Private Stochastic Gradient Descent offers formal Differential Privacy guarantees but introduces accuracy trade-offs. In this work, we propose to alleviate these trade-offs in residual networks with Group Normalisation through a simple architectural modification termed ScaleNorm, by which an additional normalisation layer is introduced after the residual block's addition operation. Our method allows us to further improve on the recently reported state-of-the-art on CIFAR-10, achieving a top-1 accuracy of 82.5% ($\epsilon = 8.0$) when trained from scratch.  ( 2 min )
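A sketch of the described modification (GroupNorm as in the paper's setting; channel counts and group size are illustrative): the only change is one extra normalisation applied after the skip connection's addition.

```python
import torch.nn as nn

class ScaleNormBlock(nn.Module):
    """Residual block with an extra normalisation after the addition (sketch)."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(groups, channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(groups, channels),
        )
        self.post = nn.GroupNorm(groups, channels)  # the added normalisation layer

    def forward(self, x):
        return self.post(x + self.body(x))  # normalise *after* the residual addition
```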
    Active Offline Policy Selection. (arXiv:2106.10251v4 [cs.LG] UPDATED)
    This paper addresses the problem of policy selection in domains with abundant logged data, but with a restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between the evaluation by OPE and the full online evaluation. Yet, large amounts of online interactions are often not possible in practice. To overcome this problem, we introduce active offline policy selection - a novel sequential decision approach that combines logged data with online interaction to identify the best policy. We use OPE estimates to warm start the online evaluation. Then, in order to utilize the limited environment interactions wisely we decide which policy to evaluate next based on a Bayesian optimization method with a kernel that represents policy similarity. We use multiple benchmarks, including real-world robotics, with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation.  ( 2 min )
    What Makes A Good Fisherman? Linear Regression under Self-Selection Bias. (arXiv:2205.03246v1 [math.ST])
    In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.  ( 2 min )
    Federated Channel Learning for Intelligent Reflecting Surfaces With Fewer Pilot Signals. (arXiv:2205.03196v1 [eess.SP])
    Channel estimation is a critical task in intelligent reflecting surface (IRS)-assisted wireless systems due to the uncertainties imposed by environment dynamics and rapid changes in the IRS configuration. To deal with these uncertainties, deep learning (DL) approaches have been proposed. Previous works consider centralized learning (CL) approach for model training, which entails the collection of the whole training dataset from the users at the base station (BS), hence introducing huge transmission overhead for data collection. To address this challenge, this paper proposes a federated learning (FL) framework to jointly estimate both direct and cascaded channels in IRS-assisted wireless systems. We design a single convolutional neural network trained on the local datasets of the users without sending them to the BS. We show that the proposed FL-based channel estimation approach requires approximately 60% fewer pilot signals and it exhibits 12 times lower transmission overhead than CL, while maintaining satisfactory performance close to CL. In addition, it provides lower estimation error than the state-of-the-art DL-based schemes.  ( 2 min )
    Electrocardiographic Deep Learning for Predicting Post-Procedural Mortality. (arXiv:2205.03242v1 [eess.SP])
Background. Pre-operative risk assessments used in clinical practice are limited in their ability to identify risk for post-operative mortality. We hypothesize that electrocardiograms contain hidden risk markers that can help prognosticate post-operative mortality. Methods. In a derivation cohort of 45,969 pre-operative patients (age 59 ± 19 years, 55 percent women), a deep learning algorithm was developed to leverage waveform signals from pre-operative ECGs to discriminate post-operative mortality. Model performance was assessed in a holdout internal test dataset and in two external hospital cohorts and compared with the Revised Cardiac Risk Index (RCRI) score. Results. In the derivation cohort, there were 1,452 deaths. The algorithm discriminates mortality with an AUC of 0.83 (95% CI 0.79-0.87), surpassing the discrimination of the RCRI score with an AUC of 0.67 (CI 0.61-0.72) in the held-out test cohort. Patients determined to be high risk by the deep learning model's risk prediction had an unadjusted odds ratio (OR) of 8.83 (5.57-13.20) for post-operative mortality, as compared to an unadjusted OR of 2.08 (CI 0.77-3.50) for post-operative mortality for RCRI greater than 2. The deep learning algorithm performed similarly for patients undergoing cardiac surgery with an AUC of 0.85 (CI 0.77-0.92), non-cardiac surgery with an AUC of 0.83 (0.79-0.88), and catheterization or endoscopy suite procedures with an AUC of 0.76 (0.72-0.81). The algorithm similarly discriminated risk for mortality in two separate external validation cohorts from independent healthcare systems, with AUCs of 0.79 (0.75-0.83) and 0.75 (0.74-0.76) respectively. Conclusion. The findings demonstrate how a novel deep learning algorithm, applied to pre-operative ECGs, can improve discrimination of post-operative mortality.
    Convex Analysis at Infinity: An Introduction to Astral Space. (arXiv:2205.03260v1 [math.OC])
    Not all convex functions on $\mathbb{R}^n$ have finite minimizers; some can only be minimized by a sequence as it heads to infinity. In this work, we aim to develop a theory for understanding such minimizers at infinity. We study astral space, a compact extension of $\mathbb{R}^n$ to which such points at infinity have been added. Astral space is constructed to be as small as possible while still ensuring that all linear functions can be continuously extended to the new space. Although astral space includes all of $\mathbb{R}^n$, it is not a vector space, nor even a metric space. However, it is sufficiently well-structured to allow useful and meaningful extensions of concepts of convexity, conjugacy, and subdifferentials. We develop these concepts and analyze various properties of convex functions on astral space, including the detailed structure of their minimizers, exact characterizations of continuity, and convergence of descent algorithms.  ( 2 min )
    Implementation of a Binary Neural Network on a Passive Array of Magnetic Tunnel Junctions. (arXiv:2112.09159v2 [cs.ET] UPDATED)
The increasing scale of neural networks and their growing application space have produced demand for more energy- and memory-efficient artificial-intelligence-specific hardware. Avenues to mitigate the main issue, the von Neumann bottleneck, include in-memory and near-memory architectures, as well as algorithmic approaches. Here we leverage the low-power and the inherently binary operation of magnetic tunnel junctions (MTJs) to demonstrate neural network hardware inference based on passive arrays of MTJs. In general, transferring a trained network model to hardware for inference is confronted by degradation in performance due to device-to-device variations, write errors, parasitic resistance, and nonidealities in the substrate. To quantify the effect of these hardware realities, we benchmark 300 unique weight matrix solutions of a 2-layer perceptron to classify the Wine dataset for both classification accuracy and write fidelity. Despite device imperfections, we achieve software-equivalent accuracy of up to 95.3% with proper tuning of network parameters in 15 x 15 MTJ arrays having a range of device sizes. The success of this tuning process shows that new metrics are needed to characterize the performance and quality of networks reproduced in mixed signal hardware.  ( 2 min )
    Fine-tuning wav2vec2 for speaker recognition. (arXiv:2109.15053v2 [cs.SD] UPDATED)
    This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant, w2v2-aam, achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at https://github.com/nikvaessen/w2v2-speaker.  ( 2 min )
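On the pooling question, the simplest baseline is mean pooling over the wav2vec2 frame outputs; a minimal sketch with the HuggingFace checkpoint (the paper compares richer pooling variants beyond this):

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
wav = torch.randn(1, 16000)  # one second of 16 kHz audio (dummy input)
with torch.no_grad():
    hidden = model(wav).last_hidden_state  # (1, frames, 768)
embedding = hidden.mean(dim=1)             # fixed-length speaker embedding
```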
    Incremental Data-Uploading for Full-Quantum Classification. (arXiv:2205.03057v1 [quant-ph])
    The data representation in a machine-learning model strongly influences its performance. This becomes even more important for quantum machine learning models implemented on noisy intermediate scale quantum (NISQ) devices. Encoding high dimensional data into a quantum circuit for a NISQ device without any loss of information is not trivial and brings a lot of challenges. While simple encoding schemes (like single qubit rotational gates to encode high dimensional data) often lead to information loss within the circuit, complex encoding schemes with entanglement and data re-uploading lead to an increase in the encoding gate count. This is not well-suited for NISQ devices. This work proposes 'incremental data-uploading', a novel encoding pattern for high dimensional data that tackles these challenges. We spread the encoding gates for the feature vector of a given data point throughout the quantum circuit with parameterized gates in between them. This encoding pattern results in a better representation of data in the quantum circuit with a minimal pre-processing requirement. We show the efficiency of our encoding pattern on a classification task using the MNIST and Fashion-MNIST datasets, and compare different encoding methods via classification accuracy and the effective dimension of the model.  ( 2 min )
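A sketch of the pattern in Qiskit (gate choices and layer structure here are illustrative assumptions, not the paper's exact ansatz): the feature vector is encoded in slices spread through the circuit, with parameterized gates between the encoding layers.

```python
import numpy as np
from qiskit import QuantumCircuit

def incremental_uploading_circuit(features, params, n_qubits=4):
    """Spread feature-encoding RY gates through the circuit (sketch).

    params must hold n_qubits trainable angles per encoding layer.
    """
    qc = QuantumCircuit(n_qubits)
    chunks = np.array_split(features, max(1, len(features) // n_qubits))
    for layer, chunk in enumerate(chunks):
        for q, xval in enumerate(chunk):   # encode one slice of the features
            qc.ry(xval, q % n_qubits)
        for q in range(n_qubits):          # trainable layer in between
            qc.rz(params[layer * n_qubits + q], q)
        for q in range(n_qubits - 1):      # light entanglement
            qc.cx(q, q + 1)
    return qc
```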
    Gaussian Processes for Missing Value Imputation. (arXiv:2204.04648v2 [stat.ML] UPDATED)
Missing values are common in many real-life datasets. However, most current machine learning methods cannot handle missing values, which means they must be imputed beforehand. Gaussian Processes (GPs) are non-parametric models with accurate uncertainty estimates that, combined with sparse approximations and stochastic variational inference, scale to large data sets. Sparse GPs can be used to compute a predictive distribution for missing data. Here, we present a hierarchical composition of sparse GPs that is used to predict missing values at each dimension using all the variables from the other dimensions. We call the approach missing GP (MGP). MGP can be trained simultaneously to impute all observed missing values. Specifically, it outputs a predictive distribution for each missing value that is then used in the imputation of other missing values. We evaluate MGP on one private clinical data set and four UCI datasets with different percentages of missing values. We compare the performance of MGP with other state-of-the-art methods for imputing missing values, including variants based on sparse GPs and deep GPs. The results obtained show a significantly better performance of MGP.  ( 2 min )
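MGP composes sparse GPs hierarchically and trains them jointly; a much simpler single-pass, per-dimension GP imputer conveys the basic idea (scikit-learn, with naive handling of rows that are missing in several columns):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def gp_impute(X):
    """Impute NaNs in each column from rows where that column is observed,
    using the remaining columns as inputs (single pass; MGP is joint)."""
    X = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        others = np.delete(X, j, axis=1)
        complete = ~np.isnan(others).any(axis=1)  # naive: need other cols observed
        train, pred = ~miss & complete, miss & complete
        if not train.any() or not pred.any():
            continue
        gp = GaussianProcessRegressor().fit(others[train], X[train, j])
        X[pred, j] = gp.predict(others[pred])
    return X
```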
    Defending against Reconstruction Attacks through Differentially Private Federated Learning for Classification of Heterogeneous Chest X-Ray Data. (arXiv:2205.03168v1 [cs.LG])
Privacy regulations and the physical distribution of heterogeneous data are often primary concerns for the development of deep learning models in a medical context. This paper evaluates the feasibility of differentially private federated learning for chest X-ray classification as a defense against privacy attacks on DenseNet121 and ResNet50 network architectures. We simulated a federated environment by distributing images from the public CheXpert and Mendeley chest X-ray datasets unevenly among 36 clients. Both non-private baseline models achieved an area under the ROC curve (AUC) of 0.94 on the binary classification task of detecting the presence of a medical finding. We demonstrate that both model architectures are vulnerable to privacy violation by applying image reconstruction attacks to local model updates from individual clients. The attack was particularly successful during later training stages. To mitigate the risk of privacy breach, we integrated Rényi differential privacy with a Gaussian noise mechanism into local model training. We evaluate model performance and attack vulnerability for privacy budgets $\epsilon \in \{1, 3, 6, 10\}$. The DenseNet121 achieved the best utility-privacy trade-off with an AUC of 0.94 for $\epsilon$ = 6. Model performance deteriorated slightly for individual clients compared to the non-private baseline. The ResNet50 only reached an AUC of 0.76 in the same privacy setting. Its performance was inferior to that of the DenseNet121 for all considered privacy constraints, suggesting that the DenseNet121 architecture is more robust to differentially private training.  ( 2 min )
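The privacy mechanism underneath is the usual per-sample clip-and-noise step applied during local training; a minimal sketch (the paper additionally tracks the budget with Rényi DP accounting):

```python
import torch

def dp_sgd_step(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1):
    """per_sample_grads: (batch, dim) flattened gradients, one row per sample."""
    norms = per_sample_grads.norm(dim=1, keepdim=True).clamp(min=1e-6)
    clipped = per_sample_grads * (clip_norm / norms).clamp(max=1.0)  # per-sample clip
    noise = torch.randn(per_sample_grads.shape[1]) * noise_multiplier * clip_norm
    return (clipped.sum(0) + noise) / len(per_sample_grads)          # noisy mean grad
```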
    Side-aware Meta-Learning for Cross-Dataset Listener Diagnosis with Subjective Tinnitus. (arXiv:2205.03231v1 [eess.SP])
    With the development of digital technology, machine learning has paved the way for the next generation of tinnitus diagnoses. Although machine learning has been widely applied in EEG-based tinnitus analysis, most current models are dataset-specific. Each dataset may be limited to a specific range of symptoms, overall disease severity, and demographic attributes; further, dataset formats may differ, impacting model performance. This paper proposes a side-aware meta-learning for cross-dataset tinnitus diagnosis, which can effectively classify tinnitus in subjects of divergent ages and genders from different data collection processes. Owing to the superiority of meta-learning, our method does not rely on large-scale datasets like conventional deep learning models. Moreover, we design a subject-specific training process to assist the model in fitting the data pattern of different patients or healthy people. Our method achieves a high accuracy of 73.8\% in the cross-dataset classification. We conduct an extensive analysis to show the effectiveness of side information of ears in enhancing model performance and side-aware meta-learning in improving the quality of the learned features.
    IMU Based Deep Stride Length Estimation With Self-Supervised Learning. (arXiv:2205.02977v1 [cs.LG])
Stride length estimation using inertial measurement unit (IMU) sensors has become popular recently, as stride length is one representative gait parameter for health care and sports training. Traditional estimation methods require explicit calibration and design assumptions, while current deep learning methods suffer from a shortage of labeled data. To solve these problems, this paper proposes a single convolutional neural network (CNN) model to predict the stride length of running and walking and to classify each stride as running or walking. The model trains its pretext task with self-supervised learning on a large unlabeled dataset for feature learning, and its downstream task on the stride length estimation and classification tasks with supervised learning on a small labeled dataset. The proposed model achieves a better average percent error on running and walking stride length regression (4.78\%, versus 7.44\% for the previous approach) and 99.83\% accuracy on running and walking classification.  ( 2 min )
    Symphony: Learning Realistic and Diverse Agents for Autonomous Driving Simulation. (arXiv:2205.03195v1 [cs.LG])
    Simulation is a crucial tool for accelerating the development of autonomous vehicles. Making simulation realistic requires models of the human road users who interact with such cars. Such models can be obtained by applying learning from demonstration (LfD) to trajectories observed by cars already on the road. However, existing LfD methods are typically insufficient, yielding policies that frequently collide or drive off the road. To address this problem, we propose Symphony, which greatly improves realism by combining conventional policies with a parallel beam search. The beam search refines these policies on the fly by pruning branches that are unfavourably evaluated by a discriminator. However, it can also harm diversity, i.e., how well the agents cover the entire distribution of realistic behaviour, as pruning can encourage mode collapse. Symphony addresses this issue with a hierarchical approach, factoring agent behaviour into goal generation and goal conditioning. The use of such goals ensures that agent diversity neither disappears during adversarial training nor is pruned away by the beam search. Experiments on both proprietary and open Waymo datasets confirm that Symphony agents learn more realistic and diverse behaviour than several baselines.  ( 2 min )
    Trainable Wavelet Neural Network for Non-Stationary Signals. (arXiv:2205.03355v1 [cs.LG])
    This work introduces a wavelet neural network to learn a filter-bank specialized to fit non-stationary signals and improve interpretability and performance for digital signal processing. The network uses a wavelet transform as the first layer of a neural network where the convolution is a parameterized function of the complex Morlet wavelet. Experimental results, on both simplified data and atmospheric gravity waves, show the network is quick to converge, generalizes well on noisy data, and outperforms standard network architectures.
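A sketch of such a layer: a 1D convolution whose kernels are generated on the fly from a real Morlet wavelet with a learnable scale and frequency per filter (kernel length and parameter ranges are assumptions; the paper parameterizes the complex Morlet):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MorletConv(nn.Module):
    """1D conv whose filters are parameterized real Morlet wavelets (sketch)."""
    def __init__(self, n_filters=16, kernel_size=129):
        super().__init__()
        self.scale = nn.Parameter(torch.linspace(1.0, 8.0, n_filters))  # envelope width
        self.freq = nn.Parameter(torch.linspace(0.5, 8.0, n_filters))   # oscillation
        self.register_buffer("t", torch.linspace(-1.0, 1.0, kernel_size))

    def forward(self, x):  # x: (batch, 1, time)
        t = self.t[None, :]                                        # (1, K)
        envelope = torch.exp(-(t * self.scale[:, None]) ** 2 / 2)  # Gaussian window
        kernels = envelope * torch.cos(2 * torch.pi * self.freq[:, None] * t)
        return F.conv1d(x, kernels[:, None, :], padding=self.t.numel() // 2)
```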
    Meta-Knowledge Transfer for Inductive Knowledge Graph Embedding. (arXiv:2110.14170v3 [cs.LG] UPDATED)
Knowledge graphs (KGs) consisting of a large number of triples have become widespread recently, and many knowledge graph embedding (KGE) methods have been proposed to embed entities and relations of a KG into continuous vector spaces. Such embedding methods simplify the operations of conducting various in-KG tasks (e.g., link prediction) and out-of-KG tasks (e.g., question answering). They can be viewed as general solutions for representing KGs. However, existing KGE methods are not applicable to inductive settings, where a model trained on source KGs will be tested on target KGs with entities unseen during model training. Existing works focusing on KGs in inductive settings can only solve the inductive relation prediction task. They cannot handle other out-of-KG tasks as generally as KGE methods do, since they do not produce embeddings for entities. In this paper, to achieve inductive knowledge graph embedding, we propose a model, MorsE, which does not learn embeddings for entities but instead learns transferable meta-knowledge that can be used to produce entity embeddings. Such meta-knowledge is modeled by entity-independent modules and learned by meta-learning. Experimental results show that our model significantly outperforms corresponding baselines for in-KG and out-of-KG tasks in inductive settings.  ( 2 min )
    Synthetic Data -- what, why and how?. (arXiv:2205.03257v1 [cs.LG])
    This explainer document aims to provide an overview of the current state of the rapidly expanding work on synthetic data technologies, with a particular focus on privacy. The article is intended for a non-technical audience, though some formal definitions have been given to provide clarity to specialists. This article is intended to enable the reader to quickly become familiar with the notion of synthetic data, as well as understand some of the subtle intricacies that come with it. We do believe that synthetic data is a very useful tool, and our hope is that this report highlights that, while drawing attention to nuances that can easily be overlooked in its deployment.  ( 2 min )
    Application of Clustering Algorithms for Dimensionality Reduction in Infrastructure Resilience Prediction Models. (arXiv:2205.03316v1 [cs.LG])
Recent studies increasingly adopt simulation-based machine learning (ML) models to analyze critical infrastructure system resilience. For realistic applications, these ML models consider the component-level characteristics that influence the network response during emergencies. However, such an approach could result in a large number of features and cause ML models to suffer from the 'curse of dimensionality'. We present a clustering-based method that simultaneously minimizes the problem of high-dimensionality and improves the prediction accuracy of ML models developed for resilience analysis in large-scale interdependent infrastructure networks. The methodology has three parts: (a) generation of simulation dataset, (b) network component clustering, and (c) dimensionality reduction and development of prediction models. First, an interdependent infrastructure simulation model simulates the network-wide consequences of various disruptive events. The component-level features are extracted from the simulated data. Next, clustering algorithms are used to derive the cluster-level features by grouping component-level features based on their topological and functional characteristics. Finally, ML algorithms are used to develop models that predict the network-wide impacts of disruptive events using the cluster-level features. The applicability of the method is demonstrated using an interdependent power-water-transport testbed. The proposed method can be used to develop decision-support tools for post-disaster recovery of infrastructure networks.
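Step (b) reduces to grouping components and aggregating their features; a minimal sketch with k-means (the feature choice and mean aggregation are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_level_features(component_features, component_states, n_clusters=10):
    """Group components by their topological/functional features, then build
    one aggregate input per cluster for the downstream ML model (sketch)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(component_features)
    # e.g. mean disruption state of each cluster's components, for one scenario
    return np.array([component_states[labels == c].mean() for c in range(n_clusters)])
```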
    Physics-informed neural networks for PDE-constrained optimization and control. (arXiv:2205.03377v1 [cs.LG])
A fundamental problem of science is designing optimal control policies that manipulate a given environment into producing a desired outcome. Control Physics-Informed Neural Networks simultaneously solve for a given system state and its respective optimal control in a one-stage framework that conforms to the physical laws of the system. Prior approaches use a two-stage framework that models and controls a system sequentially, whereas Control PINNs incorporate the required optimality conditions in their architecture and loss function. The success of Control PINNs is demonstrated by solving the following open-loop optimal control problems: (i) an analytical problem, (ii) a one-dimensional heat equation, and (iii) a two-dimensional predator-prey problem.  ( 2 min )
    Generative Adversarial Neural Operators. (arXiv:2205.03017v1 [cs.LG])
We propose the generative adversarial neural operator (GANO), a generative model paradigm for learning probabilities on infinite-dimensional function spaces. The natural sciences and engineering are known to have many types of data that are sampled from infinite-dimensional function spaces, where classical finite-dimensional deep generative adversarial networks (GANs) may not be directly applicable. GANO generalizes the GAN framework and allows for the sampling of functions by learning push-forward operator maps in infinite-dimensional spaces. GANO consists of two main components, a generator neural operator and a discriminator neural functional. The inputs to the generator are samples of functions from a user-specified probability measure, e.g., a Gaussian random field (GRF), and the generator outputs are synthetic data functions. The input to the discriminator is either a real or synthetic data function. In this work, we instantiate GANO using the Wasserstein criterion and show how the Wasserstein loss can be computed in infinite-dimensional spaces. We empirically study GANOs in controlled cases where both input and output functions are samples from GRFs and compare their performance to that of the finite-dimensional counterpart GAN. We empirically study the efficacy of GANO on real-world function data of volcanic activities and show its superior performance over GAN. Furthermore, we find that for the function-based data considered, GANOs are more stable to train than GANs and require less hyperparameter optimization.
    Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation. (arXiv:2205.01133v2 [cs.CL] UPDATED)
Multi-modal Machine Translation (MMT) enables the use of visual information to enhance the quality of translations. The visual information can serve as a valuable piece of context information to decrease the ambiguity of input sentences. Despite the increasing popularity of such a technique, good and sizeable datasets are scarce, limiting the full extent of their potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million indigenous speakers. This is more than any of the other Chadic languages. Despite the large number of speakers, the Hausa language is considered low-resource in natural language processing (NLP). This is due to the absence of sufficient resources to implement most NLP tasks. While some datasets exist, they are either scarce, machine-generated, or in the religious domain. Therefore, there is a need to create training and evaluation data for implementing machine learning tasks and bridging the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image or a section within the image in Hausa and its equivalent in English. To prepare the dataset, we started by automatically translating the English description of the images in the Hindi Visual Genome (HVG) into Hausa. Afterward, the synthetic Hausa data was carefully post-edited considering the respective images. The dataset comprises 32,923 images and their descriptions, divided into training, development, test, and challenge test sets. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.  ( 3 min )
    Atlas-powered deep learning (ADL) -- application to diffusion weighted MRI. (arXiv:2205.03210v1 [physics.med-ph])
    Deep learning has a great potential for estimating biomarkers in diffusion weighted magnetic resonance imaging (dMRI). Atlases, on the other hand, are a unique tool for modeling the spatio-temporal variability of biomarkers. In this paper, we propose the first framework to exploit both deep learning and atlases for biomarker estimation in dMRI. Our framework relies on non-linear diffusion tensor registration to compute biomarker atlases and to estimate atlas reliability maps. We also use nonlinear tensor registration to align the atlas to a subject and to estimate the error of this alignment. We use the biomarker atlas, atlas reliability map, and alignment error map, in addition to the dMRI signal, as inputs to a deep learning model for biomarker estimation. We use our framework to estimate fractional anisotropy and neurite orientation dispersion from down-sampled dMRI data on a test cohort of 70 newborn subjects. Results show that our method significantly outperforms standard estimation methods as well as recent deep learning techniques. Our method is also more robust to stronger measurement down-sampling factors. Our study shows that the advantages of deep learning and atlases can be synergistically combined to achieve unprecedented accuracy in biomarker estimation from dMRI data.
    Efficient Minimax Optimal Estimators For Multivariate Convex Regression. (arXiv:2205.03368v1 [math.ST])
We study the computational aspects of the task of multivariate convex regression in dimension $d \geq 5$. We present the first computationally efficient minimax optimal (up to logarithmic factors) estimators for the tasks of (i) $L$-Lipschitz convex regression and (ii) $\Gamma$-bounded convex regression under polytopal support. The proof of the correctness of these estimators uses a variety of tools from different disciplines, among them empirical process theory, stochastic geometry, and potential theory. This work is the first to show the existence of efficient minimax optimal estimators for non-Donsker classes whose corresponding Least Squares Estimators are provably minimax sub-optimal; a result of independent interest.  ( 2 min )
    Optimal Control as Variational Inference. (arXiv:2205.03279v1 [cs.LG])
In this article we address the stochastic and risk-sensitive optimal control problem probabilistically, decomposing and solving the probabilistic models using principles from variational inference. We demonstrate how this culminates in two separate probabilistic inference procedures that allow us to iteratively infer the deterministic optimal policy. More formally, a sequence of belief policies, as a probabilistic proxy for the deterministic optimal policy, is specified through a fixed point iteration, with the equilibrium point coinciding with the deterministic solution. These results re-establish the paradigm of Control as Inference, a concept explored and exploited originally by the Reinforcement Learning community in anticipation of deep-rooted connections between optimal estimation and control. Although the Control as Inference paradigm has already resulted in the development of several Reinforcement Learning algorithms, until now the underlying mechanisms were only partially understood. For that very reason, Control as Inference has not been well received by the control community. By exposing the underlying mechanism, we aim to contribute to its general acceptance as a framework superseding optimal control. In order to exhibit its general relevance, we discuss parallels with path integral control and a wide range of possible applications.  ( 2 min )
    A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition. (arXiv:2205.03027v1 [cs.LG])
    Despite the success of deep learning in speech recognition, multi-dialect speech recognition remains a difficult problem. Although dialect-specific acoustic models are known to perform well in general, they are not easy to maintain when dialect-specific data is scarce and the number of dialects for each language is large. Therefore, a single unified acoustic model (AM) that generalizes well for many dialects has been in demand. In this paper, we propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM. Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously. We also propose a simple but effective training method to deal with unseen dialects. The experimental results on large scale speech datasets show that the proposed AM outperforms all the previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.  ( 2 min )
    Predicting Chemical Hazard across Taxa through Machine Learning. (arXiv:2110.03688v3 [q-bio.QM] UPDATED)
We applied machine learning methods to predict chemical hazards focusing on fish acute toxicity across taxa. We analyzed the relevance of taxonomy and experimental setup, showing that taking them into account can lead to considerable improvements in classification performance. We quantified the gain obtained through the introduction of taxonomic and experimental information, compared to classification based on chemical information alone. We used our approach with standard machine learning models (K-nearest neighbors, random forests, and deep neural networks), as well as the recently proposed Read-Across Structure Activity Relationship (RASAR) models, which were very successful in predicting chemical hazards to mammals based on chemical similarity. We were able to obtain accuracies of over 93% on datasets where, due to noise in the data, the maximum achievable accuracy was expected to be below 96%. The best performances were obtained by random forests and RASAR models. We analyzed metrics to compare our results with animal test reproducibility, and although most of our models "outperform animal test reproducibility" as measured through recently proposed metrics, we showed that the comparison between machine learning performance and animal test reproducibility should be addressed with particular care. While we focused on fish mortality, our approach, provided that the right data is available, is valid for any combination of chemicals, effects, and taxa.  ( 2 min )
    Altering backward pass gradients improves convergence. (arXiv:2111.12495v2 [cs.LG] UPDATED)
In typical neural network training, the gradients in the backward pass are determined by the forward pass; as a result, the two stages are coupled. However, it is often seen that neural networks perform worse when gradients explode or vanish. To address this, numerous approaches like Gradient Clipping (GC) and Adaptive Gradient Clipping (AGC) have been developed to enhance the gradient behaviour of networks without normalization layers during backward passes. These techniques decouple the backward and forward passes and modify the gradients adaptively. A possible drawback of clipping approaches is that they must be calculated for each weight tensor in each layer. We offer the PowerGrad Transform (PGT), a comparable approach that alters and enhances the gradient flow behaviour in the backward pass but is calculated only in the final softmax layer. It is very computationally efficient and outperforms both GC and AGC, resulting in improved performance in networks without batch normalization. PGT is easy to integrate into existing networks, requiring just a few lines of code, and significantly increases performance in non-BN ResNets. The impact is more pronounced on big datasets such as ImageNet, where networks do not fit all of the training data and there is some training headroom. PGT makes it possible for the network to better fit the training data while simultaneously improving its performance on the test set.  ( 2 min )
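The abstract does not spell out PGT itself; for contrast, here is a simplified tensor-wise sketch of the adaptive gradient clipping (AGC) baseline it is compared against, which caps each gradient's norm relative to the corresponding weight's norm (the original AGC operates unit-wise):

```python
import torch

def adaptive_grad_clip(params, clip=0.01, eps=1e-3):
    """AGC-style clipping: cap each grad norm at clip * parameter norm (sketch)."""
    for p in params:
        if p.grad is None:
            continue
        p_norm = p.detach().norm().clamp(min=eps)
        g_norm = p.grad.detach().norm()
        if g_norm > clip * p_norm:
            p.grad.mul_(clip * p_norm / g_norm)  # rescale in place
```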
    Longitudinal cardio-respiratory fitness prediction through free-living wearable sensors. (arXiv:2205.03116v1 [cs.LG])
Cardiorespiratory fitness is an established predictor of metabolic disease and mortality. Fitness is directly measured as maximal oxygen consumption (VO2max), or indirectly assessed using heart rate response to a standard exercise test. However, such testing is costly and burdensome, limiting its utility and scalability. Fitness can also be approximated using resting heart rate and self-reported exercise habits, but with lower accuracy. Modern wearables capture dynamic heart rate data which, in combination with machine learning models, could improve fitness prediction. In this work, we analyze movement and heart rate signals from wearable sensors in free-living conditions from 11,059 participants who also underwent a standard exercise test, along with a longitudinal repeat cohort of 2,675 participants. We design algorithms and models that convert raw sensor data into cardio-respiratory fitness estimates, and validate these estimates' ability to capture fitness profiles in a longitudinal cohort over time while subjects engaged in real-world (non-exercise) behaviour. Additionally, we validate our methods with a third external cohort of 181 participants who underwent maximal VO2max testing, which is considered the gold standard measurement because it requires reaching one's maximum heart rate and exhaustion level. Our results show that the developed models yield a high correlation (r = 0.82, 95% CI 0.80-0.83) when compared to the ground truth in a holdout sample. These models outperform conventional non-exercise fitness models and traditional biomarkers using measurements of normal daily living, without the need for a specific exercise test. Additionally, we show the adaptability and applicability of this approach for detecting fitness change over time in the longitudinal subsample that repeated the measurements after 7 years.  ( 2 min )
    Offense Detection in Dravidian Languages using Code-Mixing Index based Focal Loss. (arXiv:2111.06916v2 [cs.CL] UPDATED)
    Over the past decade, we have seen exponential growth in online content fueled by social media platforms. Data generation of this scale comes with the caveat of insurmountable offensive content in it. The complexity of identifying offensive content is exacerbated by the usage of multiple modalities (image, language, etc.), code-mixed language and more. Moreover, even after careful sampling and annotation of offensive content, there will always exist a significant class imbalance between offensive and non-offensive content. In this paper, we introduce a novel Code-Mixing Index (CMI) based focal loss which circumvents two challenges (1) code-mixing in languages (2) class imbalance problem for Dravidian language offense detection. We also replace the conventional dot product-based classifier with the cosine-based classifier which results in a boost in performance. Further, we use multilingual models that help transfer characteristics learnt across languages to work effectively with low resourced languages. It is also important to note that our model handles instances of mixed script (say usage of Latin and Dravidian-Tamil script) as well. To summarize, our model can handle offensive language detection in a low-resource, class imbalanced, multilingual and code-mixed setting.  ( 2 min )
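The backbone of the proposed loss is the standard focal loss with a per-sample weight; a minimal sketch (here the weights are assumed given, whereas the paper derives them from each utterance's Code-Mixing Index):

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, targets, sample_weights, gamma=2.0):
    """Focal loss with per-sample weights standing in for the CMI term (sketch)."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # model probability of the true class
    return (sample_weights * (1.0 - p_t) ** gamma * ce).mean()
```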
    SKILL-IL: Disentangling Skill and Knowledge in Multitask Imitation Learning. (arXiv:2205.03130v1 [cs.LG])
In this work, we introduce a new perspective for learning transferable content in multi-task imitation learning. Humans are able to transfer skills and knowledge: if we can cycle to work and drive to the store, we can also cycle to the store and drive to work. We take inspiration from this and hypothesize that the latent memory of a policy network can be disentangled into two partitions, containing either the knowledge of the environmental context for the task or the generalizable skill needed to solve the task. This allows improved training efficiency and better generalization over previously unseen combinations of skills in the same environment, and over the same task in unseen environments. We used the proposed approach to train a disentangled agent for two different multi-task IL environments. In both cases we outperformed the SOTA by 30% in task success rate. We also demonstrated this for navigation on a real robot.  ( 2 min )
    Echocardiography Segmentation with Enforced Temporal Consistency. (arXiv:2112.02102v2 [eess.IV] UPDATED)
Convolutional neural networks (CNN) have demonstrated their ability to segment 2D cardiac ultrasound images. However, despite recent successes in which intra-observer variability on end-diastole and end-systole images has been matched, CNNs still struggle to leverage temporal information to provide accurate and temporally consistent segmentation maps across the whole cycle. Such consistency is required to accurately describe the cardiac function, a necessary step in diagnosing many cardiovascular diseases. In this paper, we propose a framework to learn the 2D+time apical long-axis cardiac shape such that the segmented sequences can benefit from temporal and anatomical consistency constraints. Our method is a post-processing step that takes as input segmented echocardiographic sequences produced by any state-of-the-art method and processes them in two steps to (i) identify spatio-temporal inconsistencies according to the overall dynamics of the cardiac sequence and (ii) correct the inconsistencies. The identification and correction of cardiac inconsistencies relies on a constrained autoencoder trained to learn a physiologically interpretable embedding of cardiac shapes, where we can both detect and fix anomalies. We tested our framework on 98 full-cycle sequences from the CAMUS dataset, which are available alongside this paper. Our temporal regularization method not only improves the accuracy of the segmentation across the whole sequences, but also enforces temporal and anatomical consistency.  ( 2 min )
    Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization. (arXiv:2205.02976v1 [cs.LG])
We extend the idea underlying the success of green simulation assisted policy gradient (GS-PG) to partial historical trajectory reuse for infinite-horizon Markov Decision Processes (MDP). The existing GS-PG method was designed to learn from complete episodes or process trajectories, which limits its applicability to low-data environments and online process control. In this paper, mixture likelihood ratio (MLR) based policy gradient estimation is used to leverage the information from historical state-decision transitions generated under different behavioral policies. We propose a variance reduction experience replay (VRER) approach that can intelligently select and reuse the most relevant transition observations, improve the policy gradient estimation accuracy, and accelerate the learning of the optimal policy. Then we create a process control strategy by incorporating VRER with state-of-the-art step-based policy optimization approaches such as the actor-critic method and proximal policy optimization. The empirical study demonstrates that the proposed policy gradient methodology can significantly outperform existing policy optimization approaches.  ( 2 min )
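The mixture likelihood ratio at the core of this is simple to state: weight each reused transition by the current policy's density over the average density of the historical behavior policies that generated it. A minimal sketch (log-space computation is an implementation choice):

```python
import numpy as np
from scipy.special import logsumexp

def mlr_weights(logp_target, logp_behaviors):
    """Mixture likelihood ratio weights for reused transitions (sketch).

    logp_target:    (n,) log-density under the current policy
    logp_behaviors: (K, n) log-densities under the K behavior policies
    """
    K = logp_behaviors.shape[0]
    log_mix = logsumexp(logp_behaviors, axis=0) - np.log(K)  # log mean density
    return np.exp(logp_target - log_mix)
```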
    Transferring Adversarial Robustness Through Robust Representation Matching. (arXiv:2202.09994v2 [cs.LG] UPDATED)
    With the widespread use of machine learning, concerns over its security and reliability have become prevalent. As such, many have developed defenses to harden neural networks against adversarial examples, imperceptibly perturbed inputs that are reliably misclassified. Adversarial training in which adversarial examples are generated and used during training is one of the few known defenses able to reliably withstand such attacks against neural networks. However, adversarial training imposes a significant training overhead and scales poorly with model complexity and input dimension. In this paper, we propose Robust Representation Matching (RRM), a low-cost method to transfer the robustness of an adversarially trained model to a new model being trained for the same task irrespective of architectural differences. Inspired by student-teacher learning, our method introduces a novel training loss that encourages the student to learn the teacher's robust representations. Compared to prior works, RRM is superior with respect to both model performance and adversarial training time. On CIFAR-10, RRM trains a robust model $\sim 1.8\times$ faster than the state-of-the-art. Furthermore, RRM remains effective on higher-dimensional datasets. On Restricted-ImageNet, RRM trains a ResNet50 model $\sim 18\times$ faster than standard adversarial training.  ( 2 min )
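The training objective is a task loss plus a term pulling the student's features toward the robust teacher's; a minimal sketch in that spirit (the paper's exact matching term may differ):

```python
import torch.nn.functional as F

def rrm_style_loss(student_feats, teacher_feats, student_logits, targets, lam=1.0):
    """Student-teacher loss in the spirit of RRM (sketch)."""
    task = F.cross_entropy(student_logits, targets)
    match = F.mse_loss(student_feats, teacher_feats.detach())  # match robust features
    return task + lam * match
```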
    Bayesian Sample Size Prediction for Online Activity. (arXiv:2111.12157v2 [stat.ML] UPDATED)
    In many contexts it is useful to predict the number of individuals in some population who will initiate a particular activity during a given period. Examples include the number of users who will install a software update, the number of customers who will use a new feature on a website, or the number who will participate in an A/B test. In practical settings, there is heterogeneity amongst individuals with regard to the distribution of time until they will initiate. For this reason it is inappropriate to assume that the number of new individuals observed on successive days will be identically distributed. Given observations on the number of unique users participating in an initial period, we present a simple but novel Bayesian method for predicting the number of additional individuals who will participate during a subsequent period. We illustrate the performance of the method in predicting sample size in online experimentation.  ( 2 min )
    Journaling Data for Daily PHQ-2 Depression Prediction and Forecasting. (arXiv:2205.03391v1 [cs.LG])
    Digital health applications are becoming increasingly important for assessing and monitoring the wellbeing of people suffering from mental health conditions like depression. A common target of said applications is to predict the results of self-assessed Patient Health Questionnaires (PHQ), indicating the current symptom severity of depressive individuals. In this work, we explore the potential of using actively collected data to predict and forecast daily PHQ-2 scores on a newly collected longitudinal dataset. Using leave-one-subject-out cross-validation, we obtain a best MAE of 1.417 for daily prediction of PHQ-2 scores, which in this dataset range from 0 to 12, as well as a best MAE of 1.914 for forecasting PHQ-2 scores using data from up to the last 7 days. This illustrates the additive value that can be obtained by incorporating actively collected data in a depression monitoring application.  ( 2 min )
    Differentially Private Generalized Linear Models Revisited. (arXiv:2205.03014v1 [cs.LG])
    We study the problem of $(\epsilon,\delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^* \Vert^2}{(n\epsilon)^{2/3}},\frac{\sqrt{d}\Vert w^*\Vert^2}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\Vert w^\ast\Vert$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + {\min\left\{\frac{\Vert w^*\Vert^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert}{n\epsilon}\right\}}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT20]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert}{\sqrt{n\epsilon}},\frac{\sqrt{\text{rank}}\Vert w^*\Vert}{n\epsilon}\right\}\right)$, where $\text{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\Vert w^*\Vert$.  ( 2 min )
    Multichannel Synthetic Preictal EEG Signals to Enhance the Prediction of Epileptic Seizures. (arXiv:2205.03239v1 [eess.SP])
    Epilepsy is a chronic neurological disorder affecting 1\% of people worldwide. Deep learning (DL)-based electroencephalogram (EEG) analysis offers the possibility of accurate epileptic seizure (ES) prediction, thereby benefiting patients suffering from epilepsy. To identify the preictal region that precedes the onset of a seizure, a large number of annotated EEG signals are required to train DL algorithms. However, the scarcity of seizure onsets leads to a significant insufficiency of data for training the DL algorithms. To overcome this data insufficiency, in this paper, we propose a preictal artificial signal synthesis algorithm based on a generative adversarial network to generate synthetic multichannel EEG preictal samples. A high-quality single-channel architecture, determined by visual and statistical evaluations, is used to train the generators of multichannel samples. The effectiveness of the synthetic samples is evaluated by comparing ES prediction performance with and without synthetic preictal sample augmentation. With 10$\times$ synthetic sample augmentation, the leave-one-seizure-out cross-validation ES prediction accuracy and the corresponding area under the receiver operating characteristic curve improve from 73.0\% and 0.676 to 78.0\% and 0.704, respectively. The obtained results indicate that synthetic preictal samples are effective for enhancing ES prediction performance.  ( 2 min )
    Over-the-Air Federated Multi-Task Learning Over MIMO Multiple Access Channels. (arXiv:2112.13603v2 [eess.SP] UPDATED)
    With the explosive growth of data and wireless devices, federated learning (FL) over wireless medium has emerged as a promising technology for large-scale distributed intelligent systems. Yet, the urgent demand for ubiquitous intelligence will generate a large number of concurrent FL tasks, which may seriously aggravate the scarcity of communication resources. By exploiting the analog superposition of electromagnetic waves, over-the-air computation (AirComp) is an appealing solution to alleviate the burden of communication required by FL. However, sharing frequency-time resources in over-the-air computation inevitably brings about the problem of inter-task interference, which poses a new challenge that needs to be appropriately addressed. In this paper, we study over-the-air federated multi-task learning (OA-FMTL) over the multiple-input multiple-output (MIMO) multiple access (MAC) channel. We propose a novel model aggregation method for the alignment of local gradients of different devices, which alleviates the straggler problem in over-the-air computation due to the channel heterogeneity. We establish a communication-learning analysis framework for the proposed OA-FMTL scheme by considering the spatial correlation between devices, and formulate an optimization problem for the design of transceiver beamforming and device selection. To solve this problem, we develop an algorithm by using alternating optimization (AO) and fractional programming (FP), which effectively mitigates the impact of inter-task interference on the FL learning performance. We show that due to the use of the new model aggregation method, device selection is no longer essential, thereby avoiding the heavy computational burden involved in selecting active devices. Numerical results demonstrate the validity of the analysis and the superb performance of the proposed scheme.  ( 2 min )
    Transferring Chemical and Energetic Knowledge Between Molecular Systems with Machine Learning. (arXiv:2205.03339v1 [physics.chem-ph])
    Predicting the structural and energetic properties of a molecular system is one of the fundamental tasks in molecular simulations, with use cases in chemistry, biology, and medicine. In the past decade, the advent of machine learning algorithms has impacted molecular simulations for various tasks, including property prediction of atomistic systems. In this paper, we propose a novel methodology for transferring knowledge obtained from simple molecular systems to a more complex one, possessing a significantly larger number of atoms and degrees of freedom. In particular, we focus on the classification of high and low free-energy states. Our approach relies on (i) a novel hypergraph representation of molecules, encoding all relevant information for characterizing the potential energy of a conformation, and (ii) novel message passing and pooling layers for processing and making predictions on such hypergraph-structured data. Despite the complexity of the problem, our results show a remarkable AUC of 0.92 for transfer learning from tri-alanine to the deca-alanine system. Moreover, we show that the very same transfer learning approach can be used to group, in an unsupervised way, various secondary structures of deca-alanine into clusters having similar free-energy values. Our study represents a proof of concept that reliable transfer learning models for molecular systems can be designed, paving the way to unexplored routes in the prediction of structural and energetic properties of biologically relevant systems.  ( 2 min )
    Optimally tackling covariate shift in RKHS-based nonparametric regression. (arXiv:2205.02986v1 [math.ST])
    We study the covariate shift problem in the context of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We focus on two natural families of covariate shift problems defined using the likelihood ratios between the source and target distributions. When the likelihood ratios are uniformly bounded, we prove that the kernel ridge regression (KRR) estimator with a carefully chosen regularization parameter is minimax rate-optimal (up to a log factor) for a large family of RKHSs with regular kernel eigenvalues. Interestingly, KRR does not require full knowledge of the likelihood ratio apart from an upper bound on it. In striking contrast to the standard statistical setting without covariate shift, we also demonstrate that a naïve estimator, which minimizes the empirical risk over the function class, is strictly suboptimal under covariate shift as compared to KRR. We then address the larger class of covariate shift problems where the likelihood ratio is possibly unbounded yet has a finite second moment. Here, we show via careful simulations that KRR fails to attain the optimal rate. Instead, we propose a reweighted KRR estimator that weights samples based on a careful truncation of the likelihood ratios. Again, we are able to show that this estimator is minimax optimal, up to logarithmic factors.  ( 2 min )
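    A small sketch of the reweighted estimator (Python with a Gaussian kernel; the truncation level and bandwidth here are illustrative choices, not the paper's tuned ones):

        import numpy as np

        def gaussian_kernel(X, Y, bw=1.0):
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * bw ** 2))

        def reweighted_krr(X, y, likelihood_ratio, lam=1e-2, trunc=10.0, bw=1.0):
            # Truncate the (possibly heavy-tailed) ratios before using them
            # as sample weights.
            w = np.minimum(likelihood_ratio, trunc)
            K = gaussian_kernel(X, X, bw)
            # First-order condition of the weighted kernel ridge objective
            # sum_i w_i (y_i - f(x_i))^2 + lam * ||f||^2.
            alpha = np.linalg.solve(w[:, None] * K + lam * np.eye(len(y)), w * y)
            return lambda X_new: gaussian_kernel(X_new, X, bw) @ alpha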
    Benchmarking Econometric and Machine Learning Methodologies in Nowcasting. (arXiv:2205.03318v1 [stat.ML])
    Nowcasting can play a key role in giving policymakers timelier insight into data published with a significant time lag, such as final GDP figures. Currently, there is a plethora of methodologies and approaches for practitioners to choose from. However, a comprehensive comparison of these disparate approaches in terms of predictive performance and characteristics has been lacking. This paper addresses that deficiency by examining the performance of 12 different methodologies in nowcasting US quarterly GDP growth, including all the methods most commonly employed in nowcasting, as well as some of the most popular traditional machine learning approaches. Performance was assessed on three different tumultuous periods in US economic history: the early 1980s recession, the 2008 financial crisis, and the COVID crisis. The two best performing methodologies in the analysis were long short-term memory artificial neural networks (LSTM) and Bayesian vector autoregression (BVAR). To facilitate further application and testing of each of the examined methodologies, an open-source repository containing boilerplate code that can be applied to different datasets is published alongside the paper, available at: github.com/dhopp1/nowcasting_benchmark.
    Perseus: A Simple High-Order Regularization Method for Variational Inequalities. (arXiv:2205.03202v1 [math.OC])
    This paper settles an open and challenging question pertaining to the design of simple high-order regularization methods for solving smooth and monotone variational inequalities (VIs). A VI involves finding $x^\star \in \mathcal{X}$ such that $\langle F(x), x - x^\star\rangle \geq 0$ for all $x \in \mathcal{X}$, and we consider the setting where $F: \mathbb{R}^d \mapsto \mathbb{R}^d$ is smooth with up to $(p-1)^{th}$-order derivatives. For the case of $p = 2$, Nesterov (2006) extended the cubic regularized Newton method to VIs with a global rate of $O(\epsilon^{-1})$. Monteiro and Svaiter (2012) proposed another second-order method which achieved an improved rate of $O(\epsilon^{-2/3}\log(1/\epsilon))$, but this method required a nontrivial binary search procedure as an inner loop. High-order methods based on similar binary search procedures have been further developed and shown to achieve a rate of $O(\epsilon^{-2/(p+1)}\log(1/\epsilon))$. However, such search procedures can be computationally prohibitive in practice, and finding a simple high-order regularization method has remained an open and challenging question in optimization theory. We propose a $p^{th}$-order method which does \textit{not} require any binary search scheme and is guaranteed to converge to a weak solution with a global rate of $O(\epsilon^{-2/(p+1)})$. A version with restarting attains a global linear and local superlinear convergence rate for smooth and strongly monotone VIs. Further, our method achieves a global rate of $O(\epsilon^{-2/p})$ for solving smooth and non-monotone VIs satisfying the Minty condition; moreover, the restarted version again attains a global linear and local superlinear convergence rate if the strong Minty condition holds.
    Out-of-Distribution Detection for Medical Applications: Guidelines for Practical Evaluation. (arXiv:2109.14885v2 [cs.LG] UPDATED)
    Detection of Out-of-Distribution (OOD) samples in real time is a crucial safety check for deployment of machine learning models in the medical field. Despite a growing number of uncertainty quantification techniques, there is a lack of evaluation guidelines on how to select OOD detection methods in practice. This gap impedes implementation of OOD detection methods for real-world applications. Here, we propose a series of practical considerations and tests to choose the best OOD detector for a specific medical dataset. These guidelines are illustrated on a real-life use case of Electronic Health Records (EHR). Our results can serve as a guide for implementation of OOD detection methods in clinical practice, mitigating risks associated with the use of machine learning models in healthcare.
    Tensor Principal Component Analysis in High Dimensional CP Models. (arXiv:2108.04428v3 [stat.ML] UPDATED)
    The CP decomposition for high dimensional non-orthogonal spiked tensors is an important problem with broad applications across many disciplines. However, previous works with theoretical guarantees typically assume restrictive incoherence conditions on the basis vectors for the CP components. In this paper, we propose new computationally efficient composite PCA and concurrent orthogonalization algorithms for tensor CP decomposition with theoretical guarantees under mild incoherence conditions. The composite PCA applies the principal component or singular value decompositions twice, first to a matrix unfolding of the tensor data to obtain singular vectors and then to the matrix folding of the singular vectors obtained in the first step. It can be used as an initialization for any iterative optimization scheme for the tensor CP decomposition. The concurrent orthogonalization algorithm iteratively estimates the basis vector in each mode of the tensor by simultaneously applying projections to the orthogonal complements of the spaces generated by the other CP components in the other modes. It is designed to improve the alternating least squares estimator and other forms of the high order orthogonal iteration for tensors with low or moderately high CP ranks, and it is guaranteed to converge rapidly when the error of any given initial estimator is bounded by a small constant. Our theoretical investigation provides estimation accuracy and convergence rates for the two proposed algorithms. Both proposed algorithms are applicable to a deterministic tensor, its noisy version, and the order-$2K$ covariance tensor of order-$K$ tensor data in a factor model with uncorrelated factors. Our implementations on synthetic data demonstrate the significant practical superiority of our approach over existing methods.
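    For intuition, a minimal Python sketch of the composite PCA step for an order-3 tensor of CP rank r (an illustrative special case; the paper covers general orders and gives the precise guarantees):

        import numpy as np

        def composite_pca(T, r):
            n1, n2, n3 = T.shape
            # Step 1: SVD of the mode-1 unfolding.
            U, s, Vt = np.linalg.svd(T.reshape(n1, n2 * n3), full_matrices=False)
            A = U[:, :r]                          # mode-1 loadings
            B, C = np.zeros((n2, r)), np.zeros((n3, r))
            for k in range(r):
                # Step 2: fold each leading right singular vector back into a
                # matrix and SVD again to split modes 2 and 3.
                M = Vt[k].reshape(n2, n3)
                u2, s2, v2t = np.linalg.svd(M)
                B[:, k], C[:, k] = u2[:, 0], v2t[0]
            return A, B, C                        # initialization for CP iterations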
    Transformer Embeddings of Irregularly Spaced Events and Their Participants. (arXiv:2201.00044v3 [cs.LG] UPDATED)
    The neural Hawkes process (Mei & Eisner, 2017) is a generative model of irregularly spaced sequences of discrete events. To handle complex domains with many event types, Mei et al. (2020a) further consider a setting in which each event in the sequence updates a deductive database of facts (via domain-specific pattern-matching rules); future events are then conditioned on the database contents. They show how to convert such a symbolic system into a neuro-symbolic continuous-time generative model, in which each database fact and possible event has a time-varying embedding that is derived from its symbolic provenance. In this paper, we modify both models, replacing their recurrent LSTM-based architectures with flatter attention-based architectures (Vaswani et al., 2017), which are simpler and more parallelizable. This does not appear to hurt our accuracy, which is comparable to or better than that of the original models as well as (where applicable) previous attention-based methods (Zuo et al., 2020; Zhang et al., 2020a).
    Ultra-sensitive Flexible Sponge-Sensor Array for Muscle Activities Detection and Human Limb Motion Recognition. (arXiv:2205.03238v1 [eess.SP])
    Human limb motion tracking and recognition plays an important role in medical rehabilitation training, lower limb assistance, prosthetics design for amputees, feedback control for assistive robots, etc. Lightweight wearable sensors, including inertial sensors, surface electromyography sensors, and flexible strain/pressure sensors, are promising candidates for the next generation of human motion capture devices. Herein, we present a wireless wearable device consisting of a sixteen-channel flexible sponge-based pressure sensor array that recognizes various human lower limb motions by detecting contours on the human skin caused by calf gastrocnemius muscle actions. Each sensing element is a round porous structure of thin carbon nanotube/polydimethylsiloxane nanocomposites with a diameter of 4 mm and a thickness of about 400 {\mu}m. Three human subjects were recruited to perform ten different lower limb motions while wearing the developed device. The motion classification result with the support vector machine method shows a macro-recall of about 94.48% for all ten motions tested. This work demonstrates a portable wearable muscle activity detection device with a lower limb motion recognition application, which can potentially be used in assistive robot control, healthcare, sports monitoring, etc.
    R-GCN: The R Could Stand for Random. (arXiv:2203.02424v2 [cs.LG] UPDATED)
    The inception of the Relational Graph Convolutional Network (R-GCN) marked a milestone in the Semantic Web domain as a widely cited method that generalises end-to-end hierarchical representation learning to Knowledge Graphs (KGs). R-GCNs generate representations for nodes of interest by repeatedly aggregating parameterised, relation-specific transformations of their neighbours. However, in this paper, we argue that the R-GCN's main contribution lies in this "message passing" paradigm, rather than the learned weights. To this end, we introduce the "Random Relational Graph Convolutional Network" (RR-GCN), which leaves all parameters untrained and thus constructs node embeddings by aggregating randomly transformed random representations from neighbours, i.e., with no learned parameters. We empirically show that RR-GCNs can compete with fully trained R-GCNs in both node classification and link prediction settings.
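    The construction is simple enough to sketch in a few lines of Python (illustrative: normalization, self-loops and basis decomposition are omitted here):

        import numpy as np

        def rr_gcn_layer(H, adj_per_relation, d_out, rng=np.random.default_rng(0)):
            # One untrained R-GCN-style layer: neighbours are transformed by a
            # fixed random relation-specific projection and summed.
            n, d_in = H.shape
            out = np.zeros((n, d_out))
            for A in adj_per_relation:            # one (n, n) adjacency per relation
                W_r = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
                out += A @ H @ W_r
            return np.tanh(out)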
    MPAF: Model Poisoning Attacks to Federated Learning based on Fake Clients. (arXiv:2203.08669v2 [cs.CR] UPDATED)
    Existing model poisoning attacks on federated learning assume that an attacker has access to a large fraction of compromised genuine clients. However, such an assumption is not realistic in production federated learning systems that involve millions of clients. In this work, we propose the first Model Poisoning Attack based on Fake clients, called MPAF. Specifically, we assume the attacker injects fake clients into a federated learning system and sends carefully crafted fake local model updates to the cloud server during training, such that the learnt global model has low accuracy for many indiscriminate test inputs. Towards this goal, our attack drags the global model towards an attacker-chosen base model that has low accuracy. Specifically, in each round of federated learning, the fake clients craft fake local model updates that point to the base model and scale them up to amplify their impact before sending them to the cloud server. Our experiments show that MPAF can significantly decrease the test accuracy of the global model, even if classical defenses and norm clipping are adopted, highlighting the need for more advanced defenses.
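    The update-crafting step reduces to a one-liner per fake client (a Python sketch; the magnification factor is an illustrative stand-in, and the parameters can be any numeric tensors or arrays):

        def mpaf_fake_update(global_params, base_params, scale=1e6):
            # Point from the current global model toward the attacker-chosen
            # low-accuracy base model, scaled up to amplify the effect.
            return [scale * (b - g) for g, b in zip(global_params, base_params)]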
    Let's Go to the Alien Zoo: Introducing an Experimental Framework to Study Usability of Counterfactual Explanations for Machine Learning. (arXiv:2205.03398v1 [cs.HC])
    To foster usefulness and accountability of machine learning (ML), it is essential to explain a model's decisions in addition to evaluating its performance. Accordingly, the field of explainable artificial intelligence (XAI) has resurfaced as a topic of active research, offering approaches to address the "how" and "why" of automated decision-making. Within this domain, counterfactual explanations (CFEs) have gained considerable traction as a psychologically grounded approach to generate post-hoc explanations. To do so, CFEs highlight what changes to a model's input would have changed its prediction in a particular way. However, despite the introduction of numerous CFE approaches, their usability has yet to be thoroughly validated at the human level. Thus, to advance the field of XAI, we introduce the Alien Zoo, an engaging, web-based and game-inspired experimental framework. The Alien Zoo provides the means to evaluate usability of CFEs for gaining new knowledge from an automated system, targeting novice users in a domain-general context. As a proof of concept, we demonstrate the practical efficacy and feasibility of this approach in a user study. Our results suggest that users benefit from receiving CFEs compared to no explanation, both in terms of objective performance in the proposed iterative learning task, and subjective usability. With this work, we aim to equip research groups and practitioners with the means to easily run controlled and well-powered user studies to complement their otherwise often more technology-oriented work. Thus, in the interest of reproducible research, we provide the entire code, together with the underlying models and user data.
    Vehicle management in a modular production context using Deep Q-Learning. (arXiv:2205.03294v1 [cs.LG])
    We investigate the feasibility of deploying Deep-Q based deep reinforcement learning agents to job-shop scheduling problems in the context of modular production facilities, using discrete event simulations for the environment. These environments consist of a source and a sink for the parts to be processed, as well as (several) workstations. The agents are trained to schedule automated guided vehicles to transport the parts back and forth between those stations in an optimal fashion. Starting from a very simplistic setup, we increase the complexity of the environment and compare the agents' performance with well-established heuristic approaches, such as first-in-first-out based agents, cost tables, and a nearest-neighbor approach. We furthermore seek particular configurations of the environments in which the heuristic approaches struggle, to investigate to what degree the Deep-Q agents are affected by these challenges. We find that Deep-Q based agents show performance comparable to the heuristic baselines. Furthermore, our findings suggest that the DRL agents exhibit an increased robustness to noise compared to the conventional approaches. Overall, we find that DRL agents constitute a valuable approach for this type of scheduling problem.
    Fundamental Performance Limits for Sensor-Based Robot Control and Policy Learning. (arXiv:2202.00129v2 [cs.RO] UPDATED)
    Our goal is to develop theory and algorithms for establishing fundamental limits on performance for a given task imposed by a robot's sensors. In order to achieve this, we define a quantity that captures the amount of task-relevant information provided by a sensor. Using a novel version of the generalized Fano inequality from information theory, we demonstrate that this quantity provides an upper bound on the highest achievable expected reward for one-step decision making tasks. We then extend this bound to multi-step problems via a dynamic programming approach. We present algorithms for numerically computing the resulting bounds, and demonstrate our approach on three examples: (i) the lava problem from the literature on partially observable Markov decision processes, (ii) an example with continuous state and observation spaces corresponding to a robot catching a freely-falling object, and (iii) obstacle avoidance using a depth sensor with non-Gaussian noise. We demonstrate the ability of our approach to establish strong limits on achievable performance for these problems by comparing our upper bounds with achievable lower bounds (computed by synthesizing or learning concrete control policies).
    DADApy: Distance-based Analysis of DAta-manifolds in Python. (arXiv:2205.03373v1 [cs.LG])
    DADApy is a python software package for analysing and characterising high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in toy cases and in a real-world application. The package is freely available under the open-source Apache 2.0 license and can be downloaded from the Github page https://github.com/sissa-data-science/DADApy.
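    A usage sketch, based on our reading of the package's documented interface (the method names are assumptions and should be checked against the current release):

        import numpy as np
        from dadapy import Data

        X = np.random.default_rng(0).normal(size=(1000, 10))  # toy point cloud
        d = Data(X)
        d.compute_id_2NN()          # two-nearest-neighbour intrinsic dimension
        d.compute_clustering_ADP()  # density-peaks-style clustering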
    Designing Robust Biotechnological Processes Regarding Variabilities using Multi-Objective Optimization Applied to a Biopharmaceutical Seed Train Design. (arXiv:2205.03261v1 [cs.LG])
    The development and optimization of biopharmaceutical production processes with cell cultures is cost- and time-consuming and often performed rather empirically. Efficient optimization of multiple objectives, such as process time, viable cell density, number of operating steps and cultivation scales, required medium, amount of product, and product quality, is a promising approach. This contribution presents a workflow that couples uncertainty-based upstream simulation and Bayesian optimization using Gaussian processes. Its application is demonstrated in a simulation case study for a relevant industrial task in process development: the design of a robust cell culture expansion process (seed train), meaning that despite uncertainties and variabilities concerning cell growth, low variations of viable cell density during the seed train are obtained. Compared to a non-optimized reference seed train, the optimized process showed much lower deviation rates regarding viable cell densities (<~10% instead of 41.7%) using 5 or 4 shake flask scales, and the seed train duration could be reduced by 56 h from 576 h to 520 h. Overall, it is shown that applying Bayesian optimization allows for the optimization of a multi-objective function with several optimizable input variables and under a considerable number of constraints with low computational effort. This approach has the potential to be used as a decision tool, e.g., for the choice of an optimal and robust seed train design, or for further optimization tasks within process development.  ( 2 min )
    PTFlash: A deep learning framework for isothermal two-phase equilibrium calculations. (arXiv:2205.03090v1 [physics.chem-ph])
    Phase equilibrium calculations are an essential part of numerical simulations of multi-component multi-phase flow in porous media, accounting for the largest share of the computational time. In this work, we introduce a GPU-enabled, fast, and parallel framework, PTFlash, that vectorizes the algorithms required for isothermal two-phase flash calculations using PyTorch, and can facilitate a wide range of downstream applications. In addition, to further accelerate PTFlash, we design two task-specific neural networks, one for predicting the stability of given mixtures and the other for providing estimates of the distribution coefficients, which are trained offline and help shorten computation time by sidestepping stability analysis and reducing the number of iterations required to reach convergence. The evaluation of PTFlash was conducted on three case studies involving hydrocarbons, CO$_2$ and N$_2$, for which the phase equilibrium was tested over a large range of temperature, pressure and composition conditions, using the Soave-Redlich-Kwong (SRK) equation of state. We compare PTFlash with an in-house thermodynamic library, Carnot, written in C++ and performing flash calculations one by one on CPU. Results show speed-ups of up to two orders of magnitude on large-scale calculations, while maintaining perfect precision with the reference solution provided by Carnot.  ( 2 min )
    On boundary conditions parametrized by analytic functions. (arXiv:2205.03185v1 [cs.LG])
    Computer algebra can answer various questions about partial differential equations using symbolic algorithms. However, the inclusion of data into equations is rare in computer algebra. Therefore, recently, computer algebra models have been combined with Gaussian processes, a regression model in machine learning, to describe the behavior of certain differential equations under data. While it was previously possible to describe polynomial boundary conditions in this context, we extend these models to analytic boundary conditions. Additionally, we describe the necessary algorithms for Gröbner and Janet bases of Weyl algebras with certain analytic coefficients. Using these algorithms, we provide examples of divergence-free flow in domains bounded by analytic functions and adapted to observations.  ( 2 min )
    Federated Learning with Noisy User Feedback. (arXiv:2205.03092v1 [cs.LG])
    Machine Learning (ML) systems are getting increasingly popular, and drive more and more applications and services in our daily life. This has led to growing concerns over user privacy, since human interaction data typically needs to be transmitted to the cloud in order to train and improve such systems. Federated learning (FL) has recently emerged as a method for training ML models on edge devices using sensitive user data and is seen as a way to mitigate concerns over data privacy. However, since ML models are most commonly trained with label supervision, we need a way to extract labels on the edge to make FL viable. In this work, we propose a strategy for training FL models using positive and negative user feedback. We also design a novel framework to study different noise patterns in user feedback, and explore how well standard noise-robust objectives can help mitigate this noise when training models in a federated setting. We evaluate our proposed training setup through detailed experiments on two text classification datasets and analyze the effects of varying levels of user reliability and feedback noise on model performance. We show that our method improves substantially over a self-training baseline, achieving performance closer to models trained with full supervision.  ( 2 min )
    Investigating and Explaining the Frequency Bias in Image Classification. (arXiv:2205.03154v1 [cs.CV])
    CNNs exhibit many behaviors different from humans, one of which is the capability of employing high-frequency components. This paper discusses the frequency bias phenomenon in image classification tasks: the high-frequency components are actually much less exploited than the low- and mid-frequency components. We first investigate the frequency bias phenomenon by presenting two observations on feature discrimination and learning priority. Furthermore, we hypothesize that (i) the spectral density and (ii) the class consistency of datasets directly affect the frequency bias. Specifically, our investigations verify that the spectral density of datasets mainly affects the learning priority, while the class consistency mainly affects the feature discrimination.  ( 2 min )
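    A common way to probe this kind of bias, sketched below in Python, is to band-limit images by masking their centred Fourier spectrum and compare classifier accuracy on the low- and high-frequency versions (a standard diagnostic in this literature, not necessarily the paper's exact protocol):

        import numpy as np

        def frequency_band(img, radius, keep_low=True):
            # Keep only low- (or high-) frequency content of a grayscale image.
            F = np.fft.fftshift(np.fft.fft2(img))
            h, w = img.shape
            yy, xx = np.ogrid[:h, :w]
            dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
            mask = dist <= radius if keep_low else dist > radius
            return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))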
    TTRS: Tinkoff Transactions Recommender System benchmark. (arXiv:2110.05589v2 [cs.LG] UPDATED)
    Over the past decade, tremendous progress has been made in inventing new RecSys methods. However, one of the fundamental problems of the RecSys research community remains the lack of applied datasets and benchmarks with well-defined evaluation rules and metrics to test these novel approaches. In this article, we present the TTRS - Tinkoff Transactions Recommender System benchmark. This financial transaction benchmark contains over 2 million interactions between almost 10,000 users and more than 1,000 merchant brands over 14 months. To the best of our knowledge, this is the first publicly available financial transactions dataset. To make it more suitable for possible applications, we provide a complete description of the data collection pipeline, its preprocessing, and the resulting dataset statistics. We also present a comprehensive comparison of the current popular RecSys methods on the next-period recommendation task and conduct a detailed analysis of their performance against various metrics and datasets.  ( 2 min )
    Building a 3-Player Mahjong AI using Deep Reinforcement Learning. (arXiv:2202.12847v2 [cs.AI] UPDATED)
    Mahjong is a popular multi-player imperfect-information game developed in China in the late 19th century, with some very challenging features for AI research. Sanma, a 3-player variant of Japanese Riichi Mahjong, possesses unique characteristics, including fewer tiles and, consequently, a more aggressive playing style. It is thus challenging and of great research interest in its own right, but has not yet been explored. In this paper, we present Meowjong, an AI for Sanma using deep reinforcement learning. We define an informative and compact 2-dimensional data structure for encoding the observable information in a Sanma game. We pre-train 5 convolutional neural networks (CNNs) for Sanma's 5 actions -- discard, Pon, Kan, Kita and Riichi -- and enhance the major action's model, namely the discard model, via self-play reinforcement learning using the Monte Carlo policy gradient method. Meowjong's models achieve test accuracies comparable with AIs for 4-player Mahjong through supervised learning, and gain a significant further enhancement from reinforcement learning. Being the first ever AI for Sanma, we claim that Meowjong represents the state of the art in this game.  ( 2 min )
    Towards QD-suite: developing a set of benchmarks for Quality-Diversity algorithms. (arXiv:2205.03207v1 [cs.LG])
    While the field of Quality-Diversity (QD) has grown into a distinct branch of stochastic optimization, a few problems, in particular locomotion and navigation tasks, have become de facto standards. Are such benchmarks sufficient? Are they representative of the key challenges faced by QD algorithms? Do they provide the ability to focus on one particular challenge by properly disentangling it from others? Do they have much predictive power in terms of scalability and generalization? Existing benchmarks are not standardized, and there is currently no MNIST equivalent for QD. Inspired by recent works on Reinforcement Learning benchmarks, we argue that the identification of challenges faced by QD methods and the development of targeted, challenging, scalable but affordable benchmarks is an important step. As an initial effort, we identify three problems that are challenging in sparse reward settings, and propose associated benchmarks: (1) Behavior metric bias, which can result from the use of metrics that do not match the structure of the behavior space. (2) Behavioral Plateaus, with varying characteristics, such that escaping them would require adaptive QD algorithms and (3) Evolvability Traps, where small variations in genotype result in large behavioral changes. The environments that we propose satisfy the properties listed above.  ( 2 min )
    Remote Blood Oxygen Estimation From Videos Using Neural Networks. (arXiv:2107.05087v2 [cs.LG] UPDATED)
    Blood oxygen saturation (SpO$_2$) is an essential indicator of respiratory functionality and is receiving increasing attention during the COVID-19 pandemic. Clinical findings show that it is possible for COVID-19 patients to have significantly low SpO$_2$ before any obvious symptoms. The prevalence of cameras has motivated researchers to investigate methods for monitoring SpO$_2$ using videos. Most prior schemes involving smartphones are contact-based: They require a fingertip to cover the phone's camera and the nearby light source to capture re-emitted light from the illuminated tissue. In this paper, we propose the first convolutional neural network based noncontact SpO$_2$ estimation scheme using smartphone cameras. The scheme analyzes the videos of a participant's hand for physiological sensing, which is convenient and comfortable, and can protect their privacy and allow for keeping face masks on. We design our neural network architectures inspired by the optophysiological models for SpO$_2$ measurement and demonstrate the explainability by visualizing the weights for channel combination. Our proposed models outperform the state-of-the-art model that is designed for contact-based SpO$_2$ measurement, showing the potential of our proposed method to contribute to public health. We also analyze the impact of skin type and the side of a hand on SpO$_2$ estimation performance.  ( 2 min )
    TAGLETS: A System for Automatic Semi-Supervised Learning with Auxiliary Data. (arXiv:2111.04798v3 [cs.LG] UPDATED)
    Machine learning practitioners often have access to a spectrum of data: labeled data for the target task (which is often limited), unlabeled data, and auxiliary data, the many available labeled datasets for other tasks. We describe TAGLETS, a system built to study techniques for automatically exploiting all three types of data and creating high-quality, servable classifiers. The key components of TAGLETS are: (1) auxiliary data organized according to a knowledge graph, (2) modules encapsulating different methods for exploiting auxiliary and unlabeled data, and (3) a distillation stage in which the ensembled modules are combined into a servable model. We compare TAGLETS with state-of-the-art transfer learning and semi-supervised learning methods on four image classification tasks. Our study covers a range of settings, varying the amount of labeled data and the semantic relatedness of the auxiliary data to the target task. We find that the intelligent incorporation of auxiliary and unlabeled data into multiple learning techniques enables TAGLETS to match, and most often significantly surpass, these alternatives. TAGLETS is available as an open-source system at github.com/BatsResearch/taglets.  ( 2 min )
    Optimal quantum dataset for learning a unitary transformation. (arXiv:2203.00546v2 [quant-ph] UPDATED)
    Unitary transformations formulate the time evolution of quantum states. How to learn a unitary transformation efficiently is a fundamental problem in quantum machine learning. The most natural and leading strategy is to train a quantum machine learning model based on a quantum dataset. Although the presence of more training data results in better models, using too much data reduces the efficiency of training. In this work, we solve the problem of the minimum size of a sufficient quantum dataset for learning a unitary transformation exactly, which reveals the power and limitation of quantum data. First, we prove that the minimum size of a dataset with pure states is $2^n$ for learning an $n$-qubit unitary transformation. To fully explore the capability of quantum data, we introduce a practical quantum dataset consisting of $n+1$ elementary tensor product states that is sufficient for exact training. The main idea is to simplify the structure utilizing decoupling, which leads to an exponential improvement in size over datasets with pure states. Furthermore, we show that the size of a quantum dataset with mixed states can be reduced to a constant, which yields an optimal quantum dataset for learning a unitary. We showcase the applications of our results in oracle compiling and Hamiltonian simulation. Notably, to accurately simulate a 3-qubit one-dimensional nearest-neighbor Heisenberg model, our circuit uses only $48$ elementary quantum gates, which is significantly fewer than the $4320$ gates in the circuit constructed by the Trotter-Suzuki product formula.  ( 2 min )
    LPGNet: Link Private Graph Networks for Node Classification. (arXiv:2205.03105v1 [cs.LG])
    Classification tasks on labeled graph-structured data have many important applications ranging from social recommendation to financial modeling. Deep neural networks are increasingly being used for node classification on graphs, wherein nodes with similar features have to be given the same label. Graph convolutional networks (GCNs) are one such widely studied neural network architecture that perform well on this task. However, powerful link-stealing attacks on GCNs have recently shown that even with black-box access to the trained model, inferring which links (or edges) are present in the training graph is practical. In this paper, we present a new neural network architecture called LPGNet for training on graphs with privacy-sensitive edges. LPGNet provides differential privacy (DP) guarantees for edges using a novel design for how graph edge structure is used during training. We empirically show that LPGNet models often lie in the sweet spot between providing privacy and utility: They can offer better utility than "trivially" private architectures which use no edge information (e.g., vanilla MLPs) and better resilience against existing link-stealing attacks than vanilla GCNs which use the full edge structure. LPGNet also offers consistently better privacy-utility tradeoffs than DPGCN, which is the state-of-the-art mechanism for retrofitting differential privacy into conventional GCNs, in most of our evaluated datasets.  ( 2 min )
    Controlled Dropout for Uncertainty Estimation. (arXiv:2205.03109v1 [cs.LG])
    Uncertainty quantification in a neural network is one of the most discussed topics for safety-critical applications. Though Neural Networks (NNs) have achieved state-of-the-art performance for many applications, they still provide unreliable point predictions, which lack information about uncertainty estimates. Among various methods to enable neural networks to estimate uncertainty, Monte Carlo (MC) dropout has gained much popularity in a short period due to its simplicity. In this study, we present a new version of the traditional dropout layer in which the number of dropout configurations is fixed; this new dropout layer can then be applied in the MC method to quantify the uncertainty associated with NN predictions. We conduct experiments on both toy and realistic datasets and compare the results with the MC method using the traditional dropout layer. Performance analysis utilizing uncertainty evaluation metrics corroborates that our dropout layer offers better performance in most cases.  ( 2 min )
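    A minimal PyTorch sketch of a dropout layer with a fixed pool of configurations (the pool size and mask sampling are illustrative assumptions):

        import torch
        import torch.nn as nn

        class ControlledDropout(nn.Module):
            def __init__(self, n_features, p=0.5, n_configs=10):
                super().__init__()
                # Pre-sample a fixed pool of scaled dropout masks.
                masks = (torch.rand(n_configs, n_features) > p).float() / (1 - p)
                self.register_buffer("masks", masks)

            def forward(self, x, config_idx):
                # Deterministic given config_idx, unlike standard dropout.
                return x * self.masks[config_idx]

    At inference, one forward pass per stored configuration yields an MC-style predictive mean and spread, with the number of distinct configurations known in advance.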
    Newton-MR: Inexact Newton Method With Minimum Residual Sub-problem Solver. (arXiv:1810.00303v4 [math.OC] UPDATED)
    We consider a variant of inexact Newton Method, called Newton-MR, in which the least-squares sub-problems are solved approximately using Minimum Residual method. By construction, Newton-MR can be readily applied for unconstrained optimization of a class of non-convex problems known as invex, which subsumes convexity as a sub-class. For invex optimization, instead of the classical Lipschitz continuity assumptions on gradient and Hessian, Newton-MR's global convergence can be guaranteed under a weaker notion of joint regularity of Hessian and gradient. We also obtain Newton-MR's problem-independent local convergence to the set of minima. We show that fast local/global convergence can be guaranteed under a novel inexactness condition, which, to our knowledge, is much weaker than the prior related works. Numerical results demonstrate the performance of Newton-MR as compared with several other Newton-type alternatives on a few machine learning problems.  ( 2 min )
    How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation. (arXiv:2205.03353v1 [cs.RO])
    Reinforcement learning (RL) has been shown to be effective at learning control from experience. However, RL typically requires a large amount of online interaction with the environment. This limits its applicability to real-world settings, such as in robotics, where such interaction is expensive. In this work we investigate ways to minimize online interactions in a target task, by reusing a suboptimal policy we might have access to, for example from training on related prior tasks, or in simulation. To this end, we develop two RL algorithms that can speed up training by using not only the action distributions of teacher policies, but also data collected by such policies on the task at hand. We conduct a thorough experimental study of how to use suboptimal teachers on a challenging robotic manipulation benchmark on vision-based stacking with diverse objects. We compare our methods to offline, online, offline-to-online, and kickstarting RL algorithms. By doing so, we find that training on data from both the teacher and student enables the best performance for limited data budgets. We examine how to best allocate a limited data budget -- on the target task -- between the teacher and the student policy, and report experiments using varying budgets, two teachers with different degrees of suboptimality, and five stacking tasks that require a diverse set of behaviors. Our analysis, both in simulation and in the real world, shows that our approach is the best across data budgets, while standard offline RL from teacher rollouts is surprisingly effective when enough data is given.  ( 2 min )
    HumanAL: Calibrating Human Matching Beyond a Single Task. (arXiv:2205.03209v1 [cs.DB])
    This work offers a novel view on the use of human input as labels, acknowledging that humans may err. We build a behavioral profile for human annotators which is used as a feature representation of the provided input. We show that by utilizing black-box machine learning, we can take into account human behavior and calibrate their input to improve the labeling quality. To support our claims and provide a proof-of-concept, we experiment with three different matching tasks, namely, schema matching, entity matching and text matching. Our empirical evaluation suggests that the method can improve the quality of gathered labels in multiple settings including cross-domain (across different matching tasks).  ( 2 min )
    Beyond backpropagation: implicit gradients for bilevel optimization. (arXiv:2205.03076v1 [cs.LG])
    This paper reviews gradient-based techniques to solve bilevel optimization problems. Bilevel optimization is a general way to frame the learning of systems that are implicitly defined through a quantity that they minimize. This characterization can be applied to neural networks, optimizers, algorithmic solvers and even physical systems, and allows for greater modeling flexibility compared to an explicit definition of such systems. Here we focus on gradient-based approaches that solve such problems. We distinguish them in two categories: those rooted in implicit differentiation, and those that leverage the equilibrium propagation theorem. We present the mathematical foundations that are behind such methods, introduce the gradient-estimation algorithms in detail and compare the competitive advantages of the different approaches.  ( 2 min )
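    For the implicit-differentiation family, the hypergradient follows from the implicit function theorem: if w*(theta) minimizes an inner objective g(w, theta) and the outer loss is f(w*(theta), theta), then dw*/dtheta = -H_ww^{-1} H_wtheta at the inner optimum. A minimal Python sketch (the derivative arrays are assumed to be supplied by the caller):

        import numpy as np

        def implicit_hypergradient(grad_w_f, grad_th_f, hess_ww_g, jac_wth_g):
            # grad_w_f  : (d_w,)       outer gradient in w at the inner optimum
            # grad_th_f : (d_th,)      direct outer gradient in theta
            # hess_ww_g : (d_w, d_w)   inner Hessian in w (assumed invertible)
            # jac_wth_g : (d_w, d_th)  mixed second derivatives of g
            v = np.linalg.solve(hess_ww_g, grad_w_f)   # solve, never invert
            return grad_th_f - jac_wth_g.T @ v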
    Fast Rate Generalization Error Bounds: Variations on a Theme. (arXiv:2205.03131v1 [cs.IT])
    A recent line of works, initiated by Russo and Zou (2016) and Xu and Raginsky (2017), has shown that the generalization error of a learning algorithm can be upper bounded by information measures. In most of the relevant works, the convergence rate of the expected generalization error is of the form $O(\sqrt{\lambda/n})$, where $\lambda$ is some information-theoretic quantity such as the mutual information between the data sample and the learned hypothesis. However, such a learning rate is typically considered to be "slow", compared to a "fast rate" of $O(1/n)$ in many learning scenarios. In this work, we first show that the square root does not necessarily imply a slow rate, and a fast rate ($O(1/n)$) result can still be obtained using this bound under appropriate assumptions. Furthermore, we identify the key conditions needed for the fast rate generalization error, which we call the $(\eta,c)$-central condition. Under this condition, we give information-theoretic bounds on the generalization error and excess risk, with a convergence rate of $O(\lambda/n)$ for specific learning algorithms such as empirical risk minimization. Finally, analytical examples are given to show the effectiveness of the bounds.  ( 2 min )
    Characterizing TMS-EEG perturbation indexes using signal energy: initial study on Alzheimer's Disease classification. (arXiv:2205.03241v1 [eess.SP])
    Transcranial Magnetic Stimulation (TMS) combined with EEG recordings (TMS-EEG) has shown great potential in the study of the brain and in particular of Alzheimer's Disease (AD). In this study, we propose an automatic method of determining the duration of TMS-induced perturbation of the EEG signal as a potential metric reflecting the brain's functional alterations. A preliminary study is conducted in patients with AD. Three metrics for characterizing the strength and duration of TMS-evoked EEG (TEP) activity are proposed, and their potential in distinguishing AD patients from healthy controls was investigated. A dataset of TMS-EEG recordings from 17 AD patients and 17 healthy controls (HC) was used in our analysis. A Random Forest classification algorithm was trained on the extracted TEP metrics and its performance evaluated in a leave-one-subject-out cross-validation. The resulting model showed promising results in identifying AD patients from HC, with an accuracy, sensitivity and specificity of 69.32%, 72.23% and 66.41%, respectively.  ( 2 min )
    Detection of Propaganda Techniques in Visuo-Lingual Metaphor in Memes. (arXiv:2205.02937v1 [cs.CV])
    The exponential rise of social media networks has allowed the production, distribution, and consumption of data at a phenomenal rate. Moreover, the social media revolution has brought a unique phenomenon to social media platforms called Internet memes. Internet memes are one of the most popular contents used on social media, and they can be in the form of images with a witty, catchy, or satirical text description. In this paper, we address propaganda, which has recently become common in Internet memes. Propaganda is communication, which frequently includes psychological and rhetorical techniques to manipulate or influence an audience to act or respond as the propagandist wants. To detect propaganda in Internet memes, we propose a multimodal deep learning fusion system that fuses the text and image feature representations and outperforms individual models based solely on either text or image modalities.  ( 2 min )
    Emp-RFT: Empathetic Response Generation via Recognizing Feature Transitions between Utterances. (arXiv:2205.03112v1 [cs.CL])
    Each utterance in multi-turn empathetic dialogues has features such as emotion, keywords, and utterance-level meaning. Feature transitions between utterances occur naturally. However, existing approaches fail to perceive the transitions because they extract features for the context at a coarse-grained level. To solve the above issue, we propose a novel approach of recognizing feature transitions between utterances, which helps understand the dialogue flow and better grasp the features of utterances that need attention. Also, we introduce a response generation strategy to help focus on emotion and keywords related to appropriate features when generating responses. Experimental results show that our approach outperforms baselines and, especially, achieves significant improvements on multi-turn dialogues.  ( 2 min )
    Scalable computation of prediction intervals for neural networks via matrix sketching. (arXiv:2205.03194v1 [stat.ML])
    Accounting for the uncertainty in the predictions of modern neural networks is a challenging and important task in many domains. Existing algorithms for uncertainty estimation require modifying the model architecture and training procedure (e.g., Bayesian neural networks) or dramatically increase the computational cost of predictions such as approaches based on ensembling. This work proposes a new algorithm that can be applied to a given trained neural network and produces approximate prediction intervals. The method is based on the classical delta method in statistics but achieves computational efficiency by using matrix sketching to approximate the Jacobian matrix. The resulting algorithm is competitive with state-of-the-art approaches for constructing predictive intervals on various regression datasets from the UCI repository.  ( 2 min )
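    A condensed Python sketch of the sketched delta method for a single test point (the Gaussian sketch and Gauss-Newton approximation are our illustrative reading; details in the paper may differ):

        import numpy as np

        def delta_method_interval(J_train, j_test, sigma2, lam=1e-3,
                                  sketch_dim=256, rng=np.random.default_rng(0)):
            # J_train: (n, p) Jacobian of training predictions w.r.t. weights;
            # j_test : (p,) Jacobian of the test prediction.
            n, p = J_train.shape
            S = rng.standard_normal((sketch_dim, n)) / np.sqrt(sketch_dim)
            Js = S @ J_train                      # sketched Jacobian (m x p)
            G = Js.T @ Js + lam * np.eye(p)       # approximates J^T J + lam I
            var = sigma2 * (1.0 + j_test @ np.linalg.solve(G, j_test))
            return 1.96 * np.sqrt(var)            # half-width of a ~95% interval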
    Belief propagation for permutations, rankings, and partial orders. (arXiv:2110.00513v2 [cs.AI] UPDATED)
    Many datasets give partial information about an ordering or ranking by indicating which team won a game, which item a user prefers, or who infected whom. We define a continuous spin system whose Gibbs distribution is the posterior distribution on permutations, given a probabilistic model of these interactions. Using the cavity method we derive a belief propagation algorithm that computes the marginal distribution of each node's position. In addition, the Bethe free energy lets us approximate the number of linear extensions of a partial order and perform model selection between competing probabilistic models, such as the Bradley-Terry-Luce model of noisy comparisons and its cousins.  ( 2 min )
    Learning Optimal Propagation for Graph Neural Networks. (arXiv:2205.02998v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved tremendous success in a variety of real-world applications by relying on fixed graph data as input. However, the initial input graph might not be optimal for specific downstream tasks, because of information scarcity, noise, adversarial attacks, or discrepancies between the distributions of the graph topology, features, and ground-truth labels. In this paper, we propose a bi-level optimization-based approach for learning the optimal graph structure by directly learning the Personalized PageRank propagation matrix as well as the downstream semi-supervised node classification simultaneously. We also explore a low-rank approximation model for further reducing the time complexity. Empirical evaluations show the superior efficacy and robustness of the proposed model over all baseline methods.  ( 2 min )
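    For reference, the fixed (non-learned) Personalized PageRank propagation that the paper generalizes can be approximated by a short power iteration, as in the Python sketch below (self-loops and the learned parametrization are omitted; the graph is assumed to have no isolated nodes):

        import numpy as np

        def ppr_propagate(A, H, alpha=0.1, n_iters=10):
            # Z <- (1 - alpha) * A_hat @ Z + alpha * H approximates the
            # Personalized PageRank diffusion of the node features H.
            deg = A.sum(1)
            A_hat = A / np.sqrt(np.outer(deg, deg))   # symmetric normalization
            Z = H.copy()
            for _ in range(n_iters):
                Z = (1 - alpha) * (A_hat @ Z) + alpha * H
            return Z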
    Explaining the Effectiveness of Multi-Task Learning for Efficient Knowledge Extraction from Spine MRI Reports. (arXiv:2205.02979v1 [cs.LG])
    Pretrained Transformer-based models finetuned on domain-specific corpora have changed the landscape of NLP. However, training or fine-tuning these models for individual tasks can be time-consuming and resource-intensive. Thus, a lot of current research is focused on using transformers for multi-task learning (Raffel et al., 2020) and on how to group tasks to help a multi-task model learn effective representations that can be shared across tasks (Standley et al., 2020; Fifty et al., 2021). In this work, we show that a single multi-tasking model can match the performance of task-specific models when the task-specific models show similar representations across all of their hidden layers and their gradients are aligned, i.e., their gradients follow the same direction. We hypothesize that the above observations explain the effectiveness of multi-task learning. We validate our observations on our internal radiologist-annotated datasets on the cervical and lumbar spine. Our method is simple and intuitive, and can be used in a wide range of NLP problems.  ( 2 min )
    Design Target Achievement Index: A Differentiable Metric to Enhance Deep Generative Models in Multi-Objective Inverse Design. (arXiv:2205.03005v1 [cs.LG])
    Deep Generative Machine Learning Models have been growing in popularity across the design community thanks to their ability to learn and mimic complex data distributions. While early works are promising, further advancement will depend on addressing several critical considerations such as design quality, feasibility, novelty, and targeted inverse design. We propose the Design Target Achievement Index (DTAI), a differentiable, tunable metric that scores a design's ability to achieve designer-specified minimum performance targets. We demonstrate that DTAI can drastically improve the performance of generated designs when directly used as a training loss in Deep Generative Models. We apply the DTAI loss to a Performance-Augmented Diverse GAN (PaDGAN) and demonstrate superior generative performance compared to a set of baseline Deep Generative Models including a Multi-Objective PaDGAN and specialized tabular generation algorithms like the Conditional Tabular GAN (CTGAN). We further enhance PaDGAN with an auxiliary feasibility classifier to encourage feasible designs. To evaluate methods, we propose a comprehensive set of evaluation metrics for generative methods that focus on feasibility, diversity, and satisfaction of design performance targets. Methods are tested on a challenging benchmarking problem: the FRAMED bicycle frame design dataset featuring mixed-datatype parametric data, heavily skewed and multimodal distributions, and ten competing performance objectives.  ( 2 min )
    A CNN Approach for 5G mmWave Positioning Using Beamformed CSI Measurements. (arXiv:2205.03236v1 [eess.SP])
    The advent of Artificial Intelligence (AI) has impacted all aspects of human life. One concrete example of AI impact is visible in radio positioning. In this article, we utilize the power of AI, for the first time, by training a Convolutional Neural Network (CNN) using 5G New Radio (NR) fingerprints consisting of beamformed Channel State Information (CSI). By observing CSI, it is possible to characterize the multipath channel between the transmitter and the receiver, and thus provide a good source of spatiotemporal data for finding the position of a User Equipment (UE). We collect ray-tracing-based 5G NR CSI from an urban area. The CSI data of the signals from one Base Station (BS) is collected at reference points with known positions to train a CNN. We evaluate our work by testing: a) the robustness of the trained network in estimating positions for new measurements on the same reference points, and b) the accuracy of the CNN-based position estimation while the UE is on points other than the reference points. The results show that our trained network for a specific urban environment can estimate the UE position with a minimum mean error of 0.98 m.  ( 2 min )
    Dynamic Sparse Training for Deep Reinforcement Learning. (arXiv:2106.04217v3 [cs.LG] UPDATED)
    Deep reinforcement learning (DRL) agents are trained through trial-and-error interactions with the environment. This leads to a long training time for dense neural networks to achieve good performance. Hence, prohibitive computation and memory resources are consumed. Recently, learning efficient DRL agents has received increasing attention. Yet, current methods focus on accelerating inference time. In this paper, we introduce for the first time a dynamic sparse training approach for deep reinforcement learning to accelerate the training process. The proposed approach trains a sparse neural network from scratch and dynamically adapts its topology to the changing data distribution during training. Experiments on continuous control tasks show that our dynamic sparse agents achieve higher performance than the equivalent dense methods, reduce the parameter count and floating-point operations (FLOPs) by 50%, and have a faster learning speed that enables reaching the performance of dense agents with 40-50% reduction in the training steps.  ( 2 min )
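    A minimal sketch of the prune-and-grow topology update that dynamic sparse training performs periodically: magnitude pruning of active weights followed by random regrowth (a simplified variant; the paper's exact drop/grow criteria may differ):

```python
import numpy as np

def prune_and_grow(weights, mask, drop_frac=0.1, rng=None):
    """One topology update: prune the smallest-magnitude active weights,
    then grow an equal number of connections at random inactive positions."""
    rng = rng or np.random.default_rng()
    active = np.flatnonzero(mask)
    n_drop = int(drop_frac * active.size)
    order = np.argsort(np.abs(weights.flat[active]))
    drop = active[order[:n_drop]]
    mask.flat[drop] = 0                      # prune
    weights.flat[drop] = 0.0
    inactive = np.flatnonzero(mask == 0)
    grow = rng.choice(inactive, size=n_drop, replace=False)
    mask.flat[grow] = 1                      # grow; new weights start at zero
    return weights, mask
```

    Because the sparsity level is constant throughout, both training and inference FLOPs stay at the reduced budget, which is the source of the reported savings.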
    Green Accelerated Hoeffding Tree. (arXiv:2205.03184v1 [cs.LG])
    State-of-the-art machine learning solutions mainly focus on creating highly accurate models without constraints on hardware resources. Stream mining algorithms are designed to run on resource-constrained devices, thus a focus on low power consumption and energy and memory efficiency is essential. The Hoeffding tree algorithm is able to create energy-efficient models, but at the cost of less accurate trees in comparison to their ensemble counterparts. Ensembles of Hoeffding trees, on the other hand, create a highly accurate forest of trees but consume five times more energy on average. An extension that tried to obtain results similar to ensembles of Hoeffding trees was the Extremely Fast Decision Tree (EFDT). This paper presents the Green Accelerated Hoeffding Tree (GAHT) algorithm, an extension of the EFDT algorithm with a lower energy and memory footprint and the same (or, for some datasets, higher) accuracy levels. GAHT grows the tree by setting individual splitting criteria for each node, based on the distribution of the number of instances over each particular leaf. The results show that GAHT achieves accuracy competitive with EFDT and ensembles of Hoeffding trees while reducing energy consumption by up to 70%.  ( 2 min )
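    The split decision at the heart of every Hoeffding-tree variant rests on the Hoeffding bound; a short sketch of that standard computation (textbook form, not code from the paper):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding epsilon: with probability 1 - delta, the observed mean of
    n samples of a variable with range value_range lies within epsilon of
    the true mean. A Hoeffding tree splits a leaf once the gap between the
    two best attributes' heuristic scores exceeds this epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# e.g. information gain with 2 classes: range = log2(2) = 1
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=5000)  # ~0.04
```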
    A Deep Bayesian Bandits Approach for Anticancer Therapy: Exploration via Functional Prior. (arXiv:2205.02944v1 [cs.LG])
    Learning personalized cancer treatment with machine learning holds great promise to improve cancer patients' chance of survival. Despite recent advances in machine learning and precision oncology, this approach remains challenging as collecting data in preclinical/clinical studies for modeling multiple treatment efficacies is often an expensive, time-consuming process. Moreover, randomization in treatment allocation proves to be suboptimal since some participants/samples do not receive the most appropriate treatments during the trial. To address this challenge, we formulate the drug screening study as a "contextual bandit" problem, in which an algorithm selects anticancer therapeutics based on contextual information about cancer cell lines while adapting its treatment strategy to maximize treatment response in an "online" fashion. We propose a novel deep Bayesian bandits framework that uses a functional prior to approximate the posterior for drug response prediction based on multi-modal information consisting of genomic features and drug structure. We empirically evaluate our method on three large-scale in vitro pharmacogenomic datasets and show that our approach outperforms several benchmarks in identifying the optimal treatment for a given cell line.  ( 2 min )
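    To make the contextual-bandit formulation concrete, here is a deliberately simplified stand-in: Thompson sampling with a Bayesian linear model per arm, rather than the paper's deep model with a functional prior (all class and variable names are ours):

```python
import numpy as np

class LinearThompsonBandit:
    """One Bayesian linear-regression head per treatment arm; each round we
    draw parameters from the posterior and pick the best sampled arm."""

    def __init__(self, n_arms, dim, noise_var=1.0, prior_var=10.0):
        self.A = [np.eye(dim) / prior_var for _ in range(n_arms)]  # precision
        self.b = [np.zeros(dim) for _ in range(n_arms)]
        self.noise_var = noise_var

    def select(self, context, rng):
        sampled = []
        for A, b in zip(self.A, self.b):
            cov = np.linalg.inv(A)
            theta = rng.multivariate_normal(cov @ b, cov)  # posterior draw
            sampled.append(context @ theta)
        return int(np.argmax(sampled))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context) / self.noise_var
        self.b[arm] += context * reward / self.noise_var
```

    The "context" here would correspond to cell-line genomic features; replacing the linear head with a deep network and a functional prior is the paper's actual contribution.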
    FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks. (arXiv:2104.08815v3 [cs.CL] UPDATED)
    Increasing concerns and regulations about data privacy and sparsity necessitate the study of privacy-preserving, decentralized learning methods for natural language processing (NLP) tasks. Federated learning (FL) provides promising approaches for a large number of clients (e.g., personal devices or organizations) to collaboratively learn a shared global model that benefits all clients while allowing users to keep their data locally. Despite interest in studying FL methods for NLP tasks, a systematic comparison and analysis is lacking in the literature. Herein, we present FedNLP, a benchmarking framework for evaluating federated learning methods on four different task formulations: text classification, sequence tagging, question answering, and seq2seq. We propose a universal interface between Transformer-based language models (e.g., BERT, BART) and FL methods (e.g., FedAvg, FedOPT, etc.) under various non-IID partitioning strategies. Our extensive experiments with FedNLP provide empirical comparisons between FL methods and help us better understand the inherent challenges of this direction. The comprehensive analysis points to intriguing and exciting future research aimed at developing FL methods for NLP tasks.  ( 2 min )
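    As a reference point, FedAvg, the baseline aggregation rule in such benchmarks, is a dataset-size-weighted average of client parameters; a minimal sketch (client models represented as lists of numpy arrays, one per layer):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """FedAvg aggregation: weighted average of client model parameters,
    with weights proportional to local dataset sizes."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[k] * (s / total) for w, s in zip(client_weights, client_sizes))
        for k in range(n_layers)
    ]
```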
    Probabilistic learning constrained by realizations using a weak formulation of Fourier transform of probability measures. (arXiv:2205.03078v1 [stat.ML])
    This paper deals with taking a given set of realizations into account as constraints in the Kullback-Leibler minimum principle, which is used as a probabilistic learning algorithm. This permits the effective integration of data into predictive models. We consider the probabilistic learning of a random vector that is made up of either a quantity of interest (QoI) (unsupervised case) or the couple of the QoI and a control parameter (supervised case). A training set of independent realizations of this random vector is assumed to be given and to be generated with a prior probability measure that is unknown. A target set of realizations of the QoI is available for the two considered cases. The framework is that of non-Gaussian problems in high dimension. A functional approach is developed on the basis of a weak formulation of the Fourier transform of probability measures (characteristic functions). The construction makes it possible to take the target set of realizations of the QoI into account in the Kullback-Leibler minimum principle. The proposed approach allows for estimating the posterior probability measure of the QoI (unsupervised case) or the posterior joint probability measure of the QoI with the control parameter (supervised case). The existence and the uniqueness of the posterior probability measure are analyzed for the two cases. The numerical aspects are detailed in order to facilitate the implementation of the proposed method. The presented application in high dimension demonstrates the efficiency and the robustness of the proposed algorithm.  ( 2 min )
    Imperceptible Backdoor Attack: From Input Space to Feature Representation. (arXiv:2205.03190v1 [cs.CR])
    Backdoor attacks are rapidly emerging threats to deep neural networks (DNNs). In the backdoor attack scenario, attackers usually implant the backdoor into the target model by manipulating the training dataset or training process. The compromised model then behaves normally for benign inputs yet makes mistakes when the pre-defined trigger appears. In this paper, we analyze the drawbacks of existing attack approaches and propose a novel imperceptible backdoor attack. We treat the trigger pattern as a special kind of noise following a multinomial distribution. A U-Net-based network is employed to generate concrete parameters of the multinomial distribution for each benign input. This elaborated trigger ensures that our approach is invisible to both humans and statistical detection. Besides the design of the trigger, we also consider the robustness of our approach against model-diagnosis-based defences. We force the feature representation of malicious inputs stamped with the trigger to be entangled with the benign ones. We demonstrate the effectiveness and robustness of our approach against multiple state-of-the-art defences through extensive experiments on multiple datasets and networks. Our trigger only modifies less than 1\% of the pixels of a benign image while the modification magnitude is 1. Our source code is available at https://github.com/Ekko-zn/IJCAI2022-Backdoor.  ( 2 min )
    The NT-Xent loss upper bound. (arXiv:2205.03169v1 [cs.LG])
    Self-supervised learning is a growing paradigm in deep representation learning, showing great generalization capabilities and competitive performance in low-labeled data regimes. The SimCLR framework proposes the NT-Xent loss for contrastive representation learning. The objective of the loss function is to maximize the agreement (similarity) between sampled positive pairs. This short paper derives and proposes an upper bound for the loss and average similarity. An analysis of the implications is not provided here, but we strongly encourage anyone in the field to conduct it.  ( 2 min )
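    For reference, the NT-Xent loss whose upper bound is derived above is the standard SimCLR objective; a compact PyTorch formulation (ours, not code from the paper):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss: for 2N normalized embeddings, each positive pair
    (i, i+N) is contrasted against the remaining 2N-2 samples."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
    sim = z @ z.t() / tau                                # scaled cosine sims
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```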
    UAV-aided RF Mapping for Sensing and Connectivity in Wireless Networks. (arXiv:2205.03335v1 [cs.IT])
    The use of unmanned aerial vehicles (UAVs) as flying radio access network (RAN) nodes offers a promising complement to traditional fixed terrestrial deployments. More recently, still in the context of wireless networks, drones have also been envisioned for use as radio frequency (RF) sensing and localization devices. In both cases, the advantage of using UAVs lies in their ability to navigate freely in 3D, and in a timely manner, to locations where the obtained network throughput or sensing performance is optimal. In practice, the selection of a proper location or trajectory for the UAV very much depends on local terrain features, including the position of surrounding radio obstacles. Hence, the robot must be able to map the features of its radio environment as it performs its data communication or sensing services. The challenges related to this task, referred to here as radio mapping, are discussed in this paper. Its promise for efficient trajectory design of autonomous, radio-aware UAVs is highlighted, along with algorithmic solutions. The advantages induced by radio mapping in terms of connectivity, sensing, and localization performance are illustrated.  ( 2 min )
    Immiscible Color Flows in Optimal Transport Networks for Image Classification. (arXiv:2205.02938v1 [cs.CV])
    In classification tasks, it is crucial to meaningfully exploit information contained in data. Here, we propose a physics-inspired dynamical system that adapts Optimal Transport principles to effectively leverage color distributions of images. Our dynamics regulates immiscible fluxes of colors traveling on a network built from images. Instead of aggregating colors together, it treats them as different commodities that interact with a shared capacity on edges. Our method outperforms competitor algorithms on image classification tasks in datasets where color information matters.  ( 2 min )
    Crop Type Identification for Smallholding Farms: Analyzing Spatial, Temporal and Spectral Resolutions in Satellite Imagery. (arXiv:2205.03104v1 [cs.CV])
    The integration of modern Machine Learning (ML) models into remote sensing and agriculture has expanded the scope of application of satellite images in the agriculture domain. In this paper, we present how the accuracy of crop type identification improves as we move from medium-spatiotemporal-resolution (MSTR) to high-spatiotemporal-resolution (HSTR) satellite images. We further demonstrate that high spectral resolution in satellite imagery can improve prediction performance for low-spatial-and-temporal-resolution (LSTR) images. The F1-score is increased by 7% when using multispectral data of MSTR images as compared to the best results obtained from HSTR images. Similarly, when crop-season-based time series of multispectral data are used, we observe an increase of 1.2% in the F1-score. These outcomes motivate further advancements in the field of synthetic band generation.  ( 2 min )
    Low-rank Tensor Learning with Nonconvex Overlapped Nuclear Norm Regularization. (arXiv:2205.03059v1 [cs.LG])
    Nonconvex regularization has been popularly used in low-rank matrix learning. However, extending it for low-rank tensor learning is still computationally expensive. To address this problem, we develop an efficient solver for use with a nonconvex extension of the overlapped nuclear norm regularizer. Based on the proximal average algorithm, the proposed algorithm can avoid expensive tensor folding/unfolding operations. A special "sparse plus low-rank" structure is maintained throughout the iterations, and allows fast computation of the individual proximal steps. Empirical convergence is further improved with the use of adaptive momentum. We provide convergence guarantees to critical points on smooth losses and also on objectives satisfying the Kurdyka-{\L}ojasiewicz condition. While the optimization problem is nonconvex and nonsmooth, we show that its critical points still have good statistical performance on the tensor completion problem. Experiments on various synthetic and real-world data sets show that the proposed algorithm is efficient in both time and space and more accurate than the existing state-of-the-art.  ( 2 min )
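    For orientation, the convex building block behind overlapped nuclear norm methods is the proximal step of the nuclear norm, i.e., singular value thresholding; a minimal sketch (the paper's nonconvex extension changes the shrinkage rule and avoids explicit folding/unfolding, which this sketch does not reproduce):

```python
import numpy as np

def svt(X, lam):
    """Singular value thresholding: the proximal operator of lam * ||X||_*.
    Shrinks each singular value by lam and drops the rest, producing the
    low-rank factor exploited by 'sparse plus low-rank' iterations."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    r = int((s_shrunk > 0).sum())
    return (U[:, :r] * s_shrunk[:r]) @ Vt[:r]
```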
    GANs as Gradient Flows that Converge. (arXiv:2205.02910v1 [cs.LG])
    This paper approaches the unsupervised learning problem by gradient descent in the space of probability density functions. Our main result shows that along the gradient flow induced by a distribution-dependent ordinary differential equation (ODE), the unknown data distribution emerges as the long-time limit of this flow of densities. That is, one can uncover the data distribution by simulating the distribution-dependent ODE. Intriguingly, we find that the simulation of the ODE is equivalent to the training of generative adversarial networks (GANs). The GAN framework, by definition a non-cooperative game between a generator and a discriminator, can therefore be viewed alternatively as a cooperative game between a navigator and a calibrator (in collaboration to simulate the ODE). At the theoretical level, this new perspective simplifies the analysis of GANs and gives new insight into their performance. To construct a solution to the distribution-dependent ODE, we first show that the associated nonlinear Fokker-Planck equation has a unique weak solution, using the Crandall-Liggett theorem for differential equations in Banach spaces. From this solution to the Fokker-Planck equation, we construct a unique solution to the ODE, relying on Trevisan's superposition principle. The convergence of the induced gradient flow to the data distribution is obtained by analyzing the Fokker-Planck equation.  ( 2 min )
    Geodesics, Non-linearities and the Archive of Novelty Search. (arXiv:2205.03162v1 [cs.LG])
    The Novelty Search (NS) algorithm was proposed more than a decade ago. However, the mechanisms behind its empirical success are still not well formalized/understood. This short note focuses on the effects of the archive on exploration. Experimental evidence from a few application domains suggests that archive-based NS performs in general better than when novelty is computed solely with respect to the population. An argument that is often encountered in the literature is that the archive prevents exploration from backtracking or cycling, i.e., from revisiting previously encountered areas in the behavior space. We argue that this is not a complete or accurate explanation, as backtracking - besides often being desirable - can actually be enabled by the archive. Through low-dimensional/analytical examples, we show that a key effect of the archive is that it counterbalances the exploration biases that result, among other factors, from the use of inadequate behavior metrics and the non-linearities of the behavior mapping. Our observations seem to hint that attributing a more active role to the archive in sampling can be beneficial.  ( 2 min )
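    For readers unfamiliar with the mechanism under discussion, the archive-based novelty score is typically the mean distance from a behavior descriptor to its k nearest neighbors in the union of the population and the archive; a minimal sketch:

```python
import numpy as np

def novelty(b, population, archive, k=15):
    """Novelty of behavior descriptor b: mean distance to its k nearest
    neighbors among the current population plus the archive."""
    pool = np.vstack([population, archive]) if len(archive) else population
    d = np.linalg.norm(pool - b, axis=1)
    return np.sort(d)[:k].mean()
```

    Computing novelty against the population alone corresponds to passing an empty archive, which is exactly the ablation the note compares against.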
    Real Time On Sensor Gait Phase Detection with 0.5KB Deep Learning Model. (arXiv:2205.03234v1 [eess.SP])
    Gait phase detection with convolutional neural networks provides accurate classification but demands high computational cost, which inhibits real-time, low-power, on-sensor processing. This paper presents a segmentation-based gait phase detection method with a width- and depth-downscaled U-Net-like model that needs only 0.5 KB of model size and 67K operations per second to reach 95.9% accuracy, so that it can easily fit on a resource-limited on-sensor microcontroller.  ( 2 min )
    Sound2Synth: Interpreting Sound via FM Synthesizer Parameters Estimation. (arXiv:2205.03043v1 [cs.SD])
    A synthesizer is a type of electronic musical instrument that is now widely used in modern music production and sound design. Each parameter configuration of a synthesizer produces a unique timbre and can be viewed as a unique instrument. Estimating the parameter configuration that best restores a given sound timbre is an important yet complicated problem, i.e., the synthesizer parameter estimation problem. We propose a multi-modal deep-learning-based pipeline, Sound2Synth, together with a network structure, Prime-Dilated Convolution (PDC), specially designed to solve this problem. Our method achieves not only state-of-the-art but also the first real-world-applicable results on the Dexed synthesizer, a popular FM synthesizer.  ( 2 min )
    New-Onset Diabetes Assessment Using Artificial Intelligence-Enhanced Electrocardiography. (arXiv:2205.02900v1 [cs.LG])
    Undiagnosed diabetes is present in 21.4% of adults with diabetes. Diabetes can remain asymptomatic and undetected due to limitations in screening rates. To address this issue, questionnaires, such as the American Diabetes Association (ADA) Risk test, have been recommended for use by physicians and the public. Based on evidence that blood glucose concentration can affect cardiac electrophysiology, we hypothesized that an artificial intelligence (AI)-enhanced electrocardiogram (ECG) could identify adults with new-onset diabetes. We trained a neural network to estimate HbA1c using a 12-lead ECG and readily available demographics. We retrospectively assembled a dataset comprised of patients with paired ECG and HbA1c data. The population of patients who receive both an ECG and HbA1c may be a biased sample of the complete outpatient population, so we adjusted the importance placed on each patient to generate a more representative pseudo-population. We found ECG-based assessment outperforms the ADA Risk test, achieving a higher area under the curve (0.80 vs. 0.68) and positive predictive value (14% vs. 9%) -- 2.6 times the prevalence of diabetes in the cohort. The AI-enhanced ECG significantly outperforms electrophysiologist interpretation of the ECG, suggesting that the task is beyond current clinical capabilities. Given the prevalence of ECGs in clinics and via wearable devices, such a tool would make precise, automated diabetes assessment widely accessible.  ( 2 min )
    The Road to Explainability is Paved with Bias: Measuring the Fairness of Explanations. (arXiv:2205.03295v1 [cs.LG])
    Machine learning models in safety-critical settings like healthcare are often blackboxes: they contain a large number of parameters which are not transparent to users. Post-hoc explainability methods where a simple, human-interpretable model imitates the behavior of these blackbox models are often proposed to help users trust model predictions. In this work, we audit the quality of such explanations for different protected subgroups using real data from four settings in finance, healthcare, college admissions, and the US justice system. Across two different blackbox model architectures and four popular explainability methods, we find that the approximation quality of explanation models, also known as the fidelity, differs significantly between subgroups. We also demonstrate that pairing explainability methods with recent advances in robust machine learning can improve explanation fairness in some settings. However, we highlight the importance of communicating details of non-zero fidelity gaps to users, since a single solution might not exist across all settings. Finally, we discuss the implications of unfair explanation models as a challenging and understudied problem facing the machine learning community.  ( 2 min )
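    The subgroup fidelity audit described above amounts to measuring blackbox/explainer agreement separately per protected group; a minimal sketch with hypothetical inputs:

```python
import numpy as np

def fidelity_by_group(blackbox_pred, explainer_pred, groups):
    """Fidelity gap audit: fraction of points where the explanation model
    agrees with the blackbox, computed per protected subgroup."""
    out = {}
    for g in np.unique(groups):
        idx = groups == g
        out[g] = float(np.mean(blackbox_pred[idx] == explainer_pred[idx]))
    return out

# a large spread between the returned values is the "fidelity gap"
```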
    Semi-Supervised Imitation Learning of Team Policies from Suboptimal Demonstrations. (arXiv:2205.02959v1 [cs.AI])
    We present Bayesian Team Imitation Learner (BTIL), an imitation learning algorithm to model behavior of teams performing sequential tasks in Markovian domains. In contrast to existing multi-agent imitation learning techniques, BTIL explicitly models and infers the time-varying mental states of team members, thereby enabling learning of decentralized team policies from demonstrations of suboptimal teamwork. Further, to allow for sample- and label-efficient policy learning from small datasets, BTIL employs a Bayesian perspective and is capable of learning from semi-supervised demonstrations. We demonstrate and benchmark the performance of BTIL on synthetic multi-agent tasks as well as a novel dataset of human-agent teamwork. Our experiments show that BTIL can successfully learn team policies from demonstrations despite the influence of team members' (time-varying and potentially misaligned) mental states on their behavior.  ( 2 min )
    Explainable multi-class anomaly detection on functional data. (arXiv:2205.02935v1 [stat.ML])
    In this paper we describe an approach for anomaly detection, and its explainability, in multivariate functional data. The anomaly detection procedure consists of transforming the series into a vector of features and using the Isolation Forest algorithm. The explainability procedure is based on the computation of SHAP coefficients and on the use of a supervised decision tree. We apply the approach to simulated data to measure the performance of our method and to real data coming from industry.  ( 2 min )
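    A compact sketch of the described pipeline using scikit-learn's IsolationForest and the shap package (the feature extraction from the functional series is elided; the random matrix below is only a stand-in for per-series feature vectors, and TreeExplainer support for isolation forests is assumed as documented by shap):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
import shap

X = np.random.randn(500, 8)             # stand-in for per-series features
forest = IsolationForest(random_state=0).fit(X)
labels = forest.predict(X)              # +1 inlier, -1 anomaly

explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(X)  # per-feature score contributions
```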
    Understanding Urban Water Consumption using Remotely Sensed Data. (arXiv:2205.02932v1 [cs.CV])
    Urban metabolism is an active field of research that deals with the estimation of emissions and resource consumption from urban regions. The analysis could be carried out through a manual survey or by the implementation of elegant machine learning algorithms. In this exploratory work, we estimate the water consumption of the buildings in a region captured by satellite imagery. To this end, we break our analysis into three parts: i) identification of building pixels, given a satellite image, followed by ii) identification of the building type (residential/non-residential) from the building pixels, and finally iii) using the building pixels along with their type to estimate the water consumption, using the average per-unit-area consumption for different building types as obtained from municipal surveys.  ( 2 min )
    Multi-confound regression adversarial network for deep learning-based diagnosis on highly heterogenous clinical data. (arXiv:2205.02885v1 [cs.LG])
    Automated disease detection in medical images using deep learning holds promise to improve the diagnostic ability of radiologists, but routinely collected clinical data frequently contain technical and demographic confounding factors that differ between hospitals, negatively affecting the robustness of diagnostic deep learning models. Thus, there is a critical need for deep learning models that can train on imbalanced datasets without overfitting to site-specific confounding factors. In this work, we developed a novel deep learning architecture, MUCRAN (Multi-Confound Regression Adversarial Network), to train a deep learning model on highly heterogeneous clinical data while regressing out demographic and technical confounding factors. We trained MUCRAN using 16,821 clinical T1 axial brain MRIs collected from Massachusetts General Hospital before 2019 and tested it using post-2019 data to distinguish Alzheimer's disease (AD) patients, identified using both prescriptions of AD drugs and ICD codes, from a non-medicated control group. In external validation tests using MRI data from other hospitals, the model showed robust performance of over 90% accuracy on newly collected data. This work shows the feasibility of deep learning-based diagnosis in real-world clinical data.  ( 2 min )
    Over-The-Air Federated Learning under Byzantine Attacks. (arXiv:2205.02949v1 [cs.LG])
    Federated learning (FL) is a promising solution to enable many AI applications, where sensitive datasets from distributed clients are needed for collaboratively training a global model. FL allows the clients to participate in the training phase, governed by a central server, without sharing their local data. One of the main challenges of FL is the communication overhead, where the model updates of the participating clients are sent to the central server at each global training round. Over-the-air computation (AirComp) has been recently proposed to alleviate the communication bottleneck where the model updates are sent simultaneously over the multiple-access channel. However, simple averaging of the model updates via AirComp makes the learning process vulnerable to random or intended modifications of the local model updates of some Byzantine clients. In this paper, we propose a transmission and aggregation framework to reduce the effect of such attacks while preserving the benefits of AirComp for FL. For the proposed robust approach, the central server divides the participating clients randomly into groups and allocates a transmission time slot for each group. The updates of the different groups are then aggregated using a robust aggregation technique. We extend our approach to handle the case of non-i.i.d. local data, where a resampling step is added before robust aggregation. We analyze the convergence of the proposed approach for both cases of i.i.d. and non-i.i.d. data and demonstrate that the proposed algorithm converges at a linear rate to a neighborhood of the optimal solution. Experiments on real datasets are provided to confirm the robustness of the proposed approach.  ( 2 min )
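    A minimal sketch of the grouping-plus-robust-aggregation idea described above, using a coordinate-wise median over group averages as one concrete robust aggregation choice (the paper's exact rule and the over-the-air channel model are not reproduced here):

```python
import numpy as np

def grouped_robust_aggregate(updates, n_groups, rng):
    """Split clients into random groups, average each group's (noisy)
    updates, then combine group averages with a coordinate-wise median
    to blunt the influence of Byzantine clients.
    updates: (n_clients, dim) array of model updates."""
    perm = rng.permutation(len(updates))
    groups = np.array_split(perm, n_groups)
    group_means = np.stack([updates[g].mean(axis=0) for g in groups])
    return np.median(group_means, axis=0)
```

    Random grouping ensures that a fixed set of Byzantine clients cannot all land in the same time slot, which is what makes the median over group averages effective.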
    Quantification of Robotic Surgeries with Vision-Based Deep Learning. (arXiv:2205.03028v1 [cs.RO])
    Surgery is a high-stakes domain where surgeons must navigate critical anatomical structures and actively avoid potential complications while achieving the main task at hand. Such surgical activity has been shown to affect long-term patient outcomes. To better understand this relationship, whose mechanics remain unknown for the majority of surgical procedures, we hypothesize that the core elements of surgery must first be quantified in a reliable, objective, and scalable manner. We believe this is a prerequisite for the provision of surgical feedback and modulation of surgeon performance in pursuit of improved patient outcomes. To holistically quantify surgeries, we propose a unified deep learning framework, entitled Roboformer, which operates exclusively on videos recorded during surgery to independently achieve multiple tasks: surgical phase recognition (the what of surgery), gesture classification and skills assessment (the how of surgery). We validated our framework on four video-based datasets of two commonly-encountered types of steps (dissection and suturing) within minimally-invasive robotic surgeries. We demonstrated that our framework can generalize well to unseen videos, surgeons, medical centres, and surgical procedures. We also found that our framework, which naturally lends itself to explainable findings, identified relevant information when achieving a particular task. These findings are likely to instill surgeons with more confidence in our framework's behaviour, increasing the likelihood of clinical adoption, and thus paving the way for more targeted surgical feedback.  ( 2 min )
    Low Dimensional Invariant Embeddings for Universal Geometric Learning. (arXiv:2205.02956v1 [cs.LG])
    This paper studies separating invariants: mappings on $d$-dimensional semi-algebraic subsets of $D$-dimensional Euclidean domains which are invariant to semi-algebraic group actions and separate orbits. The motivation for this study comes from the usefulness of separating invariants in proving universality of equivariant neural network architectures. We observe that in several cases the cardinality of separating invariants proposed in the machine learning literature is much larger than the ambient dimension $D$. As a result, the theoretical universal constructions based on these separating invariants are unrealistically large. Our goal in this paper is to resolve this issue. We show that when a continuous family of semi-algebraic separating invariants is available, separation can be obtained by randomly selecting $2d+1$ of these invariants. We apply this methodology to obtain an efficient scheme for computing separating invariants for several classical group actions which have been studied in the invariant learning literature. Examples include matrix multiplication actions on point clouds by permutations, rotations, and various other linear groups.  ( 2 min )
    Large Scale Transfer Learning for Differentially Private Image Classification. (arXiv:2205.02973v1 [cs.LG])
    Differential Privacy (DP) provides a formal framework for training machine learning models with individual example level privacy. Training models with DP protects the model against leakage of sensitive data in a potentially adversarial setting. In the field of deep learning, Differentially Private Stochastic Gradient Descent (DP-SGD) has emerged as a popular private training algorithm. Private training using DP-SGD protects against leakage by injecting noise into individual example gradients, such that the trained model weights become nearly independent of the use of any particular training example. While this result is quite appealing, the computational cost of training large-scale models with DP-SGD is substantially higher than that of non-private training. This is further exacerbated by the fact that increasing the number of parameters leads to larger degradation in utility with DP. In this work, we zoom in on the ImageNet dataset and demonstrate that, similar to the non-private case, pre-training over-parameterized models on a large public dataset can lead to substantial gains when the model is fine-tuned privately. Moreover, by systematically comparing private and non-private models across a range of huge batch sizes, we find that, similar to the non-private setting, the choice of optimizer can further improve performance substantially with DP. By switching from DP-SGD to DP-LAMB we saw an improvement of up to 20 percentage points (absolute). Finally, we show that fine-tuning just the last layer for a \emph{single step} in the full batch setting leads to SOTA results of 81.7$\%$ under a wide privacy budget range of $\epsilon \in [4, 10]$ and $\delta$ = $10^{-6}$ while minimizing the computational overhead substantially.  ( 2 min )
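    The DP-SGD mechanism referenced above reduces to per-example gradient clipping plus calibrated Gaussian noise; a schematic numpy sketch (not the paper's implementation, which would use a DP library and track the privacy budget):

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm, noise_mult, lr, params, rng):
    """One DP-SGD update: clip each example's gradient to clip_norm in L2,
    average, then add Gaussian noise with std noise_mult * clip_norm / B."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    g = clipped.mean(axis=0)
    g += rng.normal(0.0, noise_mult * clip_norm / len(per_example_grads),
                    size=g.shape)
    return params - lr * g
```

    The clipping bounds each example's influence on the update, which is what allows the added noise to be translated into a formal $(\epsilon, \delta)$ guarantee.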
    Evaluating Context for Deep Object Detectors. (arXiv:2205.02887v1 [cs.CV])
    Which object detector is suitable for your context-sensitive task? Deep object detectors exploit scene context for recognition differently. In this paper, we group object detectors into three categories in terms of context use: no context by cropping the input (RCNN), partial context by cropping the feature map (two-stage methods), and full context without any cropping (single-stage methods). We systematically evaluate the effect of context for each deep detector category. We create a fully controlled dataset with varying context and use it to investigate context use by deep detectors. We also evaluate the effect of gradually removing the background context and the foreground object on MS COCO. We demonstrate that single-stage and two-stage object detectors can and will use context by virtue of their large receptive fields. Thus, choosing the best object detector may depend on the application context.  ( 2 min )
    Learn-to-Race Challenge 2022: Benchmarking Safe Learning and Cross-domain Generalisation in Autonomous Racing. (arXiv:2205.02953v1 [cs.RO])
    We present the results of our autonomous racing virtual challenge, based on the newly-released Learn-to-Race (L2R) simulation framework, which seeks to encourage interdisciplinary research in autonomous driving and to help advance the state of the art on a realistic benchmark. Analogous to racing being used to test cutting-edge vehicles, we envision autonomous racing to serve as a particularly challenging proving ground for autonomous agents as: (i) they need to make sub-second, safety-critical decisions in a complex, fast-changing environment; and (ii) both perception and control must be robust to distribution shifts, novel road features, and unseen obstacles. Thus, the main goal of the challenge is to evaluate the joint safety, performance, and generalisation capabilities of reinforcement learning agents on multi-modal perception, through a two-stage process. In the first stage of the challenge, we evaluate an autonomous agent's ability to drive as fast as possible, while adhering to safety constraints. In the second stage, we additionally require the agent to adapt to an unseen racetrack through safe exploration. In this paper, we describe the new L2R Task 2.0 benchmark, with refined metrics and baseline approaches. We also provide an overview of deployment, evaluation, and rankings for the inaugural instance of the L2R Autonomous Racing Virtual Challenge (supported by Carnegie Mellon University, Arrival Ltd., AICrowd, Amazon Web Services, and Honda Research), which officially used the new L2R Task 2.0 benchmark and received over 20,100 views, 437 active participants, 46 teams, and 733 model submissions -- from 88 unique institutions, in 28 different countries. Finally, we release leaderboard results from the challenge and provide description of the two top-ranking approaches in cross-domain model transfer, across multiple sensor configurations and simulated races.  ( 3 min )
    Neural Jacobian Fields: Learning Intrinsic Mappings of Arbitrary Meshes. (arXiv:2205.02904v1 [cs.GR])
    This paper introduces a framework designed to accurately predict piecewise linear mappings of arbitrary meshes via a neural network, enabling training and evaluating over heterogeneous collections of meshes that do not share a triangulation, as well as producing highly detail-preserving maps whose accuracy exceeds current state of the art. The framework is based on reducing the neural aspect to a prediction of a matrix for a single given point, conditioned on a global shape descriptor. The field of matrices is then projected onto the tangent bundle of the given mesh, and used as candidate jacobians for the predicted map. The map is computed by a standard Poisson solve, implemented as a differentiable layer with cached pre-factorization for efficient training. This construction is agnostic to the triangulation of the input, thereby enabling applications on datasets with varying triangulations. At the same time, by operating in the intrinsic gradient domain of each individual mesh, it allows the framework to predict highly-accurate mappings. We validate these properties by conducting experiments over a broad range of scenarios, from semantic ones such as morphing, registration, and deformation transfer, to optimization-based ones, such as emulating elastic deformations and contact correction, as well as being the first work, to our knowledge, to tackle the task of learning to compute UV parameterizations of arbitrary meshes. The results exhibit the high accuracy of the method as well as its versatility, as it is readily applied to the above scenarios without any changes to the framework.  ( 2 min )
    Putting Density Functional Theory to the Test in Machine-Learning-Accelerated Materials Discovery. (arXiv:2205.02967v1 [cond-mat.mtrl-sci])
    Accelerated discovery with machine learning (ML) has begun to provide the advances in efficiency needed to overcome the combinatorial challenge of computational materials design. Nevertheless, ML-accelerated discovery both inherits the biases of training data derived from density functional theory (DFT) and leads to many attempted calculations that are doomed to fail. Many compelling functional materials and catalytic processes involve strained chemical bonds, open-shell radicals and diradicals, or metal-organic bonds to open-shell transition-metal centers. Although promising targets, these materials present unique challenges for electronic structure methods and combinatorial challenges for their discovery. In this Perspective, we describe the advances needed in accuracy, efficiency, and approach beyond what is typical in conventional DFT-based ML workflows. These challenges have begun to be addressed through ML models trained to predict the results of multiple methods or the differences between them, enabling quantitative sensitivity analysis. For DFT to be trusted for a given data point in a high-throughput screen, it must pass a series of tests. ML models that predict the likelihood of calculation success and detect the presence of strong correlation will enable rapid diagnoses and adaptation strategies. These "decision engines" represent the first steps toward autonomous workflows that avoid the need for expert determination of the robustness of DFT-based materials discoveries.  ( 2 min )
    GreenDB: Toward a Product-by-Product Sustainability Database. (arXiv:2205.02908v1 [cs.LG])
    The production, shipping, usage, and disposal of consumer goods have a substantial impact on greenhouse gas emissions and the depletion of resources. Modern retail platforms rely heavily on Machine Learning (ML) for their search and recommender systems. Thus, ML can potentially support efforts towards more sustainable consumption patterns, for example, by accounting for sustainability aspects in product search or recommendations. However, leveraging ML potential for reaching sustainability goals requires data on sustainability. Unfortunately, no open and publicly available database integrates sustainability information on a product-by-product basis. In this work, we present the GreenDB, which fills this gap. Based on search logs of millions of users, we prioritize which products users care about most. The GreenDB schema extends the well-known schema.org Product definition and can be readily integrated into existing product catalogs to improve sustainability information available for search and recommendation experiences. We present our proof of concept implementation of a scraping system that creates the GreenDB dataset.  ( 2 min )
    Lagrangian PINNs: A causality-conforming solution to failure modes of physics-informed neural networks. (arXiv:2205.02902v1 [cs.LG])
    Physics-informed neural networks (PINNs) leverage neural-networks to find the solutions of partial differential equation (PDE)-constrained optimization problems with initial conditions and boundary conditions as soft constraints. These soft constraints are often considered to be the sources of the complexity in the training phase of PINNs. Here, we demonstrate that the challenge of training (i) persists even when the boundary conditions are strictly enforced, and (ii) is closely related to the Kolmogorov n-width associated with problems demonstrating transport, convection, traveling waves, or moving fronts. Given this realization, we describe the mechanism underlying the training schemes such as those used in eXtended PINNs (XPINN), curriculum regularization, and sequence-to-sequence learning. For an important category of PDEs, i.e., governed by non-linear convection-diffusion equation, we propose reformulating PINNs on a Lagrangian frame of reference, i.e., LPINNs, as a PDE-informed solution. A parallel architecture with two branches is proposed. One branch solves for the state variables on the characteristics, and the second branch solves for the low-dimensional characteristics curves. The proposed architecture conforms to the causality innate to the convection, and leverages the direction of travel of the information in the domain. Finally, we demonstrate that the loss landscapes of LPINNs are less sensitive to the so-called "complexity" of the problems, compared to those in the traditional PINNs in the Eulerian framework.  ( 2 min )
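    To make the frame change concrete, consider the simplest case in our own notation (constant-speed linear convection, a special case of the paper's non-linear convection-diffusion setting): the Eulerian PDE becomes trivial dynamics along characteristics,

$$ \partial_t u + c\,\partial_x u = 0 \quad\Longleftrightarrow\quad \frac{dX}{dt} = c, \qquad \frac{d}{dt}\,u\bigl(X(t),t\bigr) = 0, $$

    so one branch of the proposed architecture can learn the characteristic curves $X(t)$ while the other learns the state along them, which is how the construction conforms to the causality of the transport.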
    Understanding Transfer Learning for Chest Radiograph Clinical Report Generation with Modified Transformer Architectures. (arXiv:2205.02841v1 [eess.IV])
    The image captioning task is increasingly prevalent in artificial intelligence applications for medicine. One important application is clinical report generation from chest radiographs. The clinical writing of unstructured reports is time-consuming and error-prone. An automated system would improve standardization, reduce errors and time consumption, and improve medical accessibility. In this paper we demonstrate the importance of domain-specific pre-training and propose a modified transformer architecture for the medical image captioning task. To accomplish this, we train a series of modified transformers to generate clinical reports from chest radiograph image input. These modified transformers include: a meshed-memory augmented transformer architecture with a visual extractor using ImageNet pre-trained weights, a meshed-memory augmented transformer architecture with a visual extractor using CheXpert pre-trained weights, and a meshed-memory augmented transformer whose encoder is passed the concatenated embeddings using both ImageNet pre-trained weights and CheXpert pre-trained weights. We use BLEU(1-4), ROUGE-L, CIDEr, and the clinical CheXbert F1 scores to validate our models and demonstrate competitive scores with state-of-the-art models. We provide evidence that ImageNet pre-training is ill-suited for the medical image captioning task, especially for less frequent conditions (e.g., enlarged cardiomediastinum, lung lesion, pneumothorax). Furthermore, we demonstrate that the double feature model improves performance for specific medical conditions (edema, consolidation, pneumothorax, support devices) and the overall CheXbert F1 score, and should be further developed in future work. Such a double feature model, including both ImageNet pre-training as well as domain-specific pre-training, could be used in a wide range of image captioning models in medicine.  ( 2 min )
    Exploiting Ligand Additivity for Transferable Machine Learning of Multireference Character Across Known Transition Metal Complex Ligands. (arXiv:2205.02879v1 [cond-mat.mtrl-sci])
    Accurate virtual high-throughput screening (VHTS) of transition metal complexes (TMCs) remains challenging due to the possibility of high multi-reference (MR) character that complicates property evaluation. We compute MR diagnostics for over 5,000 ligands present in previously synthesized transition metal complexes in the Cambridge Structural Database (CSD). To accomplish this task, we introduce an iterative approach for consistent ligand charge assignment for ligands in the CSD. Across this set, we observe that MR character correlates linearly with the inverse value of the averaged bond order over all bonds in the molecule. We then demonstrate that ligand additivity of MR character holds in TMCs, which suggests that the TMC MR character can be inferred from the sum of the MR character of the ligands. Encouraged by this observation, we leverage ligand additivity and develop a ligand-derived machine learning representation to train neural networks to predict the MR character of TMCs from properties of the constituent ligands. This approach yields models with excellent performance and superior transferability to unseen ligand chemistry and compositions.  ( 2 min )
    Generative Adversarial Network Based Synthetic Learning and a Novel Domain Relevant Loss Term for Spine Radiographs. (arXiv:2205.02843v1 [eess.IV])
    Problem: There is a lack of big data for the training of deep learning models in medicine, characterized by the time cost of data collection and privacy concerns. Generative adversarial networks (GANs) offer both the potential to generate new data, as well as to use this newly generated data, without inclusion of patients' real data, for downstream applications. Approach: A series of GANs were trained and applied for a downstream computer vision spine radiograph abnormality classification task. Separate classifiers were trained with either access or no access to the original imaging. Trained GANs included: a conditional StyleGAN2 with adaptive discriminator augmentation; a conditional StyleGAN2 with adaptive discriminator augmentation that generates spine radiographs conditional on lesion type; and SpineGAN, a StyleGAN2 with adaptive discriminator augmentation conditional on abnormality that uses a novel clinical loss term for the generator. Finally, a differential-privacy-imposed StyleGAN2 with adaptive discriminator augmentation conditional on abnormality was trained, and an ablation study was performed on its differential privacy impositions. Key Results: We accomplish GAN generation of synthetic spine radiographs, without meaningful input, for the first time from a literature review. We further demonstrate the success of synthetic learning for the spine domain with a downstream clinical classification task (AUC of 0.830 using synthetic data compared to AUC of 0.886 using the real data). Importantly, the introduction of a new clinical loss term for the generator was found to increase generation recall as well as accelerate model training. Lastly, we demonstrate that, in a limited-size medical dataset, differential privacy impositions severely impede GAN training, finding that this is specifically due to the requirement for gradient perturbation with noise.  ( 2 min )
    AdaTriplet: Adaptive Gradient Triplet Loss with Automatic Margin Learning for Forensic Medical Image Matching. (arXiv:2205.02849v1 [eess.IV])
    This paper tackles the challenge of forensic medical image matching (FMIM) using deep neural networks (DNNs). FMIM is a particular case of content-based image retrieval (CBIR). The main challenge in FMIM compared to the general case of CBIR, is that the subject to whom a query image belongs may be affected by aging and progressive degenerative disorders, making it difficult to match data on a subject level. CBIR with DNNs is generally solved by minimizing a ranking loss, such as Triplet loss (TL), computed on image representations extracted by a DNN from the original data. TL, in particular, operates on triplets: anchor, positive (similar to anchor) and negative (dissimilar to anchor). Although TL has been shown to perform well in many CBIR tasks, it still has limitations, which we identify and analyze in this work. In this paper, we introduce (i) the AdaTriplet loss -- an extension of TL whose gradients adapt to different difficulty levels of negative samples, and (ii) the AutoMargin method -- a technique to adjust hyperparameters of margin-based losses such as TL and our proposed loss dynamically. Our results are evaluated on two large-scale benchmarks for FMIM based on the Osteoarthritis Initiative and Chest X-ray-14 datasets. The codes allowing replication of this study have been made publicly available at \url{https://github.com/Oulu-IMEDS/AdaTriplet}.  ( 2 min )
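    For orientation, the vanilla Triplet loss that AdaTriplet extends is shown below; AdaTriplet adapts the gradients to negative-sample difficulty and AutoMargin tunes the margin dynamically, neither of which this standard sketch reproduces:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard Triplet loss on L2-normalized embeddings: pull the
    anchor-positive distance below the anchor-negative distance by
    at least `margin`."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    n = F.normalize(negative, dim=1)
    d_ap = (a - p).pow(2).sum(dim=1)
    d_an = (a - n).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()
```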
  • Open

    On boundary conditions parametrized by analytic functions. (arXiv:2205.03185v1 [cs.LG])
    Computer algebra can answer various questions about partial differential equations using symbolic algorithms. However, the inclusion of data into equations is rare in computer algebra. Therefore, recently, computer algebra models have been combined with Gaussian processes, a regression model in machine learning, to describe the behavior of certain differential equations under data. While it was possible to describe polynomial boundary conditions in this context, we extend these models to analytic boundary conditions. Additionally, we describe the necessary algorithms for Gr\"obner and Janet bases of Weyl algebras with certain analytic coefficients. Using these algorithms, we provide examples of divergence-free flow in domains bounded by analytic functions and adapted to observations.  ( 2 min )
    Tensor Principal Component Analysis in High Dimensional CP Models. (arXiv:2108.04428v3 [stat.ML] UPDATED)
    The CP decomposition for high dimensional non-orthogonal spiked tensors is an important problem with broad applications across many disciplines. However, previous works with theoretical guarantee typically assume restrictive incoherence conditions on the basis vectors for the CP components. In this paper, we propose new computationally efficient composite PCA and concurrent orthogonalization algorithms for tensor CP decomposition with theoretical guarantees under mild incoherence conditions. The composite PCA applies the principal component or singular value decompositions twice, first to a matrix unfolding of the tensor data to obtain singular vectors and then to the matrix folding of the singular vectors obtained in the first step. It can be used as an initialization for any iterative optimization schemes for the tensor CP decomposition. The concurrent orthogonalization algorithm iteratively estimates the basis vector in each mode of the tensor by simultaneously applying projections to the orthogonal complements of the spaces generated by other CP components in other modes. It is designed to improve the alternating least squares estimator and other forms of the high order orthogonal iteration for tensors with low or moderately high CP ranks, and it is guaranteed to converge rapidly when the error of any given initial estimator is bounded by a small constant. Our theoretical investigation provides estimation accuracy and convergence rates for the two proposed algorithms. Both proposed algorithms are applicable to deterministic tensor, its noisy version, and the order-$2K$ covariance tensor of order-$K$ tensor data in a factor model with uncorrelated factors. Our implementations on synthetic data demonstrate significant practical superiority of our approach over existing methods.  ( 2 min )
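    A hedged reading of the two-step composite PCA for a single CP component of an order-3 tensor (our own minimal numpy sketch, not the authors' code; it omits the concurrent-orthogonalization refinement and the higher-rank case):

```python
import numpy as np

def composite_pca_rank1(T):
    """Composite PCA sketch: SVD the mode-1 unfolding, then SVD the folded
    top right-singular vector to split it into mode-2 and mode-3 factors."""
    n1, n2, n3 = T.shape
    U, s, Vt = np.linalg.svd(T.reshape(n1, n2 * n3), full_matrices=False)
    a = U[:, 0]                      # mode-1 factor
    M = Vt[0].reshape(n2, n3)        # fold the top right-singular vector
    U2, s2, Vt2 = np.linalg.svd(M)
    return a, U2[:, 0], Vt2[0]       # CP factors, up to the scale s[0] * s2[0]
```

    For a noiseless rank-1 tensor $a \otimes b \otimes c$, the unfolding equals $a\,\mathrm{vec}(b c^\top)^\top$, so the two SVDs recover $a$, $b$, and $c$ exactly; the paper's analysis covers the noisy, higher-rank, non-orthogonal case.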
    Gaussian Processes for Missing Value Imputation. (arXiv:2204.04648v2 [stat.ML] UPDATED)
    Missing values are common in many real-life datasets. However, most current machine learning methods cannot handle missing values, which means that they should be imputed beforehand. Gaussian Processes (GPs) are non-parametric models with accurate uncertainty estimates that, combined with sparse approximations and stochastic variational inference, scale to large data sets. Sparse GPs can be used to compute a predictive distribution for missing data. Here, we present a hierarchical composition of sparse GPs that is used to predict missing values at each dimension using all the variables from the other dimensions. We call the approach missing GP (MGP). MGP can be trained simultaneously to impute all observed missing values. Specifically, it outputs a predictive distribution for each missing value that is then used in the imputation of other missing values. We evaluate MGP on one private clinical data set and four UCI datasets with different percentages of missing values. We compare the performance of MGP with other state-of-the-art methods for imputing missing values, including variants based on sparse GPs and deep GPs. The results obtained show a significantly better performance of MGP.  ( 2 min )
    Bayesian Sample Size Prediction for Online Activity. (arXiv:2111.12157v2 [stat.ML] UPDATED)
    In many contexts it is useful to predict the number of individuals in some population who will initiate a particular activity during a given period. For example, the number of users who will install a software update, the number of customers who will use a new feature on a website or who will participate in an A/B test. In practical settings, there is heterogeneity amongst individuals with regard to the distribution of time until they will initiate. For these reasons it is inappropriate to assume that the number of new individuals observed on successive days will be identically distributed. Given observations on the number of unique users participating in an initial period, we present a simple but novel Bayesian method for predicting the number of additional individuals who will participate during a subsequent period. We illustrate the performance of the method in predicting sample size in online experimentation.  ( 2 min )
    DADApy: Distance-based Analysis of DAta-manifolds in Python. (arXiv:2205.03373v1 [cs.LG])
    DADApy is a python software package for analysing and characterising high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in toy cases and in a real-world application. The package is freely available under the open-source Apache 2.0 license and can be downloaded from the Github page https://github.com/sissa-data-science/DADApy.  ( 2 min )
    Designing Robust Biotechnological Processes Regarding Variabilities using Multi-Objective Optimization Applied to a Biopharmaceutical Seed Train Design. (arXiv:2205.03261v1 [cs.LG])
    Development and optimization of biopharmaceutical production processes with cell cultures is cost- and time-consuming and often performed rather empirically. Efficient optimization of multiple objectives such as process time, viable cell density, number of operating steps and cultivation scales, required medium, amount of product, and product quality represents a promising approach. This contribution presents a workflow which couples uncertainty-based upstream simulation and Bayesian optimization using Gaussian processes. Its application is demonstrated in a simulation case study for a relevant industrial task in process development, the design of a robust cell culture expansion process (seed train), meaning that despite uncertainties and variabilities concerning cell growth, low variations of viable cell density during the seed train are obtained. Compared to a non-optimized reference seed train, the optimized process showed much lower deviation rates regarding viable cell densities (<~10% instead of 41.7%) using 5 or 4 shake flask scales, and the seed train duration could be reduced by 56 h from 576 h to 520 h. Overall, it is shown that applying Bayesian optimization allows for the optimization of a multi-objective function over several input variables, under a considerable number of constraints, at low computational effort. This approach has the potential to be used as a decision tool, e.g., for the choice of an optimal and robust seed train design, or for further optimization tasks within process development.  ( 2 min )
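    The Bayesian optimization step in such a workflow can be pictured with a generic Gaussian-process surrogate and an expected-improvement acquisition; a minimal sketch with a toy objective (all specifics here are illustrative assumptions, not the paper's process model):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X_cand, y_best):
    """Expected improvement for minimization under a GP surrogate."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (5, 2))                 # initial evaluated designs
y = ((X - 0.3) ** 2).sum(axis=1)              # toy objective to minimize
gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
cand = rng.uniform(0, 1, (200, 2))
x_next = cand[np.argmax(expected_improvement(gp, cand, y.min()))]
```

    In the actual workflow, each objective evaluation is an uncertainty-based upstream simulation, and constraints can be handled by penalizing infeasible candidates before the argmax.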
    Learning Optimal Conformal Classifiers. (arXiv:2110.09192v3 [cs.LG] UPDATED)
    Modern deep learning based classifiers show very high accuracy on test data, but this does not provide sufficient guarantees for safe deployment, especially in high-stakes AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's predictions, e.g., its probability estimates, to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training the model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. Compared to standard training, ConfTr reduces the average confidence set size (inefficiency) of state-of-the-art CP methods applied after training. Moreover, it allows us to "shape" the confidence sets predicted at test time, which is difficult for standard CP. In experiments with several datasets, we show that ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.  ( 2 min )
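    ConfTr differentiates through the conformalization step; as background, here is a minimal sketch of the standard post-hoc split conformal procedure it builds on, using the simple "one minus true-class probability" score (function and variable names are ours).

    ```python
    # Split conformal prediction sets from a trained classifier's probabilities.
    # probs_cal/probs_test: (n, K) probability arrays; y_cal: integer labels.
    import numpy as np

    def conformal_sets(probs_cal, y_cal, probs_test, alpha=0.1):
        n = len(y_cal)
        # nonconformity score: one minus the probability of the true class
        scores = 1.0 - probs_cal[np.arange(n), y_cal]
        # finite-sample corrected (1 - alpha) quantile of calibration scores
        k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
        qhat = np.sort(scores)[k]
        # prediction set: every class whose score falls below the threshold;
        # it contains the true class with probability >= 1 - alpha
        return [np.where(1.0 - p <= qhat)[0] for p in probs_test]
    ```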
    Low-rank Tensor Learning with Nonconvex Overlapped Nuclear Norm Regularization. (arXiv:2205.03059v1 [cs.LG])
    Nonconvex regularization has been popularly used in low-rank matrix learning. However, extending it for low-rank tensor learning is still computationally expensive. To address this problem, we develop an efficient solver for use with a nonconvex extension of the overlapped nuclear norm regularizer. Based on the proximal average algorithm, the proposed algorithm can avoid expensive tensor folding/unfolding operations. A special "sparse plus low-rank" structure is maintained throughout the iterations, and allows fast computation of the individual proximal steps. Empirical convergence is further improved with the use of adaptive momentum. We provide convergence guarantees to critical points on smooth losses and also on objectives satisfying the Kurdyka-{\L}ojasiewicz condition. While the optimization problem is nonconvex and nonsmooth, we show that its critical points still have good statistical performance on the tensor completion problem. Experiments on various synthetic and real-world data sets show that the proposed algorithm is efficient in both time and space and more accurate than the existing state-of-the-art.  ( 2 min )
    Probabilistic learning constrained by realizations using a weak formulation of Fourier transform of probability measures. (arXiv:2205.03078v1 [stat.ML])
    This paper deals with taking a given set of realizations into account as constraints in the Kullback-Leibler minimum principle, which is used as a probabilistic learning algorithm. This permits the effective integration of data into predictive models. We consider the probabilistic learning of a random vector that is made up of either a quantity of interest (QoI; unsupervised case) or the couple of the QoI and a control parameter (supervised case). A training set of independent realizations of this random vector is assumed to be given and to be generated with a prior probability measure that is unknown. A target set of realizations of the QoI is available for the two considered cases. The framework is that of non-Gaussian problems in high dimension. A functional approach is developed on the basis of a weak formulation of the Fourier transform of probability measures (characteristic functions). The construction makes it possible to take the target set of realizations of the QoI into account in the Kullback-Leibler minimum principle. The proposed approach allows for estimating the posterior probability measure of the QoI (unsupervised case) or the posterior joint probability measure of the QoI with the control parameter (supervised case). The existence and the uniqueness of the posterior probability measure are analyzed for the two cases. The numerical aspects are detailed in order to facilitate the implementation of the proposed method. The presented application in high dimension demonstrates the efficiency and the robustness of the proposed algorithm.  ( 2 min )
    Benchmarking Econometric and Machine Learning Methodologies in Nowcasting. (arXiv:2205.03318v1 [stat.ML])
    Nowcasting can play a key role in giving policymakers timelier insight into data published with a significant time lag, such as final GDP figures. Currently, there is a plethora of methodologies and approaches for practitioners to choose from. However, a comprehensive comparison of these disparate approaches in terms of predictive performance and characteristics has been lacking. This paper addresses that deficiency by examining the performance of 12 different methodologies in nowcasting US quarterly GDP growth, including all the methods most commonly employed in nowcasting, as well as some of the most popular traditional machine learning approaches. Performance was assessed on three different tumultuous periods in US economic history: the early 1980s recession, the 2008 financial crisis, and the COVID crisis. The two best performing methodologies in the analysis were long short-term memory artificial neural networks (LSTM) and Bayesian vector autoregression (BVAR). To facilitate further application and testing of each of the examined methodologies, an open-source repository containing boilerplate code that can be applied to different datasets is published alongside the paper, available at: github.com/dhopp1/nowcasting_benchmark.  ( 2 min )
    Scalable computation of prediction intervals for neural networks via matrix sketching. (arXiv:2205.03194v1 [stat.ML])
    Accounting for the uncertainty in the predictions of modern neural networks is a challenging and important task in many domains. Existing algorithms for uncertainty estimation require modifying the model architecture and training procedure (e.g., Bayesian neural networks) or dramatically increase the computational cost of predictions such as approaches based on ensembling. This work proposes a new algorithm that can be applied to a given trained neural network and produces approximate prediction intervals. The method is based on the classical delta method in statistics but achieves computational efficiency by using matrix sketching to approximate the Jacobian matrix. The resulting algorithm is competitive with state-of-the-art approaches for constructing predictive intervals on various regression datasets from the UCI repository.  ( 2 min )
    Differentially Private Generalized Linear Models Revisited. (arXiv:2205.03014v1 [cs.LG])
    We study the problem of $(\epsilon,\delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^* \Vert^2}{(n\epsilon)^{2/3}},\frac{\sqrt{d}\Vert w^*\Vert^2}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\Vert w^\ast\Vert$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + {\min\left\{\frac{\Vert w^*\Vert^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert}{n\epsilon}\right\}}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT20]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert}{\sqrt{n\epsilon}},\frac{\sqrt{\text{rank}}\Vert w^*\Vert}{n\epsilon}\right\}\right)$, where $\text{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\Vert w^*\Vert$.  ( 2 min )
    Active Offline Policy Selection. (arXiv:2106.10251v4 [cs.LG] UPDATED)
    This paper addresses the problem of policy selection in domains with abundant logged data, but with a restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between the evaluation by OPE and the full online evaluation. Yet, large amounts of online interactions are often not possible in practice. To overcome this problem, we introduce active offline policy selection - a novel sequential decision approach that combines logged data with online interaction to identify the best policy. We use OPE estimates to warm start the online evaluation. Then, in order to utilize the limited environment interactions wisely we decide which policy to evaluate next based on a Bayesian optimization method with a kernel that represents policy similarity. We use multiple benchmarks, including real-world robotics, with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation.  ( 2 min )
    Solar: $L_0$ solution path averaging for fast and accurate variable selection in high-dimensional data. (arXiv:2007.15707v3 [stat.ML] UPDATED)
    We propose a new variable selection algorithm, subsample-ordered least-angle regression (solar), and its coordinate descent generalization, solar-cd. Solar re-constructs lasso paths using the $L_0$ norm and averages the resulting solution paths across subsamples. Path averaging retains the ranking information of the informative variables while averaging out sensitivity to high dimensionality, improving variable selection stability, efficiency, and accuracy. We prove that: (i) with a high probability, path averaging perfectly separates informative variables from redundant variables on the average $L_0$ path; (ii) solar variable selection is consistent and accurate; and (iii) the probability that solar omits weak signals is controllable for finite sample size. We also demonstrate that: (i) solar yields, with less than $1/3$ of the lasso computation load, substantial improvements over lasso in terms of the sparsity (64-84\% reduction in redundant variable selection) and accuracy of variable selection; (ii) compared with the lasso safe/strong rule and variable screening, solar largely avoids selection of redundant variables and rejection of informative variables in the presence of complicated dependence structures; (iii) the sparsity and stability of solar conserves residual degrees of freedom for data-splitting hypothesis testing, improving the accuracy of post-selection inference on weak signals with limited $n$; (iv) replacing lasso with solar in bootstrap selection (e.g., bolasso or stability selection) produces a multi-layer variable ranking scheme that improves selection sparsity and ranking accuracy with the computation load of only one lasso realization; and (v) given the computation resources, solar bootstrap selection is substantially faster (98\% lower computation time) than the theoretical maximum speedup for parallelized bootstrap lasso (confirmed by Amdahl's law).  ( 2 min )
    R-GCN: The R Could Stand for Random. (arXiv:2203.02424v2 [cs.LG] UPDATED)
    The inception of the Relational Graph Convolutional Network (R-GCN) marked a milestone in the Semantic Web domain as a widely cited method that generalises end-to-end hierarchical representation learning to Knowledge Graphs (KGs). R-GCNs generate representations for nodes of interest by repeatedly aggregating parameterised, relation-specific transformations of their neighbours. However, in this paper, we argue that the R-GCN's main contribution lies in this "message passing" paradigm, rather than the learned weights. To this end, we introduce the "Random Relational Graph Convolutional Network" (RR-GCN), which leaves all parameters untrained and thus constructs node embeddings by aggregating randomly transformed random representations from neighbours, i.e., with no learned parameters. We empirically show that RR-GCNs can compete with fully trained R-GCNs in both node classification and link prediction settings.  ( 2 min )
    Belief propagation for permutations, rankings, and partial orders. (arXiv:2110.00513v2 [cs.AI] UPDATED)
    Many datasets give partial information about an ordering or ranking by indicating which team won a game, which item a user prefers, or who infected whom. We define a continuous spin system whose Gibbs distribution is the posterior distribution on permutations, given a probabilistic model of these interactions. Using the cavity method we derive a belief propagation algorithm that computes the marginal distribution of each node's position. In addition, the Bethe free energy lets us approximate the number of linear extensions of a partial order and perform model selection between competing probabilistic models, such as the Bradley-Terry-Luce model of noisy comparisons and its cousins.  ( 2 min )
    What Makes A Good Fisherman? Linear Regression under Self-Selection Bias. (arXiv:2205.03246v1 [math.ST])
    In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.  ( 2 min )
    Optimally tackling covariate shift in RKHS-based nonparametric regression. (arXiv:2205.02986v1 [math.ST])
    We study the covariate shift problem in the context of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We focus on two natural families of covariate shift problems defined using the likelihood ratios between the source and target distributions. When the likelihood ratios are uniformly bounded, we prove that the kernel ridge regression (KRR) estimator with a carefully chosen regularization parameter is minimax rate-optimal (up to a log factor) for a large family of RKHSs with regular kernel eigenvalues. Interestingly, KRR does not require full knowledge of the likelihood ratio apart from an upper bound on it. In striking contrast to the standard statistical setting without covariate shift, we also demonstrate that a naïve estimator, which minimizes the empirical risk over the function class, is strictly suboptimal under covariate shift as compared to KRR. We then address the larger class of covariate shift problems where the likelihood ratio is possibly unbounded yet has a finite second moment. Here, we show via careful simulations that KRR fails to attain the optimal rate. Instead, we propose a reweighted KRR estimator that weights samples based on a careful truncation of the likelihood ratios. Again, we are able to show that this estimator is minimax optimal, up to logarithmic factors.  ( 2 min )
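    A hedged reading of the reweighting step (notation ours, not necessarily the paper's): writing $\rho(x)$ for the likelihood ratio between the target and source covariate distributions and $\tau$ for a truncation level, the reweighted KRR estimator takes the form $\hat{f} = \operatorname{arg\,min}_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \min\{\rho(x_i), \tau\}\,(f(x_i) - y_i)^2 + \lambda \Vert f \Vert_{\mathcal{H}}^2$, so that samples with extreme likelihood ratios cannot dominate the empirical risk.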
    Lagrangian PINNs: A causality-conforming solution to failure modes of physics-informed neural networks. (arXiv:2205.02902v1 [cs.LG])
    Physics-informed neural networks (PINNs) leverage neural networks to find the solutions of partial differential equation (PDE)-constrained optimization problems with initial and boundary conditions as soft constraints. These soft constraints are often considered to be the source of the complexity of the PINN training phase. Here, we demonstrate that the challenge of training (i) persists even when the boundary conditions are strictly enforced, and (ii) is closely related to the Kolmogorov n-width associated with problems exhibiting transport, convection, traveling waves, or moving fronts. Given this realization, we describe the mechanism underlying training schemes such as those used in eXtended PINNs (XPINN), curriculum regularization, and sequence-to-sequence learning. For an important category of PDEs, i.e., those governed by the non-linear convection-diffusion equation, we propose reformulating PINNs in a Lagrangian frame of reference, i.e., LPINNs, as a PDE-informed solution. A parallel architecture with two branches is proposed: one branch solves for the state variables on the characteristics, and the second branch solves for the low-dimensional characteristic curves. The proposed architecture conforms to the causality innate to the convection and leverages the direction of travel of the information in the domain. Finally, we demonstrate that the loss landscapes of LPINNs are less sensitive to the so-called "complexity" of the problems, compared to those of traditional PINNs in the Eulerian framework.  ( 2 min )
    Explainable multi-class anomaly detection on functional data. (arXiv:2205.02935v1 [stat.ML])
    In this paper, we describe an approach for anomaly detection and its explainability in multivariate functional data. The anomaly detection procedure consists of transforming the series into a vector of features and using an Isolation Forest algorithm. The explanation procedure is based on the computation of SHAP coefficients and on the use of a supervised decision tree. We apply it to simulated data to measure the performance of our method, and to real data coming from industry.  ( 2 min )
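    A rough sketch of the detection half of that pipeline, assuming the functional series have already been reduced to feature vectors, and assuming shap's TreeExplainer accepts the fitted forest (it supports sklearn tree ensembles; the data here is a placeholder):

    ```python
    # Isolation Forest anomaly detection + SHAP-based explanation (sketch).
    import numpy as np
    import shap
    from sklearn.ensemble import IsolationForest

    X = np.random.default_rng(0).normal(size=(500, 8))  # placeholder features
    forest = IsolationForest(random_state=0).fit(X)
    labels = forest.predict(X)              # -1 = anomaly, 1 = normal
    explainer = shap.TreeExplainer(forest)  # per-feature score contributions
    shap_values = explainer.shap_values(X)
    ```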

  • Open

    Hello, are there any suggestions for scalable multi-agent RL environments (I want to be able to change the number of agents in the env)? Thank you.
    submitted by /u/Ok_Lab_2750 [link] [comments]  ( 1 min )
    "BARL: An Experimental Design Perspective on Model-Based Reinforcement Learning" (on Mehta et al 2021)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    dialogue history in dialogue generation
    What is the dialogue history in dialogue generation, as used in section 4.3 of the paper titled "Deep Reinforcement Learning for Dialogue Generation"? https://arxiv.org/abs/1606.01541 submitted by /u/Western-Age3148 [link] [comments]  ( 1 min )
    I’ve published a new repo with a collection of Deep RL algorithms implemented with PyTorch
    submitted by /u/Top_Serve_2348 [link] [comments]
    Will training in multi-agent reinforcement learning converge? Assume there are two agents: "A gets stronger, B learns from its errors, B gets stronger, A learns from its errors, and so on". Will this happen?
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    reinforcement learning in finance and trading
    How is it different from time series analysis? And since there is an enormous amount of data to analyze, is it worth using RL in this field? If so, please link a GitHub repository containing RL papers on finance and trading. submitted by /u/Western-Age3148 [link] [comments]  ( 1 min )
    Question about Vectorized Environments and GPU/Cuda training
    I have a bunch of questions that I can't seem to find good answers to, so I might as well ask them here. I'm learning stable-baselines3 by creating a snake game. Pretty simple. I've enabled CUDA training so that I can use my 1080 Ti. When I vectorize my custom environment, I do see some efficiency gains in training, but not many (using the PPO model). For 1,000,000 timesteps of training: n_envs = 4 --> 20 minutes; n_envs = 12 --> 16 minutes; n_envs = 100 --> 13 minutes. The time efficiency does not scale linearly with the number of environments running concurrently, so I'm guessing that either I'm doing something wrong or the efficiency gains aren't that impressive in general. The 1080 Ti has somewhere around 3,000 CUDA cores. Does that mean I can run 3,000 concurrent environments? Also, after 1 million timesteps the snake AI is still pretty dumb. Is this normal, or should I see relatively good results after 1 million timesteps, and if I don't, is that an indication that my reward system or something else isn't good? Thanks! submitted by /u/HaikuHaiku [link] [comments]  ( 1 min )
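    Two background facts bear on the scaling question: stable-baselines3's vectorized environments step on the CPU (the GPU only accelerates the policy network), and CUDA cores are arithmetic units, not per-environment workers, so useful n_envs is bounded by CPU cores and PPO's synchronous rollout collection rather than by the 3,000-core figure. A minimal sketch of the setup, with the environment name as a placeholder (a registered custom env works the same way):

    ```python
    # Minimal stable-baselines3 PPO setup with vectorized environments.
    # "CartPole-v1" stands in for the custom snake environment.
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    env = make_vec_env("CartPole-v1", n_envs=12)  # envs step on the CPU
    model = PPO("MlpPolicy", env, device="cuda")  # only the network uses the GPU
    model.learn(total_timesteps=1_000_000)
    ```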
  • Open

    Top 10 Leading Universities in AI Research
    Deep learning, natural language processing, data analytics, and big-data mining are fields of Artificial Intelligence (AI), and many companies are looking for professionals in these fields. A professional degree in AI from a reputed university will help you get started in this industry. Read more submitted by /u/ridamughal110 [link] [comments]
    Is AI cost-effective?
    Artificial Intelligence (AI) has the potential to disrupt the global economy by bringing significant changes to the industry. Corporations are recognizing AI's competitive edge, and as a result, a growing number of companies are implementing AI technology and reaping significant benefits. Read more submitted by /u/ridamughal110 [link] [comments]
    Overview of over 160 Biases (Belief, decision-making & behavioral, Social, Memory)
    Cognitive biases are a serious issue, especially for people who deal with data (and algorithms). I've compiled a list (PDF/EPUB) of over 160 biases (mainly from Wikipedia); maybe this is useful for some. These biases affect belief formation, reasoning processes, business & economic decisions, and human behavior in general. Let's learn more about our human biases to draw less biased conclusions in the future. The PDF/EPUB can be downloaded for free on Leanpub: Cognitive Biases: A Brief Overview of Over 160 Cognitive Biases submitted by /u/Philo167 [link] [comments]  ( 1 min )
    Poker AI Plays Itself, Match 3
    submitted by /u/bluboxsw [link] [comments]
    Alibaba Introduces ‘FederatedScope’: An Easy-To-Use Federated Learning Platform Providing Comprehensive Functionalities
    Federated learning is a machine learning technique that trains a model over several dispersed nodes or hosts, as the name suggests. Each node utilizes its own training data. If the model parameters are shared between nodes rather than the raw data, the data can be kept private. Due to privacy concerns, obtaining training data to design and evolve machine learning models is increasingly being questioned, and federated learning can help alleviate some of these issues. The Chinese e-commerce behemoth, Alibaba, has created a federated learning platform that allows machine learning algorithms to be constructed without sharing training data. Continue Reading Github: https://github.com/alibaba/FederatedScope submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Uniting human brains and computers: A new type of AI
    submitted by /u/bendee983 [link] [comments]
    The ML pipeline & Types of roles in ML
    submitted by /u/mr-minion [link] [comments]
    Presentation of Project: The Ark Evolving
    I am a programmer of intermediate knowledge, and this is a project being developed by me alone. It is based on creating and running an artificial intelligence that plays the Magic Arena card game in an advanced way; the AI will use supervised learning together with the minimax algorithm. At the moment I only have the concept, which is why I am making this post: to ask for any advice and recommendations you can give me. submitted by /u/Merelex_Buk-12 [link] [comments]  ( 1 min )
    Iterative to Launch Open Source Tool, First to Train Machine Learning Models on Any Cloud Using HashiCorp’s Terraform Solution
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    Apple loses Ian Goodfellow, director of machine learning, inventor of GANs, over return-to-office policy
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 1 min )
    Watch this model explain, debug, document + generate test cases and answer questions about your code!
    submitted by /u/landongarrison [link] [comments]  ( 1 min )
  • Open

    [D] Inequality of the paper [On the uncertainty principle of neural networks]
    I'm reading the paper [On the uncertainty principle of neural networks](https://arxiv.org/abs/2205.01493), and I doubt one of its inequalities. Between equations (7) and (8), there is a step that states "If we use the property..." (screenshot: https://preview.redd.it/tz6k5htt0cy81.png?width=774&format=png&auto=webp&s=99fc8b5701e00fdf253e2f52eb2fde240772fe5e). Is the yellow inequality true? I think the inequality should go the opposite way, due to the arithmetic-geometric mean inequality. Thanks. submitted by /u/jryoungw2035 [link] [comments]  ( 1 min )
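    For reference, the arithmetic-geometric mean inequality the poster appeals to: for any real $a, b$, $ab \le \frac{a^2 + b^2}{2}$, with equality if and only if $a = b$; it follows directly from expanding $(a - b)^2 \ge 0$.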
    [News] This “amateur” programmer once used 50 sheets of 1080Ti to fight cancer
    submitted by /u/coolwulf [link] [comments]  ( 1 min )
    [P] [python] Trading Bot! Our team is looking for 2 new team members! We've already done a lot! Info in text :)
    Hello, our team has four members based in Europe, 27-39 years old. We are building a trading bot from scratch based on a strategy related to momentum and fractal geometry. I manually traded that strategy for 3 or more years and made it unique. The live trader (the bot) is already built, but it requires more development. The backtester is 90% complete. Results are promising, and that is what keeps us pushing the project. All team members share the source code as our reward. We have decided to acquire 1 or 2 more team members. Please feel free to approach me. Thanks! submitted by /u/Prior_Regret_4138 [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 137
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight; otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding, and please don't post things which are present in the wiki. Preferably you should link the arXiv page (not the PDF; you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks: 1-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100, 101-110, 111-120, 121-130, 131-140. Most upvoted papers two weeks ago: /u/CatalyzeX_code_bot: Paper link. Besides that, there are no rules, have fun. submitted by /u/ML_WAYR_bot [link] [comments]  ( 1 min )
    [D] Is WGAN-GP gradient penalty applicable to the generator?
    I am currently studying a set of papers on GAN-driven video-to-speech synthesis, and one of them in particular has got me scratching my head in confusion. The study describes an architecture based on WGAN-GP, a modification of WGAN which aims to enforce the 1-Lipschitz constraint by means of a gradient penalty. Specifically, the original paper presents an algorithm in which said penalty is applied to the critic, keeping the generator's loss function relatively simple. In this paper, however, the specified loss functions show the penalty applied to the generator instead, with the critic's loss function being identical to that of a standard WGAN. They do, however, state right below that "the coefficient for the gradient penalty λ is kept at the value of 10 for both critics..." (screenshot: https://preview.redd.it/gdp35hihsay81.png?width=793&format=png&auto=webp&s=7eedf0103ee3b33afe991148ac5a7552d262f35f). So the question is: which is it? Should I consider this a typo of some sort, or am I not understanding something in this paper? I can understand how keeping the function's gradient norm below 1 may help when applied to the critic, but what benefit could it have for the generator? Are the other loss functions used in the paper somehow relevant, perhaps? submitted by /u/ShujiMikami [link] [comments]  ( 1 min )
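    For readers comparing the two papers, here is a minimal sketch of the standard critic-side gradient penalty from the original WGAN-GP paper (Gulrajani et al., 2017); the helper and its assumptions (4-D image batches, a callable `critic` module) are ours.

    ```python
    # WGAN-GP gradient penalty, applied to the critic as in the original paper.
    import torch

    def gradient_penalty(critic, real, fake, lam=10.0):
        # random interpolation between real and fake samples (image batches)
        eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
        interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
        scores = critic(interp)
        grads, = torch.autograd.grad(
            outputs=scores, inputs=interp,
            grad_outputs=torch.ones_like(scores), create_graph=True)
        # penalize deviation of the gradient norm from 1 (soft Lipschitz constraint)
        return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    ```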
    [D] Looking for a python library that implements decision tree regressors handling categorical features
    Hello community, I am a bit baffled that scikit-learn does not support this. I am looking for a good Python library that enables fitting a decision tree regressor on both numerical and categorical features (a non-binary tree). Could you point me to one if you know any, please? Thanks! (PS: This is for visualization and interpretability, so things like CatBoost won't do.) EDIT: I think I may have found what I was looking for: chefboost (unfortunately it is a bit simplistic and does not support pruning atm). Also, xMattC3 pointed to this ongoing PR for scikit-learn: NOCATS. submitted by /u/yannbouteiller [link] [comments]  ( 2 min )
    [P] I’ve been trying to understand the limits of some of the available machine learning models out there. Built an app that lets you try a mix of CLIP from OpenAI + Apple’s version of MobileNet, and more, directly on your phone's camera roll.
    submitted by /u/Playgroundai [link] [comments]  ( 1 min )
    [N] We're sharing details on new methods to more efficiently train Vision Transformers, a model architecture that can achieve state-of-the-art results on a wide range of computer vision tasks.
    submitted by /u/AbjectDrink3276 [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [P] Comparison of NLP Sentiment Analysis ML models - VADER vs roBERTa
    submitted by /u/robikscuber [link] [comments]
    [N] Deepmind's Flamingo combines speech and vision
    submitted by /u/much_successes [link] [comments]
    [D] Dynamic Time Warping to align sequence to part of a longer sequence
    I have a dataset of audio files, and I wish to create a search functionality in the audio space: given a short fragment containing only one spoken word, I want to find the exact location of that word in a longer recording containing multiple words. Let's say I want to avoid audio transcription for some reason. So I want to find the exact location within the audio without knowing which word is used, purely based on the waveform. I could try doing this with a sliding window, but I thought Dynamic Time Warping (DTW) would be an interesting alternative. However, DTW requires that the start and end points of both sequences are the same. This isn't the case here, as the query sequence is a subset of the longer sequence (or may not even be present in the longer sequence). Also, finding the start and end points in the longer sequence is kind of the goal, so it's not useful as a prior. Are there any algorithms similar to DTW or any approaches I should know about? submitted by /u/RaptorDotCpp [link] [comments]  ( 3 min )
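    One standard answer is subsequence DTW, which leaves the start and end of the match in the long sequence free: the first row of the accumulated-cost matrix carries no accumulated penalty, and the best end point is the argmin of the last row. A minimal NumPy sketch; extracting feature frames from the waveform (e.g. MFCCs) is assumed to be done already.

    ```python
    # Subsequence DTW: query (m, d) may match anywhere inside long (n, d).
    import numpy as np

    def subsequence_dtw(query, long):
        cost = np.linalg.norm(query[:, None] - long[None, :], axis=2)  # (m, n)
        D = np.full_like(cost, np.inf)
        D[0] = cost[0]                     # free start anywhere in `long`
        for i in range(1, len(query)):
            for j in range(len(long)):
                best_prev = D[i - 1, j]
                if j > 0:
                    best_prev = min(best_prev, D[i, j - 1], D[i - 1, j - 1])
                D[i, j] = cost[i, j] + best_prev
        end = int(np.argmin(D[-1]))        # best match ends at this frame
        return D[-1, end], end             # backtrack from `end` for the start
    ```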
    [D] AWS sagemaker vs similar tool which can run on any cloud
    Hi all, I've been a user of AWS SageMaker for quite some time. Now we have some other cloud credits, and I want a multi-cloud solution that can help me run training jobs. I'm planning this approach: (1) have some multi-cloud object storage and save all my data there (will MinIO help?); (2) use Terraform to create configuration files that create nodes on any cloud (how difficult will this be?); (3) write a simple configuration file that takes a cloud provider, instance types, a training Docker image, and a storage path, and runs a training job. Is this implemented already? I am already exploring some serverless training options. Any thoughts on this? Anyone with the same problem who would like to contribute? submitted by /u/scb_11 [link] [comments]  ( 1 min )
    [P] Optimizing ML Workloads with TPI (Terraform Provider Iterative) - One Provider to Rule AWS, Azure, GCP, K8s
    The guide introduces Terraform Provider Iterative (TPI), an open-source tool extending the functionality of Terraform. The tool enables full lifecycle management of computing resources and is designed specifically for ML pipelines: Machine Learning Workloads with Terraform Provider Iterative. It was designed for machine learning (ML/AI) teams and optimizes CPU/GPU expenses. TPI unifies auto-scaling groups for all the major cloud providers (AWS, Azure, GCP, and Kubernetes); provides spot-instance auto-recovery (if an instance was evicted or terminated) with data and checkpoint synchronization; auto-terminates instances when ML training is finished, so you won't forget to terminate your expensive GPU instance and leave it running for a week :); and uses standard Terraform commands and config (HCL). submitted by /u/cmstrump [link] [comments]  ( 1 min )
    [D] Tips for reviewing at top A* conferences
    I'll be reviewing papers at a top conference for the first time this year, so I was wondering if you could share some tips on how to be a good reviewer and how you typically review a submission at any of the top A* conferences. submitted by /u/furious_madman_123 [link] [comments]  ( 1 min )
    [N] Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
    submitted by /u/hardmaru [link] [comments]  ( 4 min )
    [D] How much time do you spend transforming data before training a model?
    I am looking for straightforward ways, if they exist, to check for data transformation opportunities for different datasets. submitted by /u/sphinx00777 [link] [comments]
  • Open

    Need help in understanding role of activation functions.
    This question is being asked here after a discussion I had with a professor of mine. As far as I understood it, activation functions are used to make neural networks non-linear, but he believes they activate and deactivate individual nodes of a NN. I can't get my head around how an activation function "activates" a neuron without using a threshold function. Taking the case of a neuron with a sigmoid activation, the output will never be 0; it is minuscule when x is a large negative number, but still not 0. I tried to find the answer online but found sites with different opinions. submitted by /u/Pritesh190801 [link] [comments]  ( 2 min )
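    The poster's observation checks out numerically; a quick sanity check:

    ```python
    # A sigmoid output approaches 0 for large negative inputs but never
    # reaches it exactly (until floating-point underflow).
    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    print(sigmoid(-10))   # ~4.5e-05: tiny, but not zero
    print(sigmoid(-100))  # ~3.7e-44: still not exactly zero
    ```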
    The ML pipeline & Types of roles in ML
    submitted by /u/mr-minion [link] [comments]
    Defeating Amazon Rekognition's face detection
    Amazon's IT department, also known as Amazon Web Services, or AWS, developed a computer vision platform called Rekognition. https://aws.amazon.com/rekognition/ Rekognition is a cloud-based software-as-a-service that does image and video analysis. This system does a lot; however, my focus is on face recognition of people (not only users of the system, but anyone). AWS provides cloud services, which means they have a practically infinite amount of space on their servers. There are claims, which I couldn't confirm (so far), that Amazon gets data about people not only from social media sources but also from sources that are officially not public. If the data is saved on AWS cloud services, Amazon practically has this data. Amazon takes pride in a long list of customers https://aws.amazon.com/rekognition/cu…  ( 2 min )
  • Open

    Network security guidelines for 2022
    Cyber security has become a buzzword in a tech-savvy era in which the entire world hinges on digital platforms, and it has become very critical for individuals to be cyber aware. People are thereby developing security tips to help individuals manage their digital presence. Whether for personal use or professional undertakings, cyber security has become a significant… Read More »Network security guidelines for 2022 The post Network security guidelines for 2022 appeared first on Data Science Central.  ( 3 min )
    Understanding EXIF Data and How to View It on Android Phones
    Most modern-day smartphones with great cameras make it really easy for anyone and everyone to click good photos. But what most people don't realize is that with each photograph, they're also capturing a lot of personal information which, when they share that photo on social media, becomes available to a much wider set of people on the… Read More »Understanding EXIF Data and How to View It on Android Phones The post Understanding EXIF Data and How to View It on Android Phones appeared first on Data Science Central.  ( 3 min )
    AI In Manufacturing: Know How Latest Intelligence Reshaping the Industries with Speed and Accuracy
    Over the last few years, the industrial revolution has been the most significant transformation the industrial sector has ever faced. It encompasses all the latest technology trends affecting industries across the world. Autonomous cars, smart connected devices, sensors, computer chips, and many other technologies represent this transformation. This happens because the manufacturing industry has been… Read More »AI In Manufacturing: Know How Latest Intelligence Reshaping the Industries with Speed and Accuracy The post AI In Manufacturing: Know How Latest Intelligence Reshaping the Industries with Speed and Accuracy appeared first on Data Science Central.  ( 3 min )
    Social Media, Cyber Bullying, and Need For Content Moderation
    A rise in netizens and social media platforms, coupled with the growth of the mobile internet, has resulted in a jump in the creation and consumption of User Generated Content (UGC). Social media platforms, in all honesty, have become a major channel for disseminating, circulating, and exchanging information to billions of people on the internet… Read More »Social Media, Cyber Bullying, and Need For Content Moderation The post Social Media, Cyber Bullying, and Need For Content Moderation appeared first on Data Science Central.  ( 6 min )
    Comparing Cloud Telephony With On-Premise Call Centers: An Upgrade
    Contact center dynamics are changing as well as the customer engagement landscape. Due to the growing customer expectations and time constraints, cloud call center solutions are more important than ever before. With technology, businesses and agents can offer omnichannel customer service, and customers can handle most queries themselves. In order to achieve high omnichannel engagement,… Read More »Comparing Cloud Telephony With On-Premise Call Centers: An Upgrade The post Comparing Cloud Telephony With On-Premise Call Centers: An Upgrade appeared first on Data Science Central.  ( 4 min )
    The long game: Desiloed systems and feedback loops by design (I of II)
    Data management (DM) discussions can be frustrating because both those feeling the pain and the consultants who try to help them are–90+ percent of the time, it seems–still using the same old ways. Those ways only go so far, and won’t go any farther. That’s because those who reinforce the old ways assume that what… Read More »The long game: Desiloed systems and feedback loops by design (I of II) The post The long game: Desiloed systems and feedback loops by design (I of II) appeared first on Data Science Central.  ( 5 min )
    The Graph of Thrones: the Now Graph and Eternal Graph in RDF-Star Modeling
    Warning: This is going to get heavy into Turtle code, but I think there’s enough here for it to be worth reading if you are involved in knowledge graph work. I’ve been working with knowledge graphs a lot lately, and a conversation that I had with a few other ontologists has been resonating in my… Read More »The Graph of Thrones: the Now Graph and Eternal Graph in RDF-Star Modeling The post The Graph of Thrones: the Now Graph and Eternal Graph in RDF-Star Modeling appeared first on Data Science Central.  ( 9 min )
  • Open

    Driving Experimentation Forward through a Working Group (Experimentation Program Series: Guide 03)
    In my previous post, I defined an experimentation program (ExPr) as the mechanism by which a company uses randomized controlled experiments to generate positive business results. An ExPr is composed of the people, processes, and infrastructure for running experiments at… Read More The post Driving Experimentation Forward through a Working Group (Experimentation Program Series: Guide 03) appeared first on ML in Production.  ( 7 min )

  • Open

    Any interesting “less obvious” real-world applications of RL
    Reinforcement learning has immediate applications in industrial robotics and other control-oriented tasks. Are there any interesting real-world applications of RL that are less obvious than robotics or trading? I saw one application in the crypto space and am curious about other possible applications of RL (in any sector). submitted by /u/blitzkreig3 [link] [comments]  ( 1 min )
    No module named 'gym.envs.classic_control.rendering'. How do I fix it?
    I ran into it when running retro.examples.random_agent. submitted by /u/sirThomasmacAbyss [link] [comments]  ( 1 min )
    Reasonable training result, but how to improve further?
    Hi all, I have a 4-DOF robot, and I am trying to teach it this specific movement: "Whenever you move, don't move joint 1 (orange in the plot) at the same time as joints 2, 3, and 4". The corresponding reward function is reward = 1 / (abs(torque_q1) * max(abs(torque_q2), abs(torque_q3), abs(torque_q4))). As the plot shows, the learned policy somewhat reproduces the intended movement: first the q1 movement, then the other joints. But the part I want to improve is around t = 13: there, q1 gradually decreases while the other joints gradually start to move. Is there a way to improve this so that q1 comes to a complete stop before the other joints start to move? (Plot: https://preview.redd.it/do8gaqyhm3y81.png?width=2892&format=png&auto=webp&s=2bbc13523e9043de7a6316f3edd2e2741e910cd0) submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
    Is it possible to train a reinforcement learning agent using a dataset (in order to make use of human expertise in a game, for example) to reinforce the training (like a combination of reinforcement learning and supervised learning)?
    submitted by /u/Ok_Lab_2750 [link] [comments]  ( 2 min )
    text summarization using RL
    Is it worth working on text summarization even though big tech companies are already working on this problem and have already achieved good results? Can someone suggest a topic in the NLP domain that can be explored using RL techniques? submitted by /u/Western-Age3148 [link] [comments]  ( 1 min )
  • Open

    [P] Markov chain with unequal sequence lengths
    I'm trying to build a simple Markov chain. I have data from therapy notes, where the therapist selects the overall topic of the session out of a list of 27 possible topics. The problem is that not every therapist tags their session topic consistently, and all clients have a different number of sessions. I'm trying to build a simple Markov chain to get probabilities of topic transitions between sessions, but as you can see, the data are complicated. I was wondering if anyone has encountered a situation with uneven sequence lengths per observation/case, and how you went about building a Markov chain in that case? Thanks! submitted by /u/sebelly [link] [comments]  ( 1 min )
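    For what it's worth, unequal sequence lengths are not an obstacle for a first-order Markov chain: each client simply contributes however many transitions their sequence contains, and the counts are pooled before normalizing each row. A sketch with hypothetical topic indices:

    ```python
    # Estimate a transition matrix from variable-length topic sequences.
    import numpy as np

    def transition_matrix(sequences, n_topics=27):
        counts = np.zeros((n_topics, n_topics))
        for seq in sequences:                  # each client adds len(seq)-1 pairs
            for a, b in zip(seq[:-1], seq[1:]):
                counts[a, b] += 1
        rows = counts.sum(axis=1, keepdims=True)
        # rows with no observed transitions stay all-zero instead of dividing by 0
        return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

    P = transition_matrix([[0, 3, 3, 1], [3, 1], [2, 2, 2, 0, 5]])  # toy data
    ```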
    WACV 2023? [R]
    I don't see a website or any info for WACV 2023. Do you think it will still be hosted, and what would be the expected submission deadline for round 1? submitted by /u/avd4292 [link] [comments]
    [D] The state of reverse summarization/highly conditioned text gen
    Hi all, I’m looking to do a bit of research into highly conditioned text generation, specifically reverse summarization tasks. This type of task basically entails the reverse of what models like Pegasus aim to accomplish; instead of summarizing long text, the goal is to generate the long text from a summary. I’m curious if there’s been any work in this before or if there is an existing state of the art? It’s impossible to search for this online since summarization is so popular that everything I can find just entails going from long text to summary. I’d especially be curious about attempts at this task on conversational data (e.g. the reverse of SAMSum) or news articles (the reverse of CNN/DailyMail). submitted by /u/woodworksio [link] [comments]  ( 1 min )
    Good dataset for lasso regression [P]
    Hello everyone, I'm currently writing a thesis on an accelerated proximal point algorithm, and I need to demonstrate the efficiency of my accelerated algorithm on a lasso regression. The thing is, I need a dataset with a lot of features (to show the efficiency of the lasso), say between 1000 and 2000. Do you know where I can download such a dataset for lasso regression? That would be a huge help. Thank you!!! submitted by /u/Jeffrosslostson [link] [comments]  ( 1 min )
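    If no suitable real dataset turns up, one option is a synthetic sparse regression problem; the parameters below are illustrative, not a recommendation:

    ```python
    # Synthetic high-dimensional sparse problem: 1500 features, 50 informative,
    # a regime where the lasso's sparsity-inducing penalty shines.
    from sklearn.datasets import make_regression

    X, y, true_coef = make_regression(
        n_samples=300, n_features=1500, n_informative=50,
        noise=5.0, coef=True, random_state=0)
    ```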
    [D] An End-to-End MLOps Platform Implementation using Open-source Tooling
    submitted by /u/ponderinghydrogen [link] [comments]
    [R] Fix the Noise: Disentangling Source Feature for Transfer Learning of StyleGAN (CVPRW 2022)
    submitted by /u/Cautious-Ad1373 [link] [comments]
  • Open

    AI News | Breast Cancer Detection Beats Humans | AI Learns Concepts Across Video, Audio & Text | AI Music Hit Song Detection
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
  • Open

    Mentally calculating the day of the week
    In my previous post I mentioned John Conway’s Doomsday rule for calculating the day of the week for any date. This method starts off very simple, but gets more complicated when you actually use it. This post will present an alternative method that’s easier to use in practice and can be described more succinctly. Here’s […] Mentally calculating the day of the week first appeared on John D. Cook.  ( 3 min )
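    For practicing any such method, Python's datetime module gives ground truth to check mental answers against (the example dates are arbitrary):

    ```python
    # Verify mental day-of-week calculations against the calendar.
    from datetime import date

    print(date(2022, 5, 9).strftime("%A"))   # Monday
    print(date(1776, 7, 4).strftime("%A"))   # Thursday
    ```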
  • Open

    Determine whether an image, from a given sequence of images, has changed considerably
    Problem Statement: You have a sequence of images of a room that contains four important components: a sink, piping, walls, and a fire alarm. An image is taken every week. That means you have an initial image taken in week 0 and an image I taken in week I. The room is undergoing construction, and each week you have to determine whether the room has undergone sufficient change. This process must continue until the completion of work on the room. There is no restriction on the perspective from which the image is taken. So, for example, the image from week A could be taken while standing in the right doorway, while the image from week A+1 could be taken while standing in the left doorway. You have to determine whether the important components (sink, piping, walls, …  ( 2 min )
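    One naive baseline (not from the post) that tolerates viewpoint changes: count matched local features between consecutive weeks' photos and flag a sharp drop as substantial change. The paths and threshold below are placeholders, and both images are assumed to contain detectable features.

    ```python
    # ORB keypoint matching as a crude viewpoint-tolerant change signal.
    import cv2

    def match_count(path_a, path_b):
        img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
        img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
        orb = cv2.ORB_create()
        _, des_a = orb.detectAndCompute(img_a, None)
        _, des_b = orb.detectAndCompute(img_b, None)
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        return len(matcher.match(des_a, des_b))

    changed = match_count("week_0.jpg", "week_1.jpg") < 50  # threshold is a guess
    ```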
2022-06-06T00:58:52.391Z osmosfeed 1.14.4